Telephonic voice non-recognition or voice deactivation systems
You may think this is a joke. I don't, as you'll soon see!
Kafkaesque customer service: just one example
Typical of most "intelligent" (=stupid) systems.
I wanted to find out how to change my credit card details with my mobile phone service provider (Fido/Rogers). This is a transcription of the phone call I made to them around 11 a.m. on Wednesday 5 October 2005. You can also check the call as an audio recording and in video form.
In this tabular transcript, all the Fido voices are pre-recorded. My words or actions are in green and dead time is in orange.
| Time | Who | Statement or action | Comment |
|---|---|---|---|
| 00:00 | Me | Dial Fido | |
| 00:05 | Fido | "Bienvenu(e) à Fido. For service in English, press '1', now." (pressed) | Request contradicted at 03:07 |
| 00:09 | Fido | "Welcome to Fido. Please enter the ten digits of your Fido..." | Step repeated at 01:52 |
| 00:12 | Me | Fido phone number entered | |
| 00:21 | Fido | (= nothing happens, nothing is said) | Waiting: ½ second music, then silence; chaotic impression |
| 00:23 | Fido | "Please note that when using this system you can either tell me what you want to do or you can dial the information using your phone key pad, like when you want to enter your phone number, for example. Using your keypad is suggested if I'm having trouble understanding you." | Very low volume, difficult to apprehend. Very wordy, contains tautology ("like ... for example"). Keypad subsequently unusable in all instances except input of phone number. |
| 00:36 | Fido | | Waiting ... |
| 00:38 | Fido | "Please hold while we access your account information." | Very loud indeed (if this is 0 dB, the previous message was at -30 dB), chaotic impression |
| 00:42 | Fido | | Waiting ... |
| 00:52 | Fido | "with your prepaid service", then nothing | Very low volume again, technologically erratic; and what does it mean? |
| 00:58 | Fido | "As you use this service, I'm going to ask you some questions. Whenever you know the answer you can just interrupt me. You can still use your handset keypad for entering things like phone numbers or access codes." | This is very wordy and treats me like a primary school kid. It is also untrue because it demands "yes" or "no" answers (which it doesn't understand, see 01:09, 01:13, 01:17, 01:28) and doesn't allow keypad input. |
| 01:05 | Me | "I bloody well hope so." | Muttered very quietly |
| 01:06 | Fido | | Waiting again... |
| 01:09 | Fido | "I didn't get that. Please say 'yes' or 'no'." | Contradicts messages at 00:58 and at 01:17 |
| 01:13 | Me | "No!" | Loud and clear, as instructed at 01:09 |
| 01:14 | Fido | | Waiting for response... |
| 01:16 | Me | "Jesus Christ!" | Muttered under breath in quiet desperation |
| 01:17 | Fido | "Sorry, I still didn't understand." | Contradicts message at 01:09 and response at 01:13 |
| 01:19 | Me | "Well, you're stupid." | Normal voice |
| 01:20 | Fido | | Waiting... |
| 01:23 | Fido | "Alright, then. Here's what I can help you with. You can say 'account information'..." | Not what I want to hear after another 3 seconds waiting for an intelligent response |
| 01:28 | Me | "No" | As instructed at 01:09; no indication to the contrary: I "can" say "account information"; I don't have to. |
| 01:29 | Fido | | Waiting... |
| 01:32 | Fido | "I didn't get that." | Clueless: see 01:09, 01:13, 01:17, 01:28 |
| 01:34 | Me | "Well, get a brain." | Normal voice, irritation growing |
| 01:36 | Fido | | Still waiting for an intelligible response |
| 01:38 | Fido | "Sorry, I still didn't understand." | Incompetent! |
| 01:42 | Me | "Oh, get a brain, woman." | Normal voice, irritation turning to despondency |
| 01:43 | Fido | | Still waiting for a sensible response |
| 01:45 | Fido | "To help me route your call to the right person, please say or enter your ten-digit Fido number." | I already did this at 00:09 |
| 01:52 | Me | "Done that already." Mobile number entered again | Waste of time |
| 02:01 | Fido | | Still waiting for something productive to happen |
| 02:03 | Fido | "I'm attempting to transfer your call. Thanks for being patient." | What a joke! A machine thanks me for being patient?... |
| 02:07 | Fido | | Absolutely nothing for 64 seconds! At 02:55 I mutter "not even hold music this time." |
| 03:07 | Fido | "Pour raisons d'assurance de qualité et d'information il se peut cet appel soit mis sur écoute ou enregistré." | Extremely low volume, very difficult to decipher. What happened to my choice of English? (00:05) |
| 03:14 | Me | "Mais, j'avais choisi l'anglais." | Muttered very quietly |
| 03:16 | Fido | | Low-volume hold music |
| 03:34 | Fido | "The current wait time is five to ten minutes." | It has taken 3 minutes and 34 seconds to find this out! |
It took 3 minutes and 34 seconds to find out that I would have to wait another five to ten minutes before being able to speak to a human being. In fact I had to wait another twelve minutes. My question was only partly answered sixteen minutes after picking up the phone.
As the tabulated phone call transcript shows, dead time (waiting for a response or action from Fido) accounts for 57% of the call's total duration, a small but significant part of which is occupied by the voice recognition system attempting to match what I say against the impoverished range of syllables it has been programmed to deal with. Otherwise, entering digits on the keypad accounts for about 10%, listening to Fido's preprogrammed messages 28% and my own statements or muttering the remaining 5%. Including the duplicated phone number input, my statements and actions occupy 15% and Fido's 85% of the time wasted before learning that I had to wait another ten minutes. Who's wasting whose time?
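As a rough cross-check of those shares, here is a minimal sketch that converts the quoted percentages into seconds. It assumes the percentages refer to the 214 seconds covered by the table (00:00 to 03:34), which the text implies but does not state outright; the category labels and the helper `seconds_for` are my own, purely for illustration.

```python
# Shares quoted in the text; assumed to apply to the 214 seconds in the table.
TOTAL_SECONDS = 3 * 60 + 34  # 00:00 to 03:34

shares_percent = {
    "dead time": 57,
    "keypad entry": 10,
    "Fido's recorded messages": 28,
    "my statements or muttering": 5,
}

# The four categories should account for the whole call segment.
assert sum(shares_percent.values()) == 100


def seconds_for(percent, total=TOTAL_SECONDS):
    """Approximate number of seconds corresponding to a percentage share."""
    return round(total * percent / 100)


# Caller-side time: keypad entry plus speech. Fido-side time: everything else.
caller_share = shares_percent["keypad entry"] + shares_percent["my statements or muttering"]
fido_share = shares_percent["dead time"] + shares_percent["Fido's recorded messages"]

for label, percent in shares_percent.items():
    print(f"{label}: {percent}% is about {seconds_for(percent)} s")
print(f"caller: {caller_share}%, Fido: {fido_share}%")
```

On that assumption, the 57% of dead time comes to roughly two of the first three and a half minutes.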
Apart from the personal inconvenience and irritation of being kept waiting, it is worth calculating what kind of effect a system like this has on the economy. For example, before the voice non-recognition system was introduced I could top up my credit card payment to Fido within about 60 seconds. To carry out the same task using the voice non-recognition system takes at least three minutes. Two minutes extra per month means 24 extra minutes per year. If the company has only 10,000 users, that means the loss of 4,000 man hours per year. Would Fido, with its silly cuddly dog logo, like it if the company's unionised employees were to demand the right to an extra 4,000 man hours of paid absence (for illness, holidays, etc.) or if they had to waste that amount of time on the phone?
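The arithmetic behind the 4,000 hours can be spelled out in a few lines. This is only a sketch of the estimate above: one top-up call per user per month and the figure of 10,000 users are the text's own illustrative assumptions, and the function name is hypothetical.

```python
def extra_hours_per_year(extra_minutes_per_call, calls_per_year, users):
    """Total extra time spent by all users in a year, in hours."""
    return extra_minutes_per_call * calls_per_year * users / 60


# Figures from the text: about 60 s before, at least 180 s with the new system.
old_seconds = 60
new_seconds = 180
extra_minutes = (new_seconds - old_seconds) / 60  # 2 extra minutes per call

# Assumed: one top-up call per month per user, 10,000 users.
hours = extra_hours_per_year(extra_minutes, calls_per_year=12, users=10_000)
print(f"{hours:.0f} extra hours per year")
```

Since three minutes is given as a minimum, the result is a lower bound under these assumptions.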
According to Fido, nearly 90% of calls made using their voice non-recognition system do actually get through. They hasten to add that the survey did not investigate whether customers were satisfied with the system. After all, I got a partial answer after sixteen minutes (of which the first 214 seconds are shown in the table above) and belong to that 90%, so what's the problem, apart from wasting our time?
The most serious technical limitations are the system's: [1] inability to distinguish speech from background noise (it lacks the cocktail party effect); [2] inability to decipher variants of English other than those it has been programmed to deal with; [3] paucity and rigidity of possible answers to particular questions and the inability to discern threads in conversation.
No cocktail party skills. Humans can pick up and follow statements from fellow humans not only against background noise but also when several other conversations are going on at the same time. By way of contrast, Fido representatives advise customers to avoid any background noise (even fans and ventilation noise) if their speech is to be recognised by the system. Forget conversations going on in the background, traffic on the street, etc! What is the point of such a crude system? For example, if I'm downtown and discover there's only one prepaid dollar left in my account, I currently cannot top it up without either seeking out a very quiet place (in downtown Montréal?!) or paying to log on at an internet café, if I can find one.
Linguistic ethnocentricity. English, as first or second language, is spoken more extensively worldwide than even Mandarin Chinese. Variants of spoken English are therefore innumerable. For example, in a multicultural city like Montréal, second-language English speakers preferring the English rather than the French voice non-recognition system can have, as their first language, Arabic, Hindi, Urdu, Bengali, Tagalog, Greek, Russian, Polish, Portuguese, Italian, Cantonese, Mandarin, etc. The system even has problems with my (usually clear and correct) British English! Yet we all have to use a system which forces us to speak with an accent that is not our own.
Paucity and rigidity. The system only accepts a very limited number of stock words and statements, to be enunciated in standard North American English (see above), as valid triggers for action on its part, and only in response to one particular prompt at a time. Synonyms, explanatory or polite turns of phrase and any other socially normal conversation strategies are right out of the question. Nor can the system be expected to follow the thread of the most simple dialogue, as can be seen from the fact that it wouldn't take "no" for an answer when a prompt five seconds earlier had instructed me to answer "yes" or "no". The system doesn't even have the memory of a goldfish.
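To make the point about rigidity concrete, here is a toy sketch of prompt-bound matching on text. It is emphatically not Fido's implementation (which works on audio and whose internals I do not know); the prompt names, phrase lists and both functions are hypothetical, invented only to contrast exact stock-phrase matching with a slightly more forgiving alternative.

```python
import string

# Hypothetical prompts and the stock phrases each one accepts (assumed values).
ACCEPTED = {
    "yes_no": {"yes", "no"},
    "main_menu": {"account information"},
}


def rigid_match(prompt, utterance):
    """Accept only an exact stock phrase for the current prompt, else None."""
    phrase = utterance.strip().lower()
    return phrase if phrase in ACCEPTED[prompt] else None


def tolerant_match(prompt, utterance):
    """Ignore punctuation and accept a stock phrase anywhere in the utterance."""
    cleaned = utterance.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    for phrase in sorted(ACCEPTED[prompt]):
        target = phrase.split()
        n = len(target)
        # Slide a window of n words over the utterance looking for the phrase.
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            return phrase
    return None


print(rigid_match("yes_no", "no"))                  # exact stock phrase
print(rigid_match("yes_no", "No, thank you!"))      # polite phrasing is rejected
print(rigid_match("main_menu", "no"))               # valid word, wrong prompt
print(tolerant_match("yes_no", "No, thank you!"))   # stock word found in context
```

Even the tolerant version knows nothing about synonyms or the thread of the conversation; it only shows how little it takes to stop rejecting a polite "no".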
It is technically naïve to expect voice activation systems to deal with even the three basic points just mentioned, which constitute the simplest examples of normal language behaviour among humans in any culture. So why not just accept that you're dealing with a stupid machine and do what it wants? The reason is a technical naïvety that is deaf and blind to basic realities of socio-linguistic behaviour.
The pre-recorded voice speaks to you using the normal speech parameters of diction, intonation, inflexion, accentuation, rhythm and timing. It also presents you with short but complete sentences. You, on the other hand, must not respond using complete sentences or any diction, intonation, inflexion, accentuation, rhythm or timing that might confuse the system. While the machine presents the sound of a human voice (some companies even give the machine a name and an ego: how sad is that!), you, the human, must respond to it like a machine. In short, the robot is humanised and you, the human, are robotised. It (machine) pretends to be what you are (human) and you (human) must pretend to be what it is (machine). This absurd perversion of everyday habits of conversation makes communication virtually impossible. If you do not respond like a machine to the machine that you are dealing with, despite its human pretences ("I", "me", normal speech patterns, etc.), you will not pierce its stupidity and gain access to the information or service you require.
It is, in other words, a matter of social power, because the system can say whatever it likes to you in whatever way it chooses while you can only say what it accepts, no more, no less. It is by such mechanisms that authoritarianism and fascism work. If you get people used to behaving like machines in any way possible, they'll be more likely, as the automata they've been trained to become, to obey all orders and follow all policies without question. I object strongly, on ethical grounds, to any attempt to demean humans in this kind of way. I consistently tell Fido that I refuse to speak to a machine pretending to be human, and I tell them why. They think I'm a trouble-maker. Wrong! They're causing the trouble and asking for more of it by forcing people to become machines. People are human beings and should be treated as humans, not robots, and machines should not pretend to be human.
According to normal socio-linguistic conventions in normal everyday circumstances, if someone speaking to you adopts a certain tone, and if that tone is acceptable to you, you will probably respond using a tone similar or acceptable to that of your interlocutor. Therefore, when a recorded voice, as in a voice activation system, addresses you using normal intonation, inflexion, rhythm and sentence construction, you are more likely than not to respond in the same way. For this reason, systems whose recorded messages require a standardised vocal response should not sound human: robotic voices are much more likely than human-sounding messages to elicit the kind of robotic response required. Other solutions are to revert to keypad input, to employ more real humans to answer queries and provide services, to rethink the provision of services in terms of what customers really want, or to subscribe to more phone lines and allow people more direct access to the particular services they require (see last paragraph, below).
It took me sixteen minutes to find out just part of what I needed to know from Fido, with their expensive voice non-recognition system. Compare that with the 28 seconds it took me to phone up and find the time of the next 161 bus going east from the stop at the corner of my street (that call was recorded too). That's 34 times faster than Fido to access highly specialised information! Even though there's just one phone number for information covering the whole of the Montréal public transport system, there's no fuss: no menu, no voice recognition, no pseudo-human machine fetishes, no farting about at all. The volume is also reasonably constant and every statement totally comprehensible. What a relief! Thank you, STM! Fido should learn from you. Capitalism sucks. Public services rule!