You may think this is a joke. I don't... !
The stupidity of ‘voice recognition’ and ‘voice activation’
Kafkaesque customer service just one example
Typical of most "intelligent" (=stupid) systems.
When I lived in Montréal, I needed to find out how to change my credit card details with my mobile phone service provider (Fido/Rogers). This is a reliable transcription of the actual phone call I made to them on the morning of Wednesday 5 October, 2005.
You can also check the call as an audio recording and in video form.
In this tabular transcript, all the Fido voices are pre-recorded. My words or actions are in green and dead time is in orange.
| Time | Who | Statement or action | Comment |
|---|---|---|---|
| 00:00 | Me | Dial Fido | |
| 00:05 | Fido | "Bienvenu(e) à Fido. For service in English, press '1', now." (pressed) | Request contradicted at 03:07 |
| 00:09 | Fido | "Welcome to Fido. Please enter the ten digits of your Fido..." | Step repeated at 01:52 |
| 00:12 | Me | Fido phone number entered | |
| 00:21 | Fido | (= nothing happens, nothing is said) | Waiting: ½ second music, then silence; chaotic impression |
| 00:23 | Fido | "Please note that when using this system you can either tell me what you want to do or you can dial the information using your phone key pad, like when you want to enter your phone number, for example. Using your keypad is suggested if I'm having trouble understanding you." | Very low volume, difficult to apprehend. Very wordy, contains tautology ("like ... for example"). Keypad subsequently unusable in all instances except input of phone number. |
| 00:36 | Fido | | Waiting ... |
| 00:38 | Fido | "Please hold while we access your account information." | Very loud indeed (if 0 dB, previous at -30 dB), chaotic impression |
| 00:42 | Fido | | Waiting ... |
| 00:52 | Fido | "with your prepaid service", then nothing | Very low volume again, technologically erratic; and what does it mean? |
| 00:58 | Fido | "As you use this service, I'm going to ask you some questions. Whenever you know the answer you can just interrupt me. You can still use your handset keypad for entering things like phone numbers or access codes." | This is very wordy and treats me like a primary school kid. It is also untrue because it demands "yes" or "no" answers (which it doesn't understand, see 01:09, 01:13, 01:17, 01:28) and doesn't allow keypad input. |
| 01:05 | Me | "I bloody well hope so." | Muttered very quietly |
| 01:06 | Fido | | Waiting again... |
| 01:09 | Fido | "I didn't get that. Please say 'yes' or 'no'." | Contradicts messages at 00:58 and at 01:17 |
| 01:13 | Me | "No!" | Loud and clear, as instructed at 01:09 |
| 01:14 | Fido | | Waiting for response... |
| 01:16 | Me | "Jesus Christ!" | Muttered under breath in quiet desperation |
| 01:17 | Fido | "Sorry, I still didn't understand." | Contradicts message at 01:09 and response at 01:13 |
| 01:19 | Me | "Well, you're stupid." | Normal voice |
| 01:20 | Fido | | Waiting... |
| 01:23 | Fido | "Alright, then. Here's what I can help you with. You can say 'account information'..." | Not what I want to hear after another 3 seconds waiting for an intelligent response |
| 01:28 | Me | "No" | As instructed at 01:09; no indication to the contrary: I "can" say "account information"; I don't have to. |
| 01:29 | Fido | | Waiting... |
| 01:32 | Fido | "I didn't get that." | Clueless: see 01:09, 01:13, 01:17, 01:28 |
| 01:34 | Me | "Well, get a brain." | Normal voice, irritation growing |
| 01:36 | Fido | | Still waiting for an intelligible response |
| 01:38 | Fido | "Sorry, I still didn't understand." | Incompetent! |
| 01:42 | Me | "Oh, get a brain, woman." | Normal voice, irritation turning to despondency |
| 01:43 | Fido | | Still waiting for a sensible response |
| 01:45 | Fido | "To help me route your call to the right person, please say or enter your ten-digit Fido number." | I already did this at 00:09 |
| 01:52 | Me | "Done that already." Mobile number entered again | Waste of time |
| 02:01 | Fido | | Still waiting for something productive to happen |
| 02:03 | Fido | "I'm attempting to transfer your call. Thanks for being patient." | What a joke! A machine thanks me for being patient?... |
| 02:07 | Fido | | Absolutely nothing for 64 seconds! At 02:55 I mutter "not even hold music this time." |
| 03:07 | Fido | "Pour raisons d'assurance de qualité et d'information il se peut cet appel soit mis sur écoute ou enregistré." | Extremely low volume, very difficult to decipher. (In English: "For quality assurance and information purposes, this call may be monitored or recorded.") What happened to my choice of English? (00:05) |
| 03:14 | Me | "Mais, j'avais choisi l'anglais." ("But I had chosen English.") | Muttered very quietly |
| 03:16 | Fido | | Low-volume hold music |
| 03:34 | Fido | "The current wait time is five to ten minutes." | It’s taken 3 minutes 34 seconds to find this out! |

It took 3 minutes and 34 seconds to find out that I would have to wait another five to ten minutes before being able to speak to a human being. In fact I had to wait another twelve minutes. My question was only partly answered sixteen minutes after picking up the phone.

As the tabulated phone call transcript shows, dead time (waiting for a response or action from Fido) accounts for 57% of the call's total duration. A small but significant part of that is taken up by the voice recognition system trying to match what I say against the impoverished range of syllables it has been programmed to deal with. Otherwise, entering digits on the keypad accounts for about 10%, listening to Fido's preprogrammed messages 28% and my own statements or muttering the remaining 5%. Including the duplicated phone number input, my statements and actions occupy 15% and Fido's 85% of the time wasted before learning that I had to wait another ten minutes. Who's wasting whose time?

Apart from the personal inconvenience and irritation of being kept waiting, it is worth calculating what kind of effect a system like this has on the economy. For example, before the voice non-recognition system was introduced I could top up my credit card payment to Fido within about 60 seconds. To carry out the same task using the voice non-recognition system takes at least three minutes. Two minutes extra per month means 24 extra minutes per year. If the company has only 10,000 users, that means the loss of 4,000 man hours per year (10,000 × 24 minutes = 240,000 minutes). Would Fido, with its silly cuddly dog logo, like it if the company's unionised employees were to demand the right to an extra 4,000 man hours of paid absence (for illness, holidays, etc.) or if they had to waste that amount of time on the phone?

According to Fido, nearly 90% of calls made using their voice non-recognition system do actually get through. They hasten to add that the survey did not investigate whether customers were satisfied with the system. After all, I got a partial answer after sixteen minutes (of which the first 214 seconds are shown in the table above) and belong to that 90%, so what's the problem, apart from wasting our time?
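
For anyone who wants to check the arithmetic above, here is a minimal Python sketch. The function and variable names are my own, made up purely for illustration; the figures are the ones quoted in the text (two extra minutes per top-up, one top-up a month, 10,000 users, and the percentage shares of the first 214 seconds).

```python
# Figures quoted in the text; all names below are illustrative only.
EXTRA_MINUTES_PER_CALL = 2   # about 3 minutes now instead of about 1
CALLS_PER_YEAR = 12          # one top-up per month
USERS = 10_000               # the deliberately low user count used in the text


def wasted_hours_per_year(extra_minutes, calls_per_year, users):
    """Total customer hours lost per year to the slower system."""
    return extra_minutes * calls_per_year * users / 60


# Shares of the first 214 seconds of the call, in percent, as quoted.
shares = {
    "dead time": 57,
    "keypad entry": 10,
    "Fido's recorded messages": 28,
    "my own statements": 5,
}
assert sum(shares.values()) == 100  # the four shares cover the whole period

# My part is what I typed plus what I said; everything else is Fido's.
my_share = shares["keypad entry"] + shares["my own statements"]
fido_share = 100 - my_share

print(wasted_hours_per_year(EXTRA_MINUTES_PER_CALL, CALLS_PER_YEAR, USERS))  # 4000.0
print(my_share, fido_share)  # 15 85
```

Change `USERS` to whatever you think a realistic customer base is and the wasted hours scale in direct proportion.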
The most serious technical limitations are the system's: [1] inability to distinguish speech from background noise (the cocktail party effect); [2] inability to decipher other variants of English than those it has been programmed to deal with; [3] paucity and rigidity of possible answers to particular questions and the inability to discern threads in conversation.

No cocktail party skills
Humans can pick up and follow statements from fellow humans not only against background noise but also when several other conversations are going on at the same time. By way of contrast, Fido representatives advise customers to avoid any background noise (even fans and ventilation noise) if their speech is to be recognised by the system. Forget conversations going on in the background, traffic on the street, etc! What is the point of such a crude system? For example, if I'm downtown and discover there's only one prepaid dollar left in my account, I currently can't top it up without either seeking out a very quiet place (in downtown Montréal?!) or paying to log on at an internet café, if I can find one.

Linguistic ethnocentricity
English, as first or second language, is spoken more extensively worldwide than even Mandarin Chinese. Variants of spoken English are therefore innumerable. For example, in a multicultural city like Montréal, second-language English speakers preferring the English rather than French voice non-recognition system can have, as their first language, Arabic, Hindi, Urdu, Bengali, Tagalog, Greek, Russian, Polish, Portuguese, Italian, Cantonese, Mandarin, etc. The system even has problems with my (usually clear and correct) British English! Yet we all have to use a system which forces us to speak with an accent that is not our own.

Paucity and rigidity
The system only accepts a very limited number of stock words and statements, to be enunciated in standard North American English (see above), as valid triggers for action on its part, and only in response to one particular prompt at a time. Synonyms, explanatory or polite turns of phrase and any other socially normal conversation strategies are right out of the question. Nor can the system be expected to follow the thread of the most simple dialogue, as can be seen from the fact that it wouldn't take "no" for an answer when a prompt five seconds earlier had instructed me to answer "yes" or "no". The system doesn't even have the memory of a goldfish.

It is technically naïve to expect voice activation systems to deal with even the three basic points just mentioned, which constitute the simplest examples of normal language behaviour among humans in any culture. So why not just accept that you're dealing with a stupid machine and do what it wants? The reason is a technical naïvety that is deaf and blind to basic realities of socio-linguistic behaviour.

The pre-recorded voice speaks to you using the normal speech parameters of diction, intonation, inflexion, accentuation, rhythm and timing. It also presents you with short but complete sentences. You, on the other hand, must not respond using complete sentences or any diction, intonation, inflexion, accentuation, rhythm or timing that might confuse the system. While the machine presents the sound of a human voice (some companies even give the machine a name and an ego: how sad is that!), you, the human, must respond to it like a machine. In short, the robot is humanised and you, the human, are robotised. It (machine) pretends to be what you are (human) and you (human) must pretend to be what it is (machine). This absurd perversion of everyday habits of conversation makes communication virtually impossible.
If you do not respond like a machine to the machine that you are dealing with, despite its human pretences (I, me, normal speech patterns, etc.), you will not pierce its stupidity and gain access to the information or service you require. It is, in other words, a matter of social power because the system can say whatever it likes to you in whatever way it chooses while you can only say what it accepts, no more, no less. It is by such mechanisms that authoritarianism and fascism work. If you get people used to behaving like machines in any way possible they’ll be more likely, as the automata they’ve been trained to become, to obey all orders and follow all policies without question.

I object strongly, on ethical grounds, to any attempt to demean humans in this kind of way. I consistently tell Fido that I refuse to speak to a machine pretending to be human, and I tell them why. They think I’m a trouble-maker. Wrong! They’re causing the trouble and asking for more of it by forcing people to become machines. People are human beings and should be treated as humans, not robots, while machines should not parade as humans. It’s childishly dishonest (and a waste of time) to pretend otherwise.

If you follow the normal socio-linguistic conventions of everyday speech and someone talks with a friendly voice, you’ll probably respond in a similar manner. That’s standard human behaviour. Therefore, when a recorded voice, as in a voice ‘activation’ system, addresses you using normal intonation, inflexion, rhythm and sentence construction, you’ll naturally respond in a similar way. That’s why systems whose recorded messages require a severely restricted range of standardised vocal responses from the customer should not sound human. It should be obvious that a robotic voice is much more likely than a human-sounding voice to elicit the kind of robotic response the system in fact needs to work at all.
One other solution is to revert to keypad input, yet another to employ more real humans to answer queries and to provide services, another to rethink the provision of services in terms of what customers really want, yet another to subscribe to more phone lines and to allow people more direct access to the particular services they require (see last paragraph, below).

It took me sixteen minutes to find out just part of what I needed to know from Fido, with their expensive voice non-recognition system. Compare that with the 28 seconds it took me to phone up and find the time of the next 161 bus going east from the stop at the corner of my street (there is an audio recording of that call, too). That's 34 times faster than Fido to access highly specialised information! Even though there's just one phone number for information covering the whole of the Montréal public transport system, there's no fuss: no menu, no voice recognition, no pseudo-human machine fetishes, no farting about at all. The volume is also reasonably constant and every statement totally comprehensible. What a relief! Thank you, STM! Fido should learn from you. Capitalism sucks. Public services rule!
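
The "34 times faster" figure is easy to verify. A minimal Python sketch follows; the names are mine and purely illustrative, and the two durations are the ones given above (sixteen minutes for Fido, 28 seconds for the STM bus information line).

```python
# Durations quoted in the text; names are illustrative only.
FIDO_SECONDS = 16 * 60   # sixteen minutes to get a partial answer from Fido
STM_SECONDS = 28         # 28 seconds to get the next bus time from the STM


def speed_ratio(slow_seconds, fast_seconds):
    """How many times longer the slow call took than the fast one."""
    return slow_seconds / fast_seconds


print(round(speed_ratio(FIDO_SECONDS, STM_SECONDS)))  # 34
```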
A classic example of corporate phone contact mismanagement.
Here’s why you need to phone the company.
You’re having problems re-registering a paid-for item of software bought from Exasperia. Despite searches among Exasperia’s chaotically posted FAQs, most of which are left unanswered by other users and are misleadingly referred to as ‘Help’, you just can’t find which of those variable names apply to which strings of alphanumeric characters for which purposes in your dealings with the company. There’s no explanation of how those codes and numbers relate to your User-ID, user name, password, access key, customer number, account number, account type, account status, verification details, etc., etc. Do they mean different things by all of those names for keys, codes, numbers, licenses, identities, types, status and passwords? If not, which variable names are alternatives to which others and what functions do they have? It’s not clear. The only way to find out is to phone Exasperia. This is what happens. |
What you do or think is in this font. What you actually say is in this font.
What you hear from Exasperia is in this font. Explanations/comments are in this font.
This call could have been much worse. At around 06:00 you could have pressed 7 for ancillary services and been presented with a submenu saying “You now have another five options to help us better route your call. Please listen carefully as our options may have changed.” Let’s say you choose 5 for “all other enquiries”. You’re transferred to a subsubmenu where you select another option which shunts you round in a circle back to the main menu or to the first submenu. Sound familiar? It’s happened to me several times.

Talking to a human being is what most customers want to do but it’s the last thing the system offers you. You’re just a pain in the corporate arse who has the gall to expect time to be spent attending to problems that are usually of the corporation’s own making. If ‘your call is important to us’ really meant ‘your call is important to us’ you wouldn’t have to wait so long and it wouldn’t be so hard to reach a human being.

However frustrating and cynical this corporate phone treatment may be, never take it out on the human being you might eventually reach if you’re lucky and persistent enough. Just tell your fellow human slaving away in the call centre in Dundee or Delhi how pleased you are that they’re human and ask them to pass on your frustrations to management. If enough of us complain, who knows, it may have some effect one day or another. Until then neither you nor I count. Our time and patience are not an issue unless we insist that they are. Your call is not important to the corporation because you’re really a pain in their arse.

To learn more about these stupidities, watch this short report from Australian TV. Try also this parody on voice ‘recognition’ (“I’m Phil. Did you say ‘Tom Jones’?”).
It’s an exasperating waste of my time as a customer. It’s based on a whole set of misconceptions, on short-term greed and on sloppy thinking. Corporations need to:
|