Sunday, May 30, 2010

Un-rest in Peas: The Unrecognized Life of Speech Recognition (or “Why we do not have HAL 9000 yet”)

I read Robert Fortner’s blog post on the death of speech recognition as soon as it came out, about a month ago. For several reasons, not least my eternal laziness and tendency to procrastinate, it took me a while to decide to craft a response. But with a few hours made available by an inspired, jet-lagged insomnia, I decided to go ahead and write about it.

I have to admit that speech recognition research, and the use of it, has occupied more than half of my life. So I am personally, mentally, and sentimentally attached to speech recognition. But at the same time I am frustrated, disappointed, and at times disillusioned. Speech recognition is a great technology, and I made a career out of it, but yes … it does make mistakes.

What I felt when I read Fortner’s post was no surprise. We may all feel that speech recognition has not kept its promises. Even those of us who have been working on it and with it for decades sometimes feel that sense of failure of an unrealized dream. But writing eulogies for an undead person is not fair. First of all, speech recognition (the undead one) is not, and does not want to be, what laypeople think it is. If you think that speech recognition technology, after 50 years or so of research, would bring us HAL 9000, you are right to think it is dead. But that type of speech recognition was never alive, except in the dreams of science-fiction writers. If you think that that’s the speech recognition we should have strived for, yes … that’s dead. But I would say that that dream was never alive in any reasonable way for most speech scientists, as they call those of us geeks who have dedicated time and cycles to making computers recognize speech. We all knew that we would never see a HAL 9000 in our lifetimes.

Saying that speech recognition is dead because its accuracy falls far short of HAL-like levels of comprehension is like saying that aeronautical engineering is dead because commercial airplanes cannot go faster than 1,000 miles per hour, and by the way … they cannot get people to the moon. Similarly we can say that medicine is dead because we cannot always cure cancer, or that computer science is dead because my PC gets jammed and I have to reboot it now and then. There are limitations in every one of our technologies, but the major limitations we perceive are the result of our false assumptions about what the goals are, our wrong use of the technology, and the wrong promises spread by the pseudoscientific press and media. Speech recognition is not about building HAL 9000. Speech recognition is about building tools and, like all tools, it may be imperfect. Our job is to find a good use for an imperfect, often crummy, tool that can sometimes make our lives easier.

Robert Fortner’s blog post captures some truths about speech recognition, but it is more like a collection of data picked up here and there and taken out of context. I would probably make the same mistakes if I read a few papers on genetic research and tried to write a critique of the field. For instance, among many other things, it is not accurate to say that the accuracy of speech recognition flat-lined in 2001 before reaching human levels. It is true that “some” funding plugs were pulled, mainly the DARPA funds for interactive speech recognition projects mostly devoted to dialog systems. But 2001 was a year of dramatic changes for many things. 9/11 turned the attention of the funding agencies to tasks more urgent than talking to computers to make flight reservations. The funding plug on speech recognition technology was not pulled, but the goals were changed. For instance, DARPA itself started quite a large project called GALE (as in Global Autonomous Language Exploitation), one of whose goals was to interpret huge volumes of speech and text in multiple languages. And of course, the main purpose of that was homeland security. The amount of audio information available today (Web, broadcasts, recorded calls, etc.) is huge even compared with the amount of text. Without speech recognition there is no way we can search through it, so all that potentially useful information is virtually unavailable to us. And that is made worse by the fact that the vast majority of the audio available around us is not in English, but in many other languages for which a human translator may not be at hand when and where we want one. Now, even an imperfect tool such as speech recognition, combined with an imperfect machine translation, can still give us a handle to search and interpret vast amounts of raw audio.
An imperfect transcription followed by an imperfect translation can still give us a hint of what the audio is about, and maybe help us select a few audio samples to be listened to or translated by a professional human translator. That’s better than nothing, and even a small imperfect help can be better than no help at all. GALE and similar projects around the world gave rise to new goals for speech recognition and to a wealth of research and studies, which is reflected in the rising number of papers at the major international conferences and journals. Large conferences like ICASSP and Interspeech, like the dozens of specialized workshops, attract thousands of researchers from around the world every year. And the number is not declining. Other non-traditional uses of speech recognition, and of speech technology in general, have emerged: to cite a few, emotion detection, or even the recognition of deception through speech analysis, which proves to be more accurate than humans (apparently computers can detect liars from their speech better than parole officers, who manage a measly 42%...).
The number of scientific papers on speech recognition (and not just HAL-like speech recognition) has risen continuously since scientists and technologists started to look at it. The following chart shows the number of speech recognition papers (dubbed ASR, as in Automatic Speech Recognition) as a fraction of the total number of papers presented at the ICASSP conference from 1978 to 2006:
Kindly made available by Prof. Sadaoki Furui, from the Tokyo Institute of Technology
The number kept growing after 2006, and other conferences show similar figures. So speech recognition is not dead.

Another inaccuracy in Fortner’s blog post, one that particularly touched me, is the mention of the famous sentence “Every time I fire a linguist my system improves,” said (and personally confirmed) by one of the fathers of speech recognition, then the head of speech recognition research at IBM. The meaning of that famous, or infamous, sentence is not a conscious rejection of the deeper dimensions of language. Au contraire. It is the realization that classic linguistic research, based on rules and models derived from linguists’ introspection, can take you only so far. Beyond that you need data. Large amounts of data. And you cannot deal with large amounts of data by creating more and more rules in the manner of a scholarly amanuensis; you need some powerful tool that can extract information from larger and larger amounts of data without an expert linguist having to look at it bit by bit. And that tool is statistics. In the long run statistics proved to be so powerful that most linguists became statisticians. The preponderance of statistics in linguistics today is immense. It is enough to go to a conference like ACL (the annual conference of the Association for Computational Linguistics) to see the number of topics that used to be approached by traditional linguistics and are now the realm of statistics, mathematics, and machine learning. We would not have Web search if we approached it with traditional linguistic methods. Web search makes mistakes, and yet it is useful. We would not have the Web if we did not have statistically based Web search. We would not have speech recognition, machine translation, and many other language technologies, with all their limitations, if we had not abandoned the traditional linguistic way and embraced the statistical linguistic way. And by the way … that famous speech recognition scientist (whose name is Fred Jelinek, for the record) gave a talk in 2004 entitled “Some of my best friends are linguists.”

Let’s talk now about speech recognition accuracy. The most frustrating perception of speech recognition accuracy—or the lack of it—is when we interact with its commercial realizations: dictation and what people of the trade call IVR, or Interactive Voice Response.

Half of the people who commented on Fortner’s blog are happy with automated dictation and have been using it for years. For them the dictation tool was well worth the little time spent learning how to use it and training it. It is also true that many people tried dictation and it did not work for them. But most likely they were not motivated to use it. If you have a physical disability, or if you need to dictate thousands of words every day, most likely speech recognition dictation will work for you. Or better, you will learn how to make it work. Again, speech recognition is a tool. If you use a tool you need to be motivated to use it, and to learn how to use it. And I repeat it here … this is the main concept, the takeaway. Speech recognition is a tool built by engineers, not an attempt to replicate human intelligence. Tools should be used when needed. If we have a little patience and use them for the goal they were designed for, they can help us.

And now let’s get to IVR systems, those you talk to on the phone when you would rather talk to a human; without doubt they are the most pervasive, apparent, and often annoying manifestation of speech recognition and its limitations. They are perceived so badly that even Saturday Night Live makes fun of them. There is even a Web site which regularly publishes a cheat sheet to get around them and reach a human operator right away. But are they so bad? After all, thanks to speech recognition, hundreds of thousands of people can get up-to-the-minute flight information right away by calling the major airlines, make flight and train reservations, get information about their bank accounts, and even receive a call 24 hours before their flight and check in automatically, even if they are not connected to the Internet. Without speech recognition that would require hundreds of thousands of live operators, an unaffordable cost for the companies providing the service, and tens of minutes waiting in queue listening to hold music for their customers. Yes, these systems sometimes make irritating mistakes. But they turn out to be useful for most people, hundreds of thousands of them. And here I go again with my leitmotif: they are tools, and tools can be useful when used properly and when truly needed.

In 2001 recognition accuracy topped out at 80%, far short of HAL-like levels of comprehension. What does recognition accuracy mean? Like all measures, it means nothing outside a context. Recognition accuracy is measured in different ways, but in most cases it measures how many words the recognizer gets wrong as a percentage of all the words recognized. But that depends on the context. In IVR systems, speech recognizers can get very specialized. The more they hear, the better they get. I work for a company called SpeechCycle. We build sophisticated speech recognition systems that help the customers of our customers (typically service providers) get support and often solve problems. Using statistics and lots of data, we have seen the accuracy of our interactive speech recognition systems grow continuously over time, getting better and better as the speech recognizer learned from the experience of thousands of callers. I am sure other companies that build similar systems, our competitors, would claim a similar trend (although I am tempted to say that we do a little better than them …). As Jeff Foley, a former colleague of mine, said in his beautifully articulated answer to Fortner’s blog post, “[…] any discussion of speech recognition is useless without defining the task […].” And I especially like Jeff’s hammer analogy: This is like saying that the accuracy of a hammer is only 33%, since it was able to pound a nail but failed miserably at fastening screws and stapling papers together.
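To make the accuracy measure described above concrete, here is a minimal Python sketch of the word error rate (WER), the conventional way of counting the words a recognizer gets wrong. It is an illustration rather than any vendor's scoring tool: the function name is made up for this example, and it assumes the usual convention of dividing the number of substitutions, deletions, and insertions by the number of words in a human reference transcript.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between reference and hypothesis,
    divided by the number of reference words (assumed convention)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words seen so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missed
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("open the pod bay doors", "open a pod bay door")
    print(f"WER: {wer:.0%}, word accuracy: {1 - wer:.0%}")
```

The same recognizer can score very differently on a yes/no prompt and on open dictation, which is why a single accuracy figure says little without the task attached to it.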
In specialized tasks speech recognition can get well above the 80% accuracy mentioned in Fortner’s blog post, which refers to a particular context, that of a very large vocabulary open dictation task. By automatically learning from data acquired during the lifecycle of a deployed speech recognizer, you can get to the 90s on average and to the high 90s in specially constrained tasks (if you are technically inclined you can see, for instance, one of our recent papers). With that you can build useful applications. Yes, you get mistakes now and then, but they can be gracefully handled by a well-designed Voice User Interface. In particular, recognition of digits and of command vocabularies, like yes and no, can today go beyond 99%, which amounts to one error every few thousand entries; and that error can be automatically corrected if other information is taken into account, for instance the checksum on your credit card number. Another thing to take into consideration is that most errors happen not because speech recognition sucks, but because people, especially occasional users, do not use speech recognition systems for what they were designed for. Even if they are explicitly and kindly asked to respond with a simple yes or no, they go and say something else, as if they were talking to an omnipotent human operator. It is as if you went to an ATM, entered the amount you want to withdraw when you were asked for your PIN … and then complained because the machine did not understand you! I repeat it again: speech recognition is a tool, not HAL 9000, and as such users should use it for what it is designed for and follow the instructions provided by the prompts; it won’t work well otherwise.
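As a rough illustration of how a checksum can catch a misrecognized digit, the Python sketch below picks the first hypothesis that passes the Luhn check used on payment card numbers. It rests on assumptions that are not in the post: that the recognizer returns a ranked list of alternative digit strings (an "n-best" list), and that the helper names here are purely hypothetical.

```python
from typing import Optional, Sequence


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9  # same as summing the two digits of the product
        total += value
    return total % 10 == 0


def pick_valid_hypothesis(n_best: Sequence[str]) -> Optional[str]:
    """Return the highest-ranked hypothesis that passes the checksum.

    n_best is assumed to be ordered from most to least likely.
    Returns None if nothing passes, in which case a dialog would
    typically ask the caller to repeat the number.
    """
    for hypothesis in n_best:
        if luhn_valid(hypothesis):
            return hypothesis
    return None


if __name__ == "__main__":
    # The top hypothesis has one wrong digit; the second one is consistent.
    alternatives = ["4111111111111112", "4111111111111111"]
    print(pick_valid_hypothesis(alternatives))
```

The design choice is simply to let information outside the audio veto an otherwise plausible recognition result, so a rare digit error never reaches the caller.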

And finally I would like to mention the recent surge of speech recognition applications for voice search and control on smartphones (Google, Vlingo, Bing, Promptu, and counting). We do that at SpeechCycle too, in our customer care view of the world. This is a new area of application of speech recognition which is likely to succeed because, in certain situations, speaking to a smartphone is certainly more convenient than typing on it. The co-evolution of vendors who will constantly improve their products, and motivated users who will constantly learn how to use voice search and cope with its idiosyncrasies (as we do all the time with PCs, word processors, blog editing interfaces, and every imperfect tool we use to accomplish a purpose), will make speech recognition a transparent technology. A technology we use every day and are no longer aware of, like the telephone, the mouse, the Internet, and everything else that makes our lives easier.


Mirabai Knight said...

Beautifully written, well-reasoned, and accurate. I'm going to point people to this blog post when they ask me what speech recognition can and cannot do. As I said in my "Voice versus CART" article, "Computerized transcription is a fantastic technology, and it's particularly useful for those who can speak easily but find typing difficult." It's done an enormous amount of good for many people with disabilities, and I never want to devalue that. On the other hand, it's not a good substitution for human-powered realtime transcription, and, as you said, people continue to expect a HAL-like level of comprehension from computers that just isn't realistic, leading them to use SR technology in applications for which it isn't suited. Thank you for explaining both its value and its limitations so reasonably and cogently.

Jan van Santen said...

Roberto -- well put. I may also add that a new space of ASR applications is growing in the biomedical field, again by defining tasks that are exactly right given ASR's current capabilities.

Emmett Coin said...

Thanks Roberto for the informed rebuttal to the "Hal" blog post.

Part of the problem with the perception of ASR performance is that pure ASR technology is different from what most people think it is. Most people, even many speech industry people, think it is hearing the sounds and transcribing the words. But, for humans, recognizing perfectly is not an isolated task from understanding. And understanding an utterance is not independent of the context of a dialog. In our brains they are all interconnected in both forward and backward directions. In a human the words you think you recognize improve the probabilities of what you can understand AND what you think you can understand improves the probabilities of what you can recognize. And when we increase the scope to a conversation then we change the probabilities of where the conversation is going and feed that back to adjust the probabilities of the range of things we will try to understand and that changes the pool of words we will try to recognize.

Human-Computer conversation will always fall short if we don't give the computer the same skills. You don't have to look at the interconnections of the cerebral cortex and note that it is wired that way. You can just think about previous conversations when you misunderstood something (word, meaning, or direction) and figured it out a second later by re-parsing the speech (from memory) in a new context. Or, a couple seconds later understood a previous statement after you "got the point" of the conversation ... and then later you "recognized" that word that confused you.

I believe that ASR is about as good as it is going to get using "solely acoustic" input.

SLMs help a bit, but they are only a non-dynamic, one-time, predetermined feedback to the acoustic (ASR) layer. Ideally a new, adapted SLM should be regenerated after every utterance.

Even a good system like Dragon, which is great for generic business-like prose, is nearly useless for creative writing and is doomed for writing poetry. It doesn't "get it". People have a hope of "getting it" because they integrate all the layers.

I think we need to start working on the "whole" problem before we see a "next level" of speech recognition.

We need to look at the sloppy (beautifully organic?) way that people learn language. How 6 month old babies learn to identify and cluster phonemes without any comprehension of what it "means"? Why a five year old can carry on a rich, engaging conversation but will be tormented and fail if asked to analyze (or even recognize) the grammatical structure they use?

RE lots of data:
Do children experience even a tiny fraction of the speech corpus we use for present day ASR? They typically have only a handful of subjects (family, friends) to sample before they are quite good at understanding/generating speech (not millions of subjects). They can also learn a strange new accent/manner (old uncle Fritz from Germany or Elmer Fudd in the cartoons or even Elmer Fudd with a German accent) very quickly (in minutes or less). Clearly we are still missing a huge part of how humans do it.

ps These expectations are all Kubrick's fault for making HAL the most human character in the movie!

Ahlir said...

Great post, Roberto!

I also read Fortner’s blog post at the time it came out, but I stopped taking it seriously once I saw his misleading rendition of the NIST evaluations graph. Fortner pulls out a single line, Switchboard, and claims that it represents progress in the entire field. He deliberately deletes those domains that achieved error rates below 5% and he deliberately deletes more recent (better) results on comparably difficult domains such as BN. The simple truth is that Switchboard stalled because sponsors lost interest in that domain. Progress continued along other paths.
All things considered, Fortner is rather lazy and, in my opinion, a rather glib charlatan; too bad he's managed to gather a gullible audience around himself.

It's been a while and no doubt my memory has grown hazy, but I seem to recall being present when Fred uttered his immortal words. I can still hear them... "each time I fire a linguist and hire a statistician, my error rate goes down". The occasion was a DARPA PI meeting in San Diego (I forget the program, but I do vividly remember it as a particularly wretched meeting). Of course, with a mot that bon, I wouldn't be surprised if by that time it was already part of his standard repertoire.

Roberto is correct in saying that current ASR is a great engineering solution to many important problems. The problem is that people continue to confuse speech and language, which is why you hear them complain that ASR is not HAL. HAL did language, not (just) ASR. (True) language is inseparable from human intelligence, and we're obviously not there yet. If you're not happy about this state of affairs, go blame those (clueless, alas) AI people.

-- Alex Rudnicky

gawi said...

I agree that most of the bad reputation of speech recognition comes from IVR:

1- Users are not using it willingly. A certain percentage of people simply refuse to use it. This is where the "it's just a tool" argument stops. Because people do not have the choice to use it rather than talk to a person they are entitled to have a high level of expectation.

2- They feel they are not treated with due respect. They don't feel that they are valued customers.

3- The user experience is bad when IVR applications are not optimized (grammars, confidence processing, dialog, prompts). Note that speech synthesis can only make all this worse.

4- Some users are not tech-savvy and therefore not comfortable with the idea of talking to a machine.

The sad part is that designing/developing/deploying a speech-enabled IVR application requires a lot of effort ($) and expertise.

NOTE: HAL was way more impressive for its speech synthesis or artificial intelligence skills (or lip-reading capacity!) than its speech rec ones:

User: "Open the pod bay doors please HAL"
System: (noinput detected)
User: "Open the pod bay doors please HAL" [retry]
System: (noinput detected)
User: "Hello HAL do you read me?"
System: (noinput detected)
User: "Hello HAL do you read me?" [retry]
System: (noinput detected)
User: "Do you read me, HAL?" [rewording attempt]
System: (noinput detected)
User: "Do you read me, HAL?" [retry]
System: (noinput detected)
User: "Hello HAL do you read me?" [back to previous wording]
System: (noinput detected)
User: "Hello HAL do you read me?" [retry]
System: (noinput detected)
User: "Do you read me HAL?" [rewording, again]
System: "Affirmative Dave, I read you." [finally!]
User: "Open the pod bay doors HAL?"
System: "I'm sorry Dave, I'm afraid I can't do that." [nomatch... argh!]

And all this in a perfect deep-space noise-less environment...

See for yourself

Jon said...

Nice work, Roberto. You have said everything I would have said with only one exception. Why would we want HAL-like performance? Who wants to be murdered in space?

kurtgodden said...

Small point: The DARPA funding for speech was not halted in 2001. I was working on the DARPA Communicator program in 2001, all of 2002, and part of 2003. Not sure when it ended, but it was certainly after March 2003.

Steve Hubbard said...

In the court reporting and captioning markets, the future appears to be bright, but that is not true. Court reporting schools are closing. Historically, only 5% of the students in any court reporting program ever devised graduate as general court reporters to work in the courts as stenographers, and a smaller number ever make it to live real-time closed captioning for TV networks. Also, the people who got into closed captioning as real-time live steno captioners 30 years ago will all be retiring in the next few years. What TV networks are going to have to deal with is the sudden disappearance of qualified captioners, with no one to keep those captioners' chairs warm in the broadcasting booths. This means ASR will necessarily have to become the de facto real-time multi-speaker recognition technology for live captioning, perhaps with a lower quality standard for a while; but as ever higher quality chips are made and more artificial intelligence is added, ASR will eventually deliver, with consistency, more than just acceptable quality, meeting the minimum standards legislated by national governments.
