I read Robert Fortner's blog post on the death of speech recognition as soon as it came out, about a month ago. For several reasons, not least my eternal laziness and tendency to procrastinate, it took me a while to decide to craft a response. But with a few hours made available by a bout of inspired jet-lagged insomnia, I decided to go ahead and write about it.
I have to admit that speech recognition research, and the use of it, has occupied more than half of my life. So I am personally, mentally, and sentimentally attached to speech recognition. But at the same time I am frustrated, disappointed, disillusioned at times. Speech recognition is a great technology, and I made a career out of it, but yes ... it does make mistakes.
What I felt when I read Fortner's post was no surprise. We may all feel that speech recognition did not keep its promises. Even those of us who have been working on it, and with it, for decades sometimes feel that sense of failure, of an unrealized dream. But writing eulogies for an undead person is not fair. First of all, speech recognition, the undead one, is not, and does not want to be, what laypeople think it is. If you think that speech recognition technology, after 50 years or so of research, would bring us HAL 9000, you are right to think it is dead. But that type of speech recognition was never alive, except in the dreams of science-fiction writers. If you think that that's the speech recognition we should have strived for, yes ... that's dead. But I would say that that dream was never alive in any reasonable way for most speech scientists, as they call us geeks who have dedicated time and cycles to making computers recognize speech. We all knew that we would never see a HAL 9000 in our lifetimes.
Saying that speech recognition is dead because its accuracy falls far short of HAL-like levels of comprehension is like saying that aeronautical engineering is dead because commercial airplanes cannot go faster than 1,000 miles per hour, and, by the way ... they cannot get people to the moon. Similarly, we could say that medicine is dead because we cannot always cure cancer, or that computer science is dead because my PC jams and I have to reboot it now and then. There are limitations in every one of our technologies, but the major limitations we perceive are the result of our false assumptions about what the goals are, our wrong use of the technology, and the wrong promises spread by a pseudoscientific press and media. Speech recognition is not about building HAL 9000. Speech recognition is about building tools, and like all tools, it may be imperfect. Our job is to find a good use for an imperfect, often crummy, tool that can sometimes make our lives easier.
Robert Fortner's blog post captures some truths about speech recognition, but it reads more like a collection of data gathered here and there, out of context. I would probably make the same mistakes if I read a few papers on genetic research and tried to write a critique of the field. For instance, among many other things, it is not accurate to say that the accuracy of speech recognition flatlined in 2001 before reaching human levels. It is true that "some" funding plugs were pulled, mainly the DARPA funds on interactive speech recognition projects mostly devoted to dialog systems. But 2001 was a year of dramatic changes for many things. 9/11 turned the attention of the funding agencies to tasks more urgently important than talking to computers to make flight reservations.
The funding plug on speech recognition technology was not pulled; the goals were changed. For instance, DARPA itself started quite a large project, called GALE (as in Global Autonomous Language Exploitation), one of whose goals was to interpret huge volumes of speech and text in multiple languages. And of course, the main purpose of that was homeland security. The amount of audio information available today, on the Web, in broadcasts, in recorded calls, and so on, is huge even compared with the amount of text. Without speech recognition there is no way we can search through it, so all that potentially useful information is virtually unavailable to us. And that is worsened by the fact that the vast majority of the audio around us is not in English but in many other languages, for which a human translator may not be at hand when and where we want one. Now, even an imperfect tool such as speech recognition, coupled with imperfect machine translation, can still give us a handle with which to search and interpret vast amounts of raw audio. An imperfect transcription followed by an imperfect translation can still give us a hint of what a recording is about, and maybe help us select a few audio samples to be listened to, or translated, by a professional human translator. That's better than nothing, and even a small, imperfect help can be better than no help at all. GALE and similar projects around the world gave rise to new goals for speech recognition and to a wealth of research and studies, reflected in the rising number of papers at the major international conferences and in journals. Large conferences like
ICASSP and
Interspeech, along with dozens of specialized workshops, attract thousands of researchers around the world every year, and the number is not declining. Other, non-traditional uses of speech recognition and speech technology in general have emerged, like, to cite a few, emotion detection, or even the detection of deception through speech analysis, which proves to be more accurate than humans are (apparently computers can detect liars from their speech better than parole officers, who do a measly 42% ...).
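To make the earlier "better than nothing" argument concrete, here is a toy simulation, a sketch under entirely made-up assumptions (a three-clip corpus and a crude error model, not anybody's real system): corrupt perfect transcripts at roughly a 30% word error rate and see whether a keyword search still surfaces the relevant clips.

```python
# Toy illustration: even transcripts with a substantial word error
# rate remain useful for keyword search, because a query word has to
# be corrupted in *every* relevant segment before search fails.
import random

# Hypothetical "reference" transcripts of three audio clips (made up).
segments = {
    "clip_017": "the committee discussed the pipeline explosion near the border",
    "clip_042": "weather tomorrow will be sunny with light winds",
    "clip_101": "officials confirmed the pipeline will reopen next week",
}

def corrupt(text: str, wer: float) -> str:
    """Replace each word with a garbage token with probability `wer`,
    crudely simulating recognition errors (substitutions only)."""
    return " ".join(w if random.random() > wer else "<err>" for w in text.split())

# Simulate a recognizer that gets about 30% of the words wrong.
transcripts = {clip: corrupt(text, wer=0.30) for clip, text in segments.items()}

# Search the noisy transcripts for a keyword, as an analyst might.
hits = [clip for clip, text in transcripts.items() if "pipeline" in text.split()]
print(hits)  # on most runs, clip_017 and/or clip_101 still show up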
The number of scientific papers on speech recognition, and not just HAL-like speech recognition, has risen continuously since scientists and technologists started to look at the problem. The following chart shows the number of speech recognition papers (dubbed ASR, as in Automatic Speech Recognition) as a fraction of the total number of papers presented at the ICASSP conference from 1978 to 2006:
Kindly made available by Prof. Sadaoki Furui, from the Tokyo Institute of Technology
The number kept growing after 2006, and other conferences show similar figures. So speech recognition is not dead.
Just to cite another inaccuracy of Fortner's blog post, one that particularly touched me, there is the mention of the famous sentence "Every time I fire a linguist my system improves," said (and personally confirmed) by one of the fathers of speech recognition, then the head of speech recognition research at IBM. The meaning of that famous, or infamous, sentence is not a conscious rejection of the deeper dimensions of language. Au contraire. It is the realization that classic linguistic research, based on rules and models derived from linguists' introspection, can only take you so far. Beyond that you need data. Large amounts of data. And you cannot deal with large amounts of data by creating more and more rules in a scholarly, amanuensis manner; you need some powerful tool that can extract information from larger and larger amounts of data without an expert linguist having to look at it bit by bit. And that tool is statistics. In the long run, statistics proved to be so powerful that most linguists became statisticians. The preponderance of statistics in linguistics today is immense. It is enough to go to a conference like ACL (the annual conference of the Association for Computational Linguistics) to see how many topics that used to be approached by traditional linguistics are now the realm of statistics, mathematics, and machine learning. We would not have Web search if we had approached it with traditional linguistic methods. Web search makes mistakes, and yet it is useful. We would not have the Web as we know it if we did not have statistically based Web search. We would not have speech recognition, nor machine translation, nor many other language technologies, with all their limitations, had we not abandoned the traditional linguistic way and embraced the statistical one. And by the way ... that famous speech recognition scientist (whose name is Fred Jelinek, for the record) gave a talk in 2004 entitled "Some of my best friends are linguists."
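To make the "data over rules" contrast concrete, here is a toy sketch of the statistical way (the three-sentence corpus is obviously made up; real systems count over billions of words): bigram probabilities are estimated by simply counting, and no linguist writes a single rule.

```python
# Toy statistical language model: estimate bigram probabilities
# P(word | previous word) by counting over a corpus. No hand-written
# rules; the estimates improve automatically as the corpus grows.
from collections import Counter

corpus = [
    "i want to check my account balance",
    "i want to book a flight",
    "i need to check my flight status",
]

bigrams = Counter()
unigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()  # <s> marks the sentence start
    for prev, word in zip(words, words[1:]):
        bigrams[(prev, word)] += 1
        unigrams[prev] += 1

def p(word: str, prev: str) -> float:
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

# "to" is followed by "check" twice and "book" once in this tiny corpus:
print(p("check", "to"))  # 0.666...
print(p("book", "to"))   # 0.333...
```

A real recognizer combines estimates like these with acoustic scores to prefer plausible word sequences; the point is that the knowledge comes from counting data, not from introspection.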
Let's talk now about speech recognition accuracy. The most frustrating perception of speech recognition accuracy, or the lack of it, comes when we interact with its commercial realizations: dictation, and what people in the trade call IVR, or Interactive Voice Response.
Half of the people who commented on Fortner's blog are happy with automated dictation and have been using it for years. For them, the dictation tool was well worth the little time spent learning how to use it and training it. It is also true that many people tried dictation and it did not work for them. But most likely they were not motivated to use it. If you have a physical disability, or if you need to dictate thousands of words every day, most likely speech recognition dictation will work for you. Or rather, you will learn how to make it work. Again, speech recognition is a tool. If you use a tool, you need to be motivated to use it and to learn how to use it. And I repeat it here ... this is the main concept, the takeaway. Speech recognition is a tool built by engineers, not an attempt to replicate human intelligence. Tools should be used when needed. With a little patience, and used for the goals they were designed for, they can help us.
And now let's get to IVR systems, the ones you talk to on the phone when you would rather talk to a human; without a doubt they are the most pervasive, apparent, and often annoying manifestation of speech recognition and its limitations. They are perceived so badly that even Saturday Night Live makes fun of them. There is even a Web site, gethuman.com, which regularly publishes cheat sheets for getting around them and reaching a human operator right away. But are they so bad? After all, thanks to speech recognition, hundreds of thousands of people can get up-to-the-minute flight information right away by calling the major airlines, make flight and train reservations, get information from their bank accounts, and even get a call 24 hours before their flight and check in automatically, even when they are not connected to the internet. Without speech recognition, that would require hundreds of thousands of live operators, an unaffordable cost for the companies providing the service, and would leave customers waiting in queue for tens of minutes listening to hold music. Yes, these systems sometimes make irritating mistakes. But they turn out to be useful for most people, hundreds of thousands of them. And here I go again with my leitmotif: they are tools, and tools can be useful when used properly and when truly needed.
Fortner claims that in 2001 recognition accuracy topped out at 80%, far short of HAL-like levels of comprehension. What does recognition accuracy mean? Like all measures, it means nothing outside of a context. Recognition accuracy is measured in different ways, but in most cases it measures how many words the recognizer gets wrong, as a percentage of all the words spoken. And that depends on the context. In IVR systems, speech recognizers can get very specialized: the more they hear, the better they get. I work for a company called SpeechCycle. We build sophisticated speech recognition systems that help the customers of our customers, typically service providers, get support and often solve problems. Using statistics and lots of data, we have seen the accuracy of our interactive speech recognition systems grow continuously over time and get better and better as the speech recognizers learned from the experience of thousands of callers. I am sure other companies that build similar systems, our competitors, would claim a similar trend (although I am tempted to say that we do a little better than they do ...). As Jeff Foley, a former colleague of mine, said in his beautifully articulated answer to Fortner's blog post, "[...] any discussion of speech recognition is useless without defining the task [...]". And I especially like Jeff's hammer analogy:
This is like saying that the accuracy of a hammer is only 33%, since it was able to pound a nail but failed miserably at fastening screws and stapling papers together.
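For readers who want the metric pinned down: what is usually reported is the word error rate, the number of word substitutions, deletions, and insertions in the best alignment of the recognizer's output against a reference transcript, divided by the number of reference words. A minimal sketch (the example utterance is made up):

```python
# Minimal word error rate (WER) sketch: WER = (S + D + I) / N, where
# S, D, I are word substitutions, deletions, and insertions in the best
# alignment of hypothesis to reference, and N is the number of reference
# words. Computed here as an edit distance over words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a five-word utterance gives a 20% WER,
# i.e. "80% accuracy" in the sense quoted above.
print(wer("please book a flight tomorrow", "please look a flight tomorrow"))  # 0.2
```

The same engine can score near-perfectly on a yes/no task and much worse on open dictation, which is exactly why a single accuracy number, quoted without its task, says very little.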
In specialized tasks, speech recognition can get well above the 80% accuracy mentioned in Fortner's blog post, which refers to a particular context: that of a very-large-vocabulary open dictation task. By automatically learning from data acquired during the lifecycle of a deployed speech recognizer, you can get to the 90s on average, and to the high 90s in specially constrained tasks (if you are technically inclined you can see, for instance, one of our recent papers). With that you can build useful applications. Yes, you get mistakes now and then, but they can be gracefully handled by a well-designed Voice User Interface. In particular, recognition of digits and of small command vocabularies, like yes and no, can today go beyond 99% accuracy, down to one error every few thousand entries, and such errors can often be corrected automatically if other information is taken into account, like, for instance, the checksum on your credit card number. Another thing to take into consideration is that most errors occur not because speech recognition sucks, but because people, especially occasional users, do not use speech recognition systems for what they were designed to do. Even when they are explicitly and kindly asked to respond with a simple yes or no, they go and say something else, as if they were talking to an omnipotent human operator. It is as if you went to an ATM and entered the amount you want to withdraw when asked for your PIN ... and then complained because the machine did not understand you! I repeat it again: speech recognition is a tool, not HAL 9000, and as such, users should use it for what it was designed for and follow the instructions provided by the prompts; it won't work well otherwise.
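The credit card checksum I just mentioned is, for most card numbers, the Luhn check digit, and it gives a concrete feel for how an application can catch a misrecognized digit without bothering the caller. A minimal sketch:

```python
# Luhn check, the checksum used on most credit card numbers.
# Starting from the rightmost digit, double every second digit;
# if doubling yields a two-digit number, subtract 9 (equivalent to
# adding its digits). The grand total must be divisible by 10.

def luhn_ok(number: str) -> bool:
    total = 0
    for pos, ch in enumerate(reversed(number)):
        digit = int(ch)
        if pos % 2 == 1:  # every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

# "4242424242424242" is a well-known valid test number.
print(luhn_ok("4242424242424242"))  # True
print(luhn_ok("4242424242424241"))  # False: one misheard digit
```

Since the Luhn check fails for any single-digit substitution, a misheard digit can be detected immediately and the prompt simply repeated, which is one of the ways a well-designed Voice User Interface handles recognition errors gracefully.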
And finally, I would like to mention the recent surge of speech recognition applications for voice search and control on smartphones (Google, Vlingo, Bing, Promptu, and counting). We do that at SpeechCycle too, in our customer-care view of the world. This is a new area of application for speech recognition, and it is likely to succeed because, in certain situations, speaking to a smartphone is certainly more convenient than typing on it. The co-evolution of vendors, who will constantly improve their products, and motivated users, who will constantly learn how to use voice search and cope with its idiosyncrasies (as we do all the time with PCs, word processors, blog-editing interfaces, and every other imperfect tool we use to accomplish a purpose), will make speech recognition a transparent technology. A technology we use every day without being aware of it anymore, like the telephone, the mouse, the internet, and everything else that makes our lives easier.