Saturday, June 5, 2010

Speech Recognition--continued


Thanks to all who kindly commented, either privately or through this blog, on my response to Robert Fortner's piece on speech recognition. For completeness, I am reporting here his comment and my response to it.

On May 30th Robert Fortner said:

Hi, Roberto:
Thank you for reading and your impassioned comment.
I read your blog and you write "If you think that speech recognition technology, after 50 years or so of research, would bring us HAL 9000, you are right to think it is dead."
That's what I think!
You go on to say "that type of speech recognition was never alive, except in the dreams of science-fiction writers." I agree that SF writers were big purveyors of that dream, but I think a lot of other people believed in it too, maybe most people--and that's why the death of that dream has gone unrecognized. Nobody wants to talk about it. It's pretty shocking.
What do you mean computers aren't automatically (i.e. with a lot of work by smart people like you) going to progress to understanding language?
Hard to believe.

On May 30th Roberto Pieraccini said:

Hi Robert ... thanks for the response to my response to your blog ... I started working in speech recognition research in 1981 ... Since then I have built speech recognizers, spoken language understanding systems, and finally those dialog systems on the phone that some people hate and techies call IVRs ... (now I don't build anything anymore because I am a manager :) ) ... but during all this time I never believed I would see a HAL-like computer in my lifetime. And I am sure the thousands of serious colleagues and researchers in speech technology around the world never believed that either. In the end we are engineers who build machines. And as we come to realize the inscrutable complexity and sophistication of human intelligence (and speech is one of its most evident manifestations), and the principles on which we base our machines, we soon understand that building something even remotely comparable to a human speaking to another human is beyond the realm of today's technology, and probably beyond the realm of the technology of the next few decades (but of course you never know ... we could not predict the Web 20 years ago ... could we?).
Speech recognition is a mechanical thing ... you get a digitized signal from a microphone, chop it into small pieces, compare the pieces to the models of speech sounds you previously stored in a computer's memory, and give each piece a "likelihood" of being part of each sound. Pieces of sounds make sounds, sounds make words, words make sentences, and you keep scoring all the hypotheses in an orderly fashion based on statistical models of larger and larger entities (sounds, words, sentences), such as models of the probability of a sound following other sounds in a word, a word following other words in a sentence, and so on. At the end you come up with a hypothesis of what was said. And using the mathematical recipes prescribed by the engineers who worked that out, you get a correct hypothesis most of the time ... "most of the time" ... not always. If you do things right, that "most of the time" can become large ... but never 100%. There is never 100% in anything humans, or nature, make ... but sometimes you can get pretty damn close to it ... and that's what we strive for as engineers.
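To make the "mechanical thing" a bit more concrete, here is a minimal toy sketch of that scoring loop, written as a small Viterbi search. Everything in it is an assumption made for illustration: the three "sounds", their one-dimensional Gaussian models, the transition probabilities, and the function names are all hypothetical, and a real recognizer works on multi-dimensional acoustic features with far richer models.

```python
import math

# Hypothetical stored "sound models": each sound is a 1-D Gaussian (mean, std).
SOUND_MODELS = {"a": (1.0, 0.5), "b": (3.0, 0.5), "c": (5.0, 0.5)}

# Hypothetical probabilities of one sound following another.
TRANSITIONS = {
    "a": {"a": 0.6, "b": 0.3, "c": 0.1},
    "b": {"a": 0.1, "b": 0.6, "c": 0.3},
    "c": {"a": 0.1, "b": 0.1, "c": 0.8},
}


def log_likelihood(frame, sound):
    """Log of the Gaussian density of one piece of signal under one sound model."""
    mean, std = SOUND_MODELS[sound]
    return -0.5 * math.log(2 * math.pi * std ** 2) - (frame - mean) ** 2 / (2 * std ** 2)


def decode(frames):
    """Return the best-scoring sequence of sounds and its log score (Viterbi search)."""
    sounds = list(SOUND_MODELS)
    # Start with a uniform prior over sounds, scored against the first piece.
    scores = {s: math.log(1 / len(sounds)) + log_likelihood(frames[0], s) for s in sounds}
    backpointers = []
    for frame in frames[1:]:
        new_scores, pointers = {}, {}
        for s in sounds:
            # Best previous sound to have come from, given the transition model.
            prev = max(sounds, key=lambda p: scores[p] + math.log(TRANSITIONS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANSITIONS[prev][s]) + log_likelihood(frame, s)
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace the winning hypothesis back from the last piece to the first.
    best = max(sounds, key=scores.get)
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[best]


# Five made-up "pieces" of signal, each reduced to a single number.
path, score = decode([1.1, 0.9, 3.2, 2.8, 5.1])
```

Working in log probabilities keeps the scores from underflowing as the number of pieces grows, and the same pattern is repeated at the level of words and sentences with their own statistical models.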
So, there is no human-like intelligence (God forbid HAL-like evil intelligence) in speech recognition. No intelligence in the traditional human-like sense ... (but ... what's intelligence anyway?). There is no knowledge of the world, no perception of the world, and nothing like having experienced and thought about the world for every minute of our conscious and unconscious life. Speech recognition is a machine which compares pieces of signal with models of them ... period. And doing that the "statistical" way works orders of magnitude better than doing it in a more "knowledge-based" inferential, reasoning way ... I mean doing it in an AI-ish manner ... We tried that--the AI-ish knowledge-based approach--very hard in the 1970s and 1980s, but it always failed, until the "statistical" brute-force approach started to prevail and gain popularity in the early 1980s. AI failed because it assumed you can put all the knowledge into a computer by creating rational models that explain the world ... and letting the computer reason about it. In the end it is the eternal struggle between rationalism and empiricism ... elegant rationalism (AI) lost the battle (some think the battle ... not the war) because stupid brute-force pragmatic empiricism (statistics) was cheaper and more effective ...
So, if you accept that ... i.e. if you accept that speech recognition is a mechanical thing with no pretense of HAL-like "Can't do that Dave" conversations, you start believing that even that dumb mechanical thing can be useful. For instance, instead of asking people to push buttons on a 12-key telephone keypad, you can ask them to say things. Instead of pushing the first three letters of the movie you wanna see, you can ask them to "say the name of the movie you wanna see" (do you remember the hilarious Seinfeld episode where Kramer pretended he was an IVR system? http://www.youtube.com/watch?v=uAb3TcSWu7Q) ... and why not? If you are driving your car, you can probably use that mechanical thing to enter the new destination on your navigation system without fiddling with its touch screen. And maybe you will be able to do the same with your iPhone or Android phone. At the basis there is a belief that saying things is more natural and effective than pushing buttons on a keypad (at least in certain situations). And one thing leads to another ... technology builds on technology ... creating more and more complex things that hopefully work better and better. These are the dreams of us engineers ... not the dream of HAL (although I have to say that probably that dream unconsciously attracted us to this field). Why that disconnect between engineers' dreams and laypeople's dreams? Who knows? But, as I said, bad scientific press, bad media, movies, and bad marketing probably contributed to that, besides the collective unconscious dream of our species, that of building a machine that resembles us in all our manifestations (Pygmalion?).
I am not sure about your last question. What I meant is that computers *are* automatically going to progress in language understanding. But they are doing that by following "learning recipes" prescribed by the smart people out there and by digesting oodles of data (which is more and more available, and computers are good at that). The learning recipes we have figured out until now have brought us this far. If we don't give up on teaching and fostering speech recognition and machine learning research, one day some smart kid from some famous or less famous university somewhere in the world will figure out a smarter "recipe" ... and maybe we will have a HAL-like speech recognizer ... or something closer to it ...
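As a small illustration of what a "learning recipe" that digests data can look like, here is a minimal sketch that estimates the probability of a word following another word simply by counting pairs in example sentences. It is only one of the simplest possible recipes, not the method used in any particular recognizer; the function name, the `<s>` start marker, and the three-sentence corpus are all made up for the example.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(word | previous word) by counting adjacent word pairs."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # "<s>" is a hypothetical marker for the start of a sentence.
        words = ["<s>"] + sentence.lower().split()
        for prev, word in zip(words, words[1:]):
            pair_counts[prev][word] += 1
    model = {}
    for prev, counts in pair_counts.items():
        total = sum(counts.values())
        # Relative frequency: how often each word followed `prev` in the data.
        model[prev] = {word: count / total for word, count in counts.items()}
    return model


# A tiny made-up corpus; real systems digest vastly more data than this.
CORPUS = [
    "say the name of the movie",
    "say the destination",
    "enter the destination",
]
model = train_bigram_model(CORPUS)
```

With more data the counts become more reliable, which is exactly where the "oodles of data" come in; the recipe itself stays the same.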