
Tuesday, December 7, 2010

What’s the “I” in “AI”?


When people ask me what I do for a living I simply say: I do “AI”, as in Artificial Intelligence. In reality, I always tried to stay away from the term AI, until I realized it is the easiest way to describe what I do, at least in a general sense and to the layperson. I don’t like terms like AI—just as we no longer like the out-of-fashion term “electronic brain”—because words like “intelligence” easily create false expectations and confusion. First of all, how can we talk about artificial intelligence when we do not have the faintest idea of what the natural one is? That vaguely reminds me of Wittgenstein’s remark to a friend who told him she was so sick she felt “like a dog that has been run over”: “You don’t know what a dog that has been run over feels like!”
But, philosophical characterizations, imprecise generalizations, metaphors, and analogies aside, after its boom in the 1970s AI came to refer mainly to an approach to building “intelligent” machines that, in some way, mimicked what we believed the human intelligent process to be. In other words, AI methods were, for a long time, based on a well-defined inference process which, starting from some facts and observations, elegantly led to conclusions based on a more or less large set of rules. The rules were typically derived, and painstakingly coded into a computer, by “expert” humans. But, unfortunately, that process never really worked for building machines that simulate basic human activities, like speaking, understanding language, and making sense of images.
A different approach, developed by people who humbly called themselves “engineers”—and Fred Jelinek, who sadly passed away last September, was one of them—did not have the pretense of “replicating human brain capabilities,” but simply tried to bring human capabilities, like the recognition and understanding of speech, into machines from a statistical point of view. In other words, no rules compiled by experts, but machines that autonomously digest millions and millions of data samples, and then match them—in a statistical sense—to observations of reality and draw conclusions from that. I belong to this school, and for a long time, like all the others in this school, I did not want to be associated with AI.
But today, probably, it does not make much sense to make that distinction anymore. The AI discipline has assumed so many angles, so many variations, that it no longer characterizes a “way of doing things,” but the final result. The term AI, which had disappeared for a decade or more—during the so-called AI winter—came back, probably resuscitated by Spielberg’s movie, and, more and more, laypeople associate AI with building machines that (or should we say “who”?) try to do what humans do…more or less. That is: speak, understand, translate, and draw conclusions from data. Unfortunately there are still some who want to make that distinction when they pitch their technology and say…we use AI…which I believe is nonsense. What’s the “I” in AI? So, yes…I work in AI…if you like that.

Tuesday, August 17, 2010

Inverse Reality


Upside-down reality reflected in a lake near a tiny village in the Italian western Alps (Valle D'Aosta, Lake Lod, Valtournenche). Reality blends with a reflection of itself; reeds become leaves, falling from an improbable pastel, blurred sky.

Saturday, July 24, 2010

The Emptiness of the Language Universe


As we know, the universe is pretty much empty. There is nothing there, really. The 10^80 or so atoms that constitute the total amount of matter available to us are spread around a vast universe of infinite emptiness, sometimes concentrated in stars and planets. In the most remote parts of the universe the chance of finding an atom is actually as remote as that of finding a needle in the ocean. There is emptiness at the cosmic scale, but also at the atomic scale. Every single atom is a universe of emptiness in itself. We have the perception of solid matter: the elasticity of my skin if I pinch myself, or the hardness of a marble tabletop. But that’s just an illusion created by the forces that keep the electrons orbiting around the nucleus and those that keep the protons and the neutrons together. The reality is that within the electron orbits there is absolutely nothing, emptiness so to speak. Each electron is a universe away from the others and from the nucleus. Everything else is emptiness. If I were insensitive to the forces inside the atoms, I could pass through walls like a ghost, because there is nothing really there.

Actually, the thought of the emptiness of the universe came to me in relation to an analogous emptiness: that of the universe of language. Borges’ Library of Babel is a vast emptiness of nonsensical books, a universe of empty gibberish where only a tiny fraction of the books are actually readable. That fraction is way tinier than the actual density of matter in the universe.
Let's not even talk about books (each book in the Babel library has exactly 410 pages), but just a single page. Let's not even think about the whole of Shakespeare, but just one sonnet that can comfortably fit on a double-spaced page. Let’s say we have 30 characters in all: all the letters, punctuation, and space, and let’s say we can fit 1,000 characters on a page. The number of possible pages of text, obtained as all the possible sequences of 1,000 characters drawn from those 30 symbols, is 30 to the power of 1,000, a number with nearly 1,500 digits. A disproportionately huge number, even compared with the number of atoms in the universe, which is only ten to the eightieth. There wouldn't be enough atoms in the universe to print all these pages, nor even to store them digitally. And only a teeny-tiny fraction of them actually makes some sense, is somehow readable. That’s at the page or book scale. But language is empty at all scales, like the matter in the universe from the cosmic scale to the atomic one. At the lexical level, if you consider the actual words of your language, they are only a tiny fraction of all the possible words you can build with the symbols of your alphabet. At the syntactic level, the combinations of words which are actually grammatically correct are only an infinitesimal fraction of all the possible combinations of words. Same at the semantic level: among all of the syntactically correct sentences, those that actually make sense are just a tiny fraction. Like the universe and the atoms, the language we actually speak and write is only a small, infinitesimally small, fraction of sense lost in a vast empty combinatorial universe. Everything else is emptiness.
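If you want to check the arithmetic yourself, here is a minimal back-of-the-envelope sketch in Python; the 30-symbol alphabet, the 1,000 characters per page, and the commonly cited 10^80 atoms are the only assumptions, taken from the paragraph above.

```python
# Back-of-the-envelope check of the numbers above.
# Assumptions: a 30-symbol alphabet, 1,000 characters per page,
# and roughly 10^80 atoms in the observable universe.
import math

alphabet_size = 30
chars_per_page = 1000
atoms_log10 = 80  # commonly cited order of magnitude

# Number of possible pages = 30^1000; work in log10 to avoid huge integers.
pages_log10 = chars_per_page * math.log10(alphabet_size)
print(f"Possible pages  ~ 10^{pages_log10:.0f}")                 # ~ 10^1477
print(f"Atoms available ~ 10^{atoms_log10}")
print(f"Pages per atom  ~ 10^{pages_log10 - atoms_log10:.0f}")   # ~ 10^1397
```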

Saturday, June 19, 2010

The wick, the internet, and Andy Warhol


I just finished reading Nicholas Carr’s “The Big Switch: Rewiring the World, from Edison to Google.” It is a book about technology, and how certain types of technologies, initially intended for individual use, gradually become utilities. Electric power is a classic example of that phenomenon. After the electric power generator became cost effective, factories and some households started to buy, own, and manage their own generators. But then someone—it was Edison’s personal secretary, Samuel Insull—figured out that producing energy in a single place and distributing it for a fee would be much more efficient and cheaper for everyone. In other words, Insull created the “hosted” or “on demand” model for electric power: you pay for what you use. That was not the first example of this progression from individual usage to utility. Before him, and before electric power was possible, Henry Burden built a giant water wheel in upstate New York in 1852 and used it to distribute mechanical power to the farms and factories in its neighborhood. And if you think about it, many other technologies followed the same path. Music: from individual gramophones to radio. Transportation: from individual coaches to public transport. And finally the computer: from individual computers to the cloud, or what Carr calls the World Wide Computer.
Interestingly enough, computers started out with the idea of being mainly a utility. Computers in the 1970s and 1980s were too expensive for individuals and small companies to own, so they were deployed as mainframes in centralized locations, and computing power was distributed to users for a metered fee based on CPU usage. But unfortunately, back then, data sharing was quite difficult because of the modest data communication bandwidth available. So yes, you could have your data stored in some central location, but uploading or downloading it from a remote location could take huge amounts of time, unless you could walk or drive to the central computing facility and hand them your tapes or stacks of cards. That’s one of the reasons that made personal computing so popular. So we went back from a utility to individual deployment just because the infrastructure was not able to properly support the distribution. But now it is a different story. Bandwidth has increased enormously during the past few years, and the idea of a centralized computer became, all of a sudden and once again, a viable and attractive alternative to personal computers. In just a couple of years the term “cloud computing” started to acquire immense popularity, indicating a model where computing power (CPU cycles) is distributed from a virtually centralized location to everywhere in the world. It does not take much imagination to see a future not far from now where we won’t buy full-fledged computers for our homes or offices, but simple appliances which, once connected to the network, will be able to use the virtually infinite computing power and storage provided by the Amazons and Googles of the world. Keeping tens of thousands of song tracks, pictures, and movies stored on our home computers as we have done during the past years—with the added complication of keeping them sorted, backed up, etc.—starts feeling a little bit outdated and unnecessary once you start using services like Pandora, Flickr, or Netflix.
But let’s get back to the thing I wanted to talk about: the wick. The epilogue of Carr’s book starts by saying that the wick is one of man’s greatest inventions, as well as one of the most modest ones. It allowed us to move from the primitive torchlight to the more civilized candle, which remained the dominant lighting technology for hundreds of years, only to be replaced by the wickless gas lamp and then by Edison’s incandescent bulb. Indeed the bulb did bring huge benefits to society and industry but, in Carr’s own words, it also brought subtle and unexpected changes to the way people lived. The candle constituted a focal point for families. In the evening families gathered around the flickering light of a candle to talk, tell stories, and be together. With the advent of electric light, the family started to disperse around the house, and each family member started to spend more time in their own room or space during the evening.
Other technologies brought societal changes of the same order of magnitude as those brought by electric light. I remember when I was a kid, there was only one television in my house—like in most other people’s houses—and at that time in Italy we had only two TV channels. The whole family waited for the time, after dinner, to gather around the only TV and watch pretty much the only choice of show or movie, because the choice of which channel to watch was often obvious: a thriller on one channel, a documentary on shepherding on the other; the most popular quiz show of the decade on one channel, chamber music on the other one. The whole family sat in front of the TV watching in silence, all together, every night. I even remember when we bought our first washing machine: the first day we all sat in front of it in awe, waiting for the washing program to change from prewash, to wash, to rinse. That night the excitement of the new machine surpassed that of the show on TV. Then, with more TVs in the house, and with so many channels to choose from, that moment of togetherness disappeared, and everyone was in their room after dinner, listening to music or watching their favorite show. And no…we don’t watch washing machines anymore…
The same thing happened with computers. At the beginning of my working career, in the early 1980s, my wife and I lived in the same building as a couple of friends. We and the other couple decided to share the costs and buy a Commodore 64 together. After dinner we would gather in one of the two apartments and play with our first home computer, which had a TV as its monitor and a cassette recorder as its only storage device, replaced one year later by a shiny brand-new 5.25-inch floppy disk unit. Hours and hours together, sipping wine and playing Aztec Tomb, Pac-Man, and Mission 5 rescue. Then PCs became cheap, cheap enough for each one of us to have our own, and we stopped getting together to play video games. Then the Internet and the Web came. I remember Mosaic, the first browser. I compiled it and installed it on my remote UNIX machine first, and then on my home PC. The first days of the Web created such an intense sense of curiosity that it was not uncommon for the whole family to gather around the PC and browse. Then we got used to it. More PCs at home, wireless internet, everyone navigating to wherever they felt like.
On the one hand, the personal computer and the internet contributed, more than any other technology, to the dispersion of our social nuclei. Browsing the internet, listening to streamed music, watching online movies: these are personal affairs, not tribe-gathering moments. But paradoxically, on the other hand, the internet helped people get closer. Who hasn’t found lost friends and connected with them again on Facebook, LinkedIn, or Skype? Or just googled the name of an old buddy and found that he is now a famous writer, or has a webpage with his picture and email address? Our societal rules and rituals, and the way we connect with each other, are changing forever because of this technology—the web—which has had one of the largest-scale impacts on humanity as a whole. So, paradoxically, the same internet that drew us apart brings us together. Not in the same way as the wick, but in a way which is completely different and new.
When I was reading Carr’s book discussing how the internet is giving everyone a unique opportunity to publish and share their writings, pictures, movies, etc., my first thought was of Andy Warhol's famous quote: “In the future everyone will be world-famous for 15 minutes.” I don’t know if Andy Warhol envisioned something as big as the web, but that is definitely what the web is bringing us: the potential to get our 15 minutes of fame.

Saturday, June 5, 2010

Speech Recognition--continued


Thanks to all who kindly commented, either privately or through this blog, on my response to Robert Fortner's piece on speech recognition. For completeness, I am reporting here his comment, and my response to his comment.

On May 30th Robert Fortner said:

Hi, Roberto:
Thank you for reading and your impassioned comment.
I read your blog and you write "If you think that speech recognition technology, after 50 years or so of research, would bring us HAL 9000, you are right to think it is dead."
That's what I think!
You go on to say "that type of speech recognition was never alive, except in the dreams of science-fiction writers." I agree that SF writers were big purveyors of that dream, but I think a lot of other people believed in it too, maybe most people--and that's why the death of that dream has gone unrecognized. Nobody wants to talk about it. It's pretty shocking.
What do you mean computers aren't automatically (i.e. with a lot of work by smart people like you) going to progress to understanding language?
Hard to believe.

On May 30th Roberto Pieraccini said:

Hi Robert ... thanks for the response to my response to your blog ... I started working in speech recognition research in 1981 ... Since then I have built speech recognizers, spoken language understanding systems, and finally those dialog systems on the phone that some people hate and techies call IVRs ... (now I don't build anything anymore because I am a manager :) ) ... but during all this time I never believed I would see a HAL-like computer in my lifetime. And I am sure the thousands of serious colleagues and researchers in speech technology around the world never believed that either. In the end we are engineers who build machines. And as we come to realize the inscrutable complexity and sophistication of human intelligence (and speech is one of its most evident manifestations), and the principles on which we base our machines, we soon understand that building something even remotely comparable to a human speaking to another human is beyond the realm of today's technology, and probably beyond the realm of the technology of the next few decades (but of course you never know ... we could not predict the Web 20 years ago ... could we?).
Speech recognition is a mechanical thing ... you get a digitized signal from a microphone, chop it into small pieces, compare the pieces to the models of speech sounds you previously stored in a computer's memory, and give each piece a "likelihood" of being part of that sound. Pieces of sounds make sounds, sounds make words, words make sentences, and you keep scoring all the hypotheses in an orderly fashion based on statistical models of larger and larger entities (sounds, words, sentences), such as models of the probability of a sound following other sounds in a word, of a word following other words in a sentence, and so on. At the end you come up with a hypothesis of what was said. And using the mathematical recipes prescribed by the engineers who worked that out, you get a correct hypothesis most of the time ... "most of the time" ... not always. If you do things right, that "most of the time" can become large ... but never 100%. There is never 100% in anything humans, or nature, make ... but sometimes you can get pretty damn close to it ... and that's what we strive for as engineers.
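To make that scoring idea concrete, here is a toy sketch in Python. Every number and candidate "word" in it is made up for illustration and has nothing to do with any real recognizer; it only shows how acoustic likelihoods and a simple "which word tends to follow which" model combine to pick the best-scoring hypothesis.

```python
import math

# Toy acoustic scores: how well each chunk of audio matches each candidate.
# (In a real recognizer these come from comparing signal pieces to stored
# sound models; here they are just invented numbers.)
acoustic = [
    {"recognize": 0.6, "wreck a nice": 0.4},   # first chunk of audio
    {"speech": 0.5, "beach": 0.5},             # second chunk of audio
]

# Toy "language model": probability of a candidate given the previous one.
bigram = {
    ("<s>", "recognize"): 0.5, ("<s>", "wreck a nice"): 0.5,
    ("recognize", "speech"): 0.8, ("recognize", "beach"): 0.2,
    ("wreck a nice", "speech"): 0.1, ("wreck a nice", "beach"): 0.9,
}

def score(hypothesis):
    """Combine acoustic and language-model evidence in log space."""
    total, prev = 0.0, "<s>"
    for chunk, word in zip(acoustic, hypothesis):
        total += math.log(chunk[word]) + math.log(bigram[(prev, word)])
        prev = word
    return total

hypotheses = [("recognize", "speech"), ("recognize", "beach"),
              ("wreck a nice", "speech"), ("wreck a nice", "beach")]
best = max(hypotheses, key=score)
print("Best hypothesis:", " ".join(best))   # "recognize speech" wins here
```

In a real system the same kind of combination is carried out over an enormous number of hypotheses with dynamic programming, but the principle is the same.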
So, there is no human-like intelligence (God forbid HAL-like evil intelligence) in speech recognition. No intelligence in the traditional human-like sense ... (but ... what's intelligence anyway?). There is no knowledge of the world, no perception of the world, no having experienced and thought about the world for every minute of our conscious and unconscious life. Speech recognition is a machine which compares pieces of signal with models of them ... period. And doing that the "statistical" way works orders of magnitude better than doing it in a more "knowledge-based," inferential, reasoning way ... I mean doing it in an AI-sh manner ... We tried that--the AI-sh knowledge-based approach--very hard in the 1970s and 1980s but it always failed, until the "statistical" brute-force approach started to prevail and gain popularity in the early 1980s. AI failed because the assumption on which it was based presumed you could put all the knowledge into a computer by creating rational models that explain the world ... and let the computer reason about it. In the end it is the eternal struggle between rationalism and empiricism ... elegant rationalism (AI) lost the battle (some would say the battle ... not the war) because stupid brute-force pragmatic empiricism (statistics) was cheaper and more effective ...
So, if you accept that ... i.e. if you accept that speech recognition is a mechanical thing with no pretense of HAL-like "Can't do that, Dave" conversations, you start believing that even that dumb mechanical thing can be useful. For instance, instead of asking people to push buttons on a 12-key telephone keypad, you can ask them to say things. Instead of asking them to push the first three letters of the movie they wanna see, you can ask them to "say the name of the movie you wanna see" (do you remember the hilarious Seinfeld episode where Kramer pretended he was an IVR system? http://www.youtube.com/watch?v=uAb3TcSWu7Q) ... and why not? If you are driving your car, you can probably use that mechanical thing to enter a new destination on your navigation system without fidgeting with its touch screen. And maybe you can do the same with your iPhone or Android phone. At the basis there is the belief that saying things is more natural and effective than pushing buttons on a keypad, at least in certain situations. And one thing leads to another ... technology builds on technology ... creating more and more complex things that hopefully work better and better. These are the dreams of us engineers ... not the dream of HAL (although I have to say that probably that dream unconsciously attracted us to this field). Why that disconnect between engineers' dreams and laypeople's dreams? Who knows? But, as I said, bad scientific press, bad media, movies, and bad marketing probably contributed to that, besides the collective unconscious of our species, that of building a machine that resembles us in all our manifestations (Pygmalion?).
I am not sure about your last question. What I meant is that computers *are* automatically going to progress in language understanding. But they are doing that by following "learning recipes" prescribed by the smart people out there and by digesting oodles of data (which is more and more available, and computers are good at that). The learning recipes we have figured out until now have brought us this far. If we don't give up on teaching and fostering speech recognition and machine learning research, one day some smart kid from some famous or less famous university somewhere in the world will figure out a smarter "recipe" ... and maybe we will have a HAL-like speech recognizer ... or something closer to it ...

Sunday, May 30, 2010

Un-rest in Peas: The Unrecognized Life of Speech Recognition (or “Why we do not have HAL 9000 yet”)

I read Robert Fortner’s blog post on the death of speech recognition as soon as it came out, about a month ago. For several reasons, it took me a while to decide to craft a response, not least my eternal laziness and my attitude toward procrastination. But with a few hours made available by an inspired bout of jet-lagged insomnia, I decided to go ahead and write about it.

I have to admit that speech recognition research, and the use of it, has occupied more than half of my life. So I am personally, mentally, and sentimentally attached to speech recognition. But at the same time I am frustrated, disappointed, and disillusioned at times. Speech recognition is a great technology, I made a career out of it, but yes…it does make mistakes.

What I felt when I read Fortner’s post was no surprise. We all may feel that speech recognition did not live up to its promises. Even those of us who have been working on it and with it for decades sometimes feel that sense of failure, of an unrealized dream. But writing eulogies for an undead person is not fair. First of all speech recognition--the undead one--is not, and does not want to be, what laypeople think it is. If you think that speech recognition technology, after 50 years or so of research, would bring us HAL 9000, you are right to think it is dead. But that type of speech recognition was never alive, except in the dreams of science-fiction writers. If you think that that’s the speech recognition we should have strived for, yes…that’s dead. But I would say that that dream was never alive in any reasonable way for most speech scientists—so they call us, the geeks who have dedicated time and cycles to making computers recognize speech. We all knew that we would never see a HAL 9000 in our lifetimes.

Saying that speech recognition is dead because its accuracy falls far short of HAL-like levels of comprehension is like saying that aeronautical engineering is dead because commercial airplanes cannot go faster than 1,000 miles per hour, and by the way…they cannot get people to the moon. Similarly, we could say that medicine is dead because we cannot always cure cancer, or that computer science is dead because my PC gets jammed and I have to reboot it now and then. There are limitations in every one of our technologies, but the major limitations we perceive are the result of our false assumptions about what the goals are, our misuse of the technology, and the false promises spread by pseudoscientific press and media. Speech recognition is not about building HAL 9000. Speech recognition is about building tools, and like all tools, it may be imperfect. Our job is to find a good use for an imperfect, often crummy, tool that can sometimes make our life easier.

Robert Fortner’s blog post captures some truths about speech recognition, but it is more like a collection of data points read here and there, out of context. I would probably make the same mistakes if I read a few papers on genetic research and tried to write a critique of that field. For instance, among many other things, it is not accurate to say that the accuracy of speech recognition flat-lined in 2001 before reaching human levels. It is true that “some” funding plugs were pulled—mainly the DARPA funds on interactive speech recognition projects mostly devoted to dialog systems. But 2001 was a year of dramatic changes for many things. 9/11 turned the attention of the funding agencies to tasks more urgently important than talking to computers to make flight reservations. The funding plug on speech recognition technology was not pulled, but the goals were changed. For instance, DARPA itself started a quite large project called GALE (as in Global Autonomous Language Exploitation), one of whose goals was to interpret huge volumes of speech and text in multiple languages. And of course, the main purpose of that was homeland security. The amount of audio information available today—Web, broadcasts, recorded calls, etc.—is huge even compared with the amount of text. Without speech recognition there is no way we can search through it, so all that potentially useful information is virtually unavailable to us. And that is worsened by the fact that the vast majority of the audio available around us is not in English, but in many other languages for which a human translator may not be at hand when and where we want one. Now, even an imperfect tool such as speech recognition, combined with imperfect machine translation, can still give us a handle to search and interpret vast amounts of raw audio. An imperfect transcription followed by an imperfect translation can still give us a hint of what a recording is about, and maybe help us select a few audio samples to be listened to or translated by a professional human translator. That’s better than nothing, and even a small imperfect help can be better than no help at all. GALE and similar projects around the world gave rise to new goals for speech recognition, and to a wealth of research and studies which is reflected in the rising number of papers at the major international conferences and journals. Large conferences like ICASSP and Interspeech, as well as dozens of specialized workshops, attract thousands of researchers from around the world every year. And the number is not declining. Other non-traditional uses of speech recognition and speech technology in general have emerged—to cite a few, emotion detection, or even the recognition of deception through speech analysis, which proves to be more accurate than humans’ (apparently computers can detect liars from their speech better than parole officers, who score a measly 42%...).
The number of scientific papers on speech recognition—and not just HAL-like speech recognition—has risen continuously since scientists and technologists first started looking at the problem. The following chart shows the number of speech recognition papers (dubbed ASR, as in Automatic Speech Recognition) as a fraction of the total number of papers presented at the ICASSP conference from 1978 to 2006:
Kindly made available by Prof. Sadaoki Furui, from the Tokyo Institute of Technology
The number kept growing after 2006, and similar figures hold for other conferences. So speech recognition is not dead.

Just to cite another inaccuracy in Fortner’s blog post—one that particularly touched me—there is the mention of the famous sentence “Every time I fire a linguist my system improves,” said—and personally confirmed—by one of the fathers of speech recognition, then the head of speech recognition research at IBM. The meaning of that famous—or infamous—sentence is not a conscious rejection of the deeper dimensions of language. Au contraire. It is the realization that classic linguistic research, based on rules and models derived from linguists’ introspection, can only take you so far. Beyond that you need data. Large amounts of data. And you cannot deal with large amounts of data by creating more and more rules in a scholarly, amanuensis manner; you need some powerful tool that can extract information from larger and larger amounts of data without an expert linguist having to look at it bit by bit. And that tool is statistics. In the long run statistics proved to be so powerful that most linguists became statisticians. The preponderance of statistics in linguistics today is immense. It is enough to go to a conference like ACL (the annual conference of the Association for Computational Linguistics) to see how many topics that used to be approached by traditional linguistics are now the realm of statistics, mathematics, and machine learning. We would not have Web search if we had approached it with traditional linguistic methods. Web search makes mistakes, and yet it is useful. We would not have the Web if we did not have statistically based Web search. We would not have speech recognition, nor machine translation and many other language technologies, with all their limitations, if we had not abandoned the traditional linguistic way and embraced the statistical way. And by the way…that famous speech recognition scientist (whose name is Fred Jelinek, for the record) gave a talk in 2004 entitled “Some of my best friends are linguists.”

Let’s talk now about speech recognition accuracy. The most frustrating perception of speech recognition accuracy—or the lack of it—comes when we interact with its commercial realizations: dictation and what people in the trade call IVR, or Interactive Voice Response.

Half of the people who commented on Fortner’s blog are happy with automated dictation and have been using it for years. For them the dictation tool was well worth the little time spent learning how to use it and training it. It is also true that many people tried dictation and it did not work for them. But most likely they were not motivated to use it. If you have a physical disability, or if you need to dictate thousands of words every day, most likely speech recognition dictation will work for you. Or better, you will learn how to make it work. Again, speech recognition is a tool. If you use a tool you need to be motivated to use it, and to learn how to use it. And I repeat it here…this is the main concept, the takeaway. Speech recognition is a tool built by engineers, not an attempt to replicate human intelligence. Tools should be used when needed. With a little patience, if we use them for the goals they were designed for, they can help us.

And let’s get now to IVR systems, those you talk to on the phone when you would rather talk to a human; without a doubt they are the most pervasive, apparent, and often annoying manifestation of speech recognition and its limitations. They are perceived so badly that even Saturday Night Live makes fun of them. There is even a Web site, gethuman.com, which regularly publishes cheat sheets to get around them and reach a human operator right away. But are they so bad? After all, thanks to speech recognition, hundreds of thousands of people can get up-to-the-minute flight information right away by calling the major airlines, make flight and train reservations, get information about their bank accounts, and even get a call 24 hours before their flight and check in automatically even if they are not connected to the internet. Without speech recognition that would require hundreds of thousands of live operators—an unaffordable cost for the companies providing the service—and tens of minutes of waiting in queue listening to hold music for their customers. Yes, these systems sometimes make irritating mistakes. But they turn out to be useful for most people, hundreds of thousands of them. And here I go again with my leitmotif. They are tools, and tools can be useful when used properly and when truly needed.

Fortner’s post states that in 2001 recognition accuracy topped out at 80%, far short of HAL-like levels of comprehension. But what does recognition accuracy mean? Like all measures, it means nothing outside of a context. Recognition accuracy is measured in different ways, but in most cases it measures how many words the recognizer gets wrong, as a percentage of the words that were actually spoken. And that depends on the context. In IVR systems speech recognizers can get very specialized. The more they hear, the better they get. I work for a company called SpeechCycle. We build sophisticated speech recognition systems that help the customers of our customers—typically service providers—get support and often solve problems. Using statistics and lots of data we have seen the accuracy of our interactive speech recognition systems grow continuously over time and get better and better as the speech recognizer learned from the experience of thousands of callers. I am sure other companies that build similar systems—our competitors—would claim a similar trend (although I am tempted to say that we do a little better than they do…). As Jeff Foley—a former colleague of mine—said in his beautifully articulated answer to Fortner’s blog post, “[…] any discussion of speech recognition is useless without defining the task […].” And I especially like Jeff’s hammer analogy: this is like saying that the accuracy of a hammer is only 33%, since it was able to pound a nail but failed miserably at fastening screws and stapling papers together.
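For readers curious about how that error percentage is typically computed, here is a minimal sketch of the standard word error rate calculation: an edit distance over words between what was said and what the recognizer produced. The two example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / words spoken,
    computed with a standard edit-distance dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Made-up example: one substitution and one deletion over five spoken words.
spoken     = "i want to check in"
recognized = "i want a check"
print(f"WER: {word_error_rate(spoken, recognized):.0%}")  # 40%
```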
In specialized tasks speech recognition can get well above the 80% accuracy mentioned in Fortner’s blog post, which refers to a particular context: that of a very large-vocabulary open dictation task. By automatically learning from data acquired during the lifecycle of a deployed speech recognizer you can get to the 90s on average, and to the high 90s in specially constrained tasks (if you are technically inclined you can see, for instance, one of our recent papers). With that you can build useful applications. Yes, you get mistakes now and then, but they can be gracefully handled by a well-designed Voice User Interface. In particular, recognition of digits and of command vocabularies, like yes and no, can today go well beyond 99%, down to one error every few thousand entries, and those errors can often be automatically corrected if other information is taken into account, like, for instance, the checksum on your credit card number. Another thing to take into consideration is that most errors occur not because speech recognition sucks, but because people, especially occasional users, do not use speech recognition systems for what they were designed for. Even when they are explicitly and kindly asked to respond with a simple yes or no, they go and say something else, as if they were talking to an omnipotent human operator. It is as if you went to an ATM, entered the amount you want to withdraw when you were asked for your PIN…and then complained because the machine did not understand you! I repeat it again: speech recognition is a tool, not HAL 9000, and as such users should use it for what it is designed for and follow the instructions provided by the prompts; it won’t work well otherwise.
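As an aside, the credit card checksum mentioned above is a concrete example of that kind of correction. Here is a minimal sketch of the standard Luhn check, which a dialog application could use to reject a misrecognized card number and re-prompt the caller; the numbers below are classic test values, not real cards.

```python
def luhn_valid(digits):
    """Luhn check: if the checksum of a recognized card number fails,
    the application can reject the hypothesis and ask again."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111111111111111"))  # True: a classic test number
print(luhn_valid("4111111111111112"))  # False: one misrecognized digit
```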

And finally I would like to mention the recent surge of speech recognition applications for voice search and control on smartphones (Google, Vlingo, Bing, Promptu, and counting). We do that at SpeechCycle too, in our customer-care view of the world. This is a new area of application for speech recognition which is likely to succeed because, in certain situations, speaking to a smartphone is certainly more convenient than typing on it. The co-evolution of vendors, who will constantly improve their products, and motivated users, who will constantly learn how to use voice search and cope with its idiosyncrasies (like we do all the time with PCs, word processors, blog editing interfaces, and every imperfect tool we use to accomplish a purpose), will make speech recognition a transparent technology. A technology we use every day without even being aware of it anymore, like the telephone, the mouse, the internet, and everything else that makes our life easier.

Tuesday, May 18, 2010

High Dynamic Range Photographs



Pescadero, California (Roberto Pieraccini)

I discovered High Dynamic Range (HDR) photography recently. The idea is to give pictures a higher range of light intensity between the lightest and darkest areas of an image. Standard photography allows for a low range of luminosity--lower than that of the human eye--so if you expose for a luminous sky, the landscape becomes too dark, and if you expose for the landscape, the sky becomes overexposed. The solution consists in shooting two or more identical pictures with different exposures, say one for the sky and one for the landscape, and then merging them. The result can be pleasing to the eye...sometimes cheesy... the art, in my opinion, consists in going for the former while avoiding the latter...
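For the technically curious, here is a minimal sketch of that merging step, assuming Python and OpenCV's Mertens exposure fusion as one possible way to blend the bracketed shots; the file names are placeholders for two or more frames of the same scene taken at different exposures, ideally from a tripod.

```python
import cv2
import numpy as np

# Placeholder file names: bracketed shots of the same scene, e.g. one
# exposed for the sky and one for the landscape.
exposures = [cv2.imread(f) for f in ("sky_exposure.jpg", "land_exposure.jpg")]

# Mertens exposure fusion blends the best-exposed regions of each frame.
merge = cv2.createMergeMertens()
fused = merge.process(exposures)              # float image, roughly in [0, 1]

result = np.clip(fused * 255, 0, 255).astype("uint8")
cv2.imwrite("merged_hdr.jpg", result)
```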

 San Casciano Val di Pesa, Tuscany, Italy. (Roberto Pieraccini)