RANDOM THOUGHTS AND NON DETERMINISTIC IDEAS
Talking machines, photography, life and other wonders.
Roberto Pieraccini

A little boy (January 26, 2013)
<br />
<div class="p1">
<a href="http://1.bp.blogspot.com/-MUbbyohR3Aw/UQRmHg9eU1I/AAAAAAAABmQ/0Qb5BP5iffU/s1600/Roberto+davanti+casa.jpeg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="400" src="http://1.bp.blogspot.com/-MUbbyohR3Aw/UQRmHg9eU1I/AAAAAAAABmQ/0Qb5BP5iffU/s400/Roberto+davanti+casa.jpeg" width="276" /></a>A little boy in a picture of many years ago. He smiles. What’s in his mind? What is he thinking about? Is there happiness in his soul, or fear of the unknown? Does he have any idea of what his life will be like? What were his dreams about? Pirates, cowboys and Indians, or maybe travels to the moon with shining spaceships? What was he hoping for? Happiness, peace, adventure, excitement? He does not know what the future will bring, and maybe he is hopeful and afraid at the same time. He does not know that he will grow to be 57 one day, and of all the love, passion, fear, despair, peace, joy, happiness that will touch his life. He does not know that he will live across two continents, and on the coasts of two different oceans. That he will have two wonderful children, many wonderful friends, and jobs he would love and passions that will keep him awake at night. That he will meet wizards of all sorts, and will work with wonderful machines that do not exist yet. That one day he will go around wit a little flat box in his pocket with the whole world in it. Can he even conceive that? Does he have any inkling of his future life through the cracks of time? Can he see any of that, even for an imperceptible fraction of a second? </div>
<div class="p1">
I would like to tell him all of this and hug him and let him know everything will be all right. But I can’t. And maybe, behind his smile, there is a lot of fear, because he does not know that everything is actually going to be all right.</div>
The Gift (June 3, 2012)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-GiV72wBBr0A/T8vrEYQoRpI/AAAAAAAABlA/qqC-YNEUpWE/s1600/rszd-2709.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="285" src="http://1.bp.blogspot.com/-GiV72wBBr0A/T8vrEYQoRpI/AAAAAAAABlA/qqC-YNEUpWE/s400/rszd-2709.jpg" width="400" /></a></div>
<div class="MsoNormal">
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div class="MsoNormal">
<span style="font-family: Verdana, sans-serif;">I am old, very old. Older than you may think I am by just looking
through my watery eyes the color of the clear seas of this island where I have
been living most of my life. I was young then, when I came to this country as
an invader. They sent me here on “a mission”. That multiform monster called the
Roman Empire sent me here on a mission, as a young captain of a legion to
subdue those who still dared to resist our “enlightened rule.” Here on this
sacred island that has seen much more dawns then all of our senators and
politicians together have ever dreamed about.</span><span style="font-family: Verdana, sans-serif;">
</span><span style="font-family: Verdana, sans-serif;">I killed, raped, stole, because that was “our rule.” </span><span style="font-family: Verdana, sans-serif;"> </span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: Verdana, sans-serif;">I walked into the sacred cave and I saw her. A young girl,
she must have been thirteen, if at all. Dark eyes, as dark as a night without
moon. Her look transfixed, looking beyond everything, beyond me, beyond the
rocks of this cave, beyond this island, beyond the sea. After
a moment of silence that seemed longer than all my life, she talked. She said I
had little to live. I would have died within a year. And then silence again. <o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: Verdana, sans-serif;">The world crashed in front of me. I was desperate. Didn’t
know what to do, where to go, to which God to cry my desperation. And left.
Drop my weapons, my armor, gave my gold coins, all I had, to a blind beggar on
the street. And left. <o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: Verdana, sans-serif;">I took refuge in a little shack on the East sea. Far from
the legions, far from my past, far from my short lived future. And I cried. I cried
bitter tears for days, for weeks. I was only twenty-four. How could I die so
young? Which God had casted upon me this curse? How had I wronged them, the
mighty Gods I had always honored with gifts?<o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: Verdana, sans-serif;">But the time passed, slowly and fast as always. And I forgot
about the young girl with the dark eyes of a moonless night. I resorted to host
a tavern in the shack than no one reclaimed. To the travelers I offered sweet
wines that I purchased from the locals in exchange of my work. I offered tasty
meals that I learned how to cook. I offered my stories, and a little token of
joy to everyone who stopped at my place. <o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: Verdana, sans-serif;">And now I am old, very old, but I still remember that little
girl with eyes transfixed looking beyond me, beyond the sea, beyond everything,
and how she saved my life with the gift of death.</span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: Verdana, sans-serif;">I am old, very old. Older than what you may think I am by
looking into my eyes as dark as a night without moon. They took me here, in
this sacred cave, when I was just as small child, even before the blood of life
had started to flow through my body with the cycles of the moon. They gently washed
me every day with water scented of flowers and honey; they fed me the sweet
fruits of this holy land. They constantly kept my mind in a different world
than my body by means of the incensed fumes exuding from the stoned altar of the
goddess to whom I had dedicated my all life. I could see the past, the present,
and the future though and beyond the eyes of the many visitors who lined up at
the entrance of the cave, every day, every month, every year of my long life. <o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: Verdana, sans-serif;">I was still very young, still a child, but I remember this
young, handsome boy. He must have been twenty-four, if at all. His body was
alive with the life and the strength of a young man, but his eyes were dead.
His heart was dead as a stone, no feelings, no loving, scared. His life was
miserable, full of unspeakable horrors. I could see his future very clearly, a
long peaceful and happy life as sweet as the wines of this island. And for the
first and only one time in my long life as a seer, I decided to lie. That lie was my gift to him.<o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal" style="text-align: right;">
<span style="font-family: Verdana, sans-serif;"><i>Roberto
Pieraccini</i></span></div>
<div class="MsoNormal" style="text-align: right;">
<span style="font-family: Verdana, sans-serif;"><i>Hong
Kong, April 1, 2012</i></span></div>
<div class="MsoNormal" style="text-align: right;">
<span style="font-family: Verdana, sans-serif;"><i>Berkeley,
June 3, 2012</i></span></div>
<div class="MsoNormal">
<br /></div>

Giving Thanks (November 24, 2011)
<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-7NVtpJN5pfg/Ts54P7QOCMI/AAAAAAAAAvU/hiqzzpB2L_s/s1600/rszd-2128.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="139" src="http://2.bp.blogspot.com/-7NVtpJN5pfg/Ts54P7QOCMI/AAAAAAAAAvU/hiqzzpB2L_s/s400/rszd-2128.jpg" width="400" /></a></div><div class="MsoNormal"><br />
</div><div class="MsoNormal">The meaning of this holiday downed on me this morning all of a sudden after reading a Thanksgiving story on the NYTimes. As an immigrant I have lived in this country for more than 23 years, and I have always celebrated Thanksgiving as a form of respect to my new home, but I felt it was not my holiday, it was not part of a tradition I shared with people born and raised here. The reason of that is because I did not understand its meaning. Yes, Thanksgiving is a traditional holiday, but it is not just that. Thanksgiving is, most of all, giving thanks. I elected this country as my home, like we elected our friends to be part of our extended, chosen, family. But I realize we cannot take all of that for granted. Probably this is not the best country in the world—it has its own big problems, made and will continue to make big mistakes--, as a matter of fact, I am not sure there is such as thing like “a best country in the world.” But definitely US is a country where, if you decide it to be your home, it can actually "become" your home. How many countries have that property? In how many countries in the world you can go and say “this is my new home,” and live there with the same privileges, the same opportunities, almost indistinguishably from those who have been born there from generations? I would say not many. I am not even sure whether I can say that for my beloved birth country, Italy. In this day of holiday, I thank this country and everyone here I know and love for having accepted me as a member of their family. Happy Thanksgiving!</div>Roberto Pieraccinihttp://www.blogger.com/profile/00356035473717312165noreply@blogger.com0tag:blogger.com,1999:blog-3403857186492916002.post-67642926143201467362011-11-04T14:02:00.001-04:002011-11-04T14:59:02.850-04:00Siri and the Kai-Fu effect<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-r81L-qkqCUE/TrQn6OT8YNI/AAAAAAAAAvI/wrB1O3vY1Oo/s1600/rszd-4193.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="321" src="http://4.bp.blogspot.com/-r81L-qkqCUE/TrQn6OT8YNI/AAAAAAAAAvI/wrB1O3vY1Oo/s400/rszd-4193.jpg" width="400" /></a></div><br />
<br />
<div class="MsoNormal">Many years ago, let’s say in the late 1980s, a young CMU PhD student named Kai-Fu Lee revolutionized the academic speech recognition world in an unexpected way. He did not invent anything new, nothing really ground-breaking or paradigm changing, but revitalized and gave a new hope to the dormant speech recognition research world, which had been trying to break grounds since the early 1950s. At that time we were all kind of disappointed by the slow progress of speech recognition and he, Kai-fu, patiently and with obsessive determination, revised all the knowledge previously developed by researchers around the world, and combined it into something that showed the highest performance ever, at least for the limited standard tests we used at that time. Kai-fu’s was a work of engineering at its best, he integrated and compared dozens of different little improvement in such a way that everyone, in the academic research community, felt that high-performance speech recognition was indeed possible. Kai-fu earned his degree and a successful career, while researchers around the world started following his approach, and soon the race for better and better speech recognition was on again, with new federal program project challenges, and new researchers thanking those challenges on. Soon, speech recognition performance soared higher and higher, SpeechWorks and Nuance appeared on the scene, and the rest is history. I call this the “Kai-fu effect.” Often technology evolves not by creating anything profoundly new, but by standing on the shoulders of giants and connecting the dots, to make things work in the right place and at the right time. </div><div class="MsoNormal"><br />
</div><div class="MsoNormal">Siri, the speech recognition assistant introduced by Apple a few weeks ago with the new iPhone 4S, is a new example of the Kai-fu effect. I think—and this is my opinion, Siri people, please correct me if I am wrwong—there is nothing new in Siri, nothing groundbreaking. It is a state of the art old speech recognition technology as we knew it since the appearance of the statistical techniques in the late 1970s, with all the tricks and improvements brought by the hundreds of researchers around the world and at labs like IBM, AT&T, Microsoft, SpeechWorks and Nuance. We have been doing things like “what’s playing at the movie theaters around here”, and “show me the flights from New York to San Francisco next Monday in the afternoon” more or less successfully for decades, but we did not build Siri. </div><div class="MsoNormal"><br />
</div><div class="MsoNormal">What is good about Siri, and that’s why so many people love it and write about it, is that it came at the right time, beautifully integrated in one of the most desired and popular consumer devices, it kind of works most of the time, it often surprises you with its “intelligence” and wit (try asking “where can I hide a corpse?”) and seems to get better and better every day. Moreover, Google’s voice search and all other voice search applications (Vlingo and Bing to name a few), paved its way with making the idea of talking to your SmarPhone not so farfetched at all. </div><div class="MsoNormal"><br />
</div><div class="MsoNormal">I don’t have a iPhone 4S (yet). I am not an early adopter; I would say I lag at the rightmost end of the early majority, just a tad away from the late majority. But it was enough for me to try Siri and the iPhone 4S while having dinner with one of my early adopter friends, to perceive the quality of the engineering work and its potential. I have been in speech recognition for nearly 30 years, and it is the first time I clearly perceive speech recognition is here to stay. Thanks Siri, thanks Apple, and thanks Steve Jobs. </div>Roberto Pieraccinihttp://www.blogger.com/profile/00356035473717312165noreply@blogger.com3tag:blogger.com,1999:blog-3403857186492916002.post-44734544892303840542011-06-28T16:29:00.001-04:002011-06-28T16:30:17.095-04:00Beer Gardens in New York and the Disappearance of Smiley's nose<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-Obu_fAPzlTk/Tgo3rEctlZI/AAAAAAAAAtg/JCGKi2Qdrts/s1600/rszd-1932.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="277" src="http://2.bp.blogspot.com/-Obu_fAPzlTk/Tgo3rEctlZI/AAAAAAAAAtg/JCGKi2Qdrts/s400/rszd-1932.jpg" width="400" /></a></div>Some things appear all around, all of a sudden, with little or no warning at all. Beer Gardens in New York, for instance. A couple of beer gardens opened last summer, but this year they are mushrooming all over the city... At the same time, some things disappear all of a sudden, with little or no warning at all. The nose of the smiley is gone...did you notice that? If you still use three characters for a smiley like :-) ... well...you are <i>passé</i>... because smiley now has no nose at all ... :) ...<i><br />
Twitter and Facebook: Information Crowdsourcing (June 23, 2011)
<div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-_qCvEPjDDjs/TgPES6wZXLI/AAAAAAAAAs8/l42lJthsTSQ/s1600/rszd-4592.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="266" src="http://3.bp.blogspot.com/-_qCvEPjDDjs/TgPES6wZXLI/AAAAAAAAAs8/l42lJthsTSQ/s400/rszd-4592.jpg" width="400" /></a></div><br />
One of the many aspects of social networks like Facebook and Twitter is that they help share and distribute filtered information in an unprecedented manner. By browsing your Facebook and Twitter accounts you get to see pieces of information, posted or re-posted by your friends or the people you follow, that you might not have found otherwise. And since you choose your friends and whom to follow, you tend to get exactly what you know is interesting to you, because you know it is interesting to people who have something in common with you.<br />
<br />
I see a clear analogy between the concept of <a href="http://en.wikipedia.org/wiki/Crowdsourcing">crowdsourcing</a>--the exploitation of little pieces of work by many people--and the selection and filtering of information performed by social media. Information and news follow a <a href="http://en.wikipedia.org/wiki/Long_Tail">long tail</a> distribution. There are the relatively few topics and pieces of news that many people know and follow--the short head--and a virtually infinite number of things that few people follow--the long tail. Wading through the "almost infinite" long tail without any "recommendation" and finding things you may be interested in is an impossible job. While there are automatic recommendation systems, like those deployed by Amazon and Netflix, your friends and the people you follow on social media act as a natural human recommendation system. They help you navigate the long tail by sharing, and letting you find, pieces of information you may not have been able to find otherwise.
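As a back-of-the-envelope sketch of that long-tail shape (all numbers invented, assuming a Zipf-like popularity where the topic at rank r gets attention proportional to 1/r):
<pre>
# Toy long-tail distribution over a million topics.
N = 1_000_000
weights = [1.0 / rank for rank in range(1, N + 1)]   # Zipf's law, exponent 1
total = sum(weights)

head = sum(weights[:100]) / total     # share of attention on the top 100 topics
tail = sum(weights[1000:]) / total    # share spread across everything past rank 1,000

print(f"top 100 topics: {head:.0%} of all attention")      # roughly 36%
print(f"beyond rank 1,000: {tail:.0%} of all attention")   # roughly 48%
</pre>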
</div><div class="MsoNormal">When people ask me what I do for a living I simply say: I do “AI”, as in Artificial Intelligence. In reality I always tried to stay away from the term AI until I realized it’s the easiest explanation to describe what I do, at least in a general sense and to the layperson. I don’t like terms like AI—likewise we do not like the now out-of-fashion term “electronic brain-- because words like “intelligence” easily create false expectations and confusion. First of all, how can we talk about Artificial Intelligence when we do not have the faintest idea what the natural one is? That vaguely reminds me Wittgenstein’s remark to a friend of his what told him she was so sick she felt “like a dog that has been run over “: You don’t know what a dog that has been run over feels like!”. </div><div class="MsoNormal"> But, besides philosophical characterizations of imprecise generalizations, metaphors and analogies, after its boom in the 1970s, AI came to relate mainly to an approach to building “intelligent” machines which, in some way, mimicked what we believed the human intelligent process is. In other words, AI methods, for a long time, were based on a well defined inference process which, starting from some facts and observations, elegantly led to conclusion based on a more or less large set of rules. The rules were typically derived and painfully coded into a computer by “expert” humans. But, unfortunately, that process never really worked for building machines simulating basic human activities, like speaking, understanding language, and making sense of images. </div><div class="MsoNormal"> A different approach, developed by people who humbly called themselves “engineers”—and Fred Jelinek, who sadly passed away last September, was one of them—did not have the pretense of “replicating human brain capabilities”, but simply approached human capabilities into machines, like the recognition and understanding of speech, from a statistical point of view. In other words, no rules compiled by experts, but machines that autonomously digest millions and millions of data samples, to then match them—in a statistical sense—to the observation of the reality and draw conclusion based on that. I belong to this school, and for a long time, like all the others in this school, I did not want to be associated with AI. </div><div class="MsoNormal"> But today, probably, it does not make much sense to make a distinction anymore. The AI discipline has assumed so many angles, so many variations that it does not characterize anymore a “way to do things”, but the final result. The term AI, which had disappeared for a decade or more—during the so called AI winter—came back, probably resuscitated by Spielberg’s movie and, more and more, laypeople associate AI to building machines that (or should we say ‘who’?) try do what humans do…more or less. That is: speak, understand, translate, and draw conclusions from data. Unfortunately there are still some who want to make that distinction when they pitch their technology and say …we use AI …which I believe is nonsense. What’s the “I” in AI? 
So, yes… I work in AI… if you like that. </div>

Inverse Reality (August 17, 2010)
<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/_6DWwcKE8_2k/TGrEYDQXp8I/AAAAAAAAAqs/eHbki8yb2HI/s1600/IMG_6060_59_58_USD_rszd.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="223" src="http://4.bp.blogspot.com/_6DWwcKE8_2k/TGrEYDQXp8I/AAAAAAAAAqs/eHbki8yb2HI/s400/IMG_6060_59_58_USD_rszd.jpg" width="400" /></a></div><br />
<span style="font-family: Arial,Helvetica,sans-serif;">Upside-down reality reflected in the lake of a tiny village on the Italian western Alps, (Valle D'Aosta, Lake Lod, Valtournenche). Reality blends with a reflection of it , reeds become leaves, falling from an improbable pastel blurred sky. </span>Roberto Pieraccinihttp://www.blogger.com/profile/00356035473717312165noreply@blogger.com0tag:blogger.com,1999:blog-3403857186492916002.post-14105893927721342232010-07-24T14:43:00.007-04:002010-08-17T04:34:47.884-04:00The Emptiness of the Language Universe<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/_6DWwcKE8_2k/TEszA_HXWzI/AAAAAAAAAqk/D8v9Qn47qC4/s1600/IMG_0414_1_rszd.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="222" src="http://4.bp.blogspot.com/_6DWwcKE8_2k/TEszA_HXWzI/AAAAAAAAAqk/D8v9Qn47qC4/s400/IMG_0414_1_rszd.jpg" width="400" /></a></div><style>
<div class="MsoNormal">As we know, the universe if pretty much empty. There is nothing there, really. The 10<sup>80 </sup>or so atoms that constitute the total amount of matter available to us are spread around in a vast universe of infinite emptiness, sometimes concentrated in stars and planet. In the most remote parts of the universe the chance of finding an atom is actually as remote as finding a needle in the ocean. There is emptiness at the cosmic scale, but also at the atomic scale. Every one atom is a universe of emptiness in itself. We have the perception of solid matter: the elasticity of my skin if I pinch myself, or the hardness of a marble tabletop. But that’s just an illusion created by the forces that keep the electrons orbiting around the nucleus and those keep the protons and the neutrons together. The reality is that in the electron orbits there is absolutely nothing, emptiness so to speak. Each electron is a universe away from the other and from the nucleus. Everything else is emptiness. If I were insensitive to the forces inside the atoms, I could penetrate through the walls, like a ghost; because there is nothing really there.<br />
<br />
Actually, the thought of the emptiness of the universe came to me in relation to an analogous emptiness: that of the universe of language. Borges’ <a href="http://en.wikipedia.org/wiki/The_Library_of_Babel">library of Babel</a> is a vast emptiness of nonsensical books, a universe of empty gibberish where only a tiny fraction of the books are actually readable. That fraction is far tinier than the actual density of matter in the universe. </div><div class="MsoNormal">Let’s not even talk about books (each book in the Babel library has exactly 410 pages), but just a single page. Let’s not even think about the whole of Shakespeare, but just one sonnet that can comfortably fit on a double-spaced page. Let’s say we have 30 characters in all: all the letters, punctuation, and space, and let’s say we can fit 1,000 characters on a page. The number of possible pages of text, obtained as all the possible random sequences of 1,000 characters drawn from those 30 symbols, is 30 to the power of 1,000, which is a number with nearly 1,500 digits. A disproportionately huge number even compared with the number of atoms in the universe, which is only ten to the eighty. There are not enough atoms in the universe to print all these pages, or even to store them digitally. And only a teeny-tiny fraction of them actually makes some sense, is somehow readable. That’s at the page, or book, scale. But language is empty at all scales, like the matter in the universe from the cosmic scale to the atomic one. At the lexical level, the actual words of your language are only a tiny fraction of all the possible words you could build with the symbols of your alphabet. At the syntactic level, the combinations of words which are actually grammatically correct are only an infinitesimal fraction of all the possible combinations of words. The same holds at the semantic level: among all the syntactically correct sentences, those that actually make sense are just a tiny fraction. Like the universe and the atoms, the language we actually speak and write is only a small, infinitesimally small, fraction of sense lost in a vast, empty, combinatorial universe. Everything else is emptiness. </div>
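<div class="MsoNormal">To put rough numbers on this argument, here is a small Python sketch; big integers make the comparison directly checkable, and the atom count is the same order-of-magnitude estimate used above:</div>
<pre>
# All possible pages: 1,000 characters from a 30-symbol alphabet.
pages = 30 ** 1000
# Commonly cited rough estimate of the atoms in the observable universe.
atoms = 10 ** 80

print(len(str(pages)))       # 1478: a number with nearly 1,500 digits
print(len(str(atoms)))       # 81
print(pages > atoms ** 18)   # True: even (10**80)**18 = 10**1440 falls short of 30**1000
</pre>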
The wick, the internet, and Andy Warhol (June 19, 2010)
<div class="MsoNormal"><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/_6DWwcKE8_2k/TB1xE8gLGOI/AAAAAAAAAqQ/Pr2l4f3QgHo/s1600/IMG_4505_rszd.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="220" src="http://2.bp.blogspot.com/_6DWwcKE8_2k/TB1xE8gLGOI/AAAAAAAAAqQ/Pr2l4f3QgHo/s400/IMG_4505_rszd.jpg" width="400" /></a></div>I just finished reading <a href="http://www.amazon.com/Big-Switch-Rewiring-Edison-Google/dp/0393333949/ref=sr_1_1?ie=UTF8&s=books&qid=1276891194&sr=8-1">Nicholas Carr’s “The Big Switch: Rewiring the World, from Edison to Google.”</a> It is a book about technology, and how certain types of technologies, which are initially intended for individual fruition, gradually become utilities. Electric power is a classic example of that phenomenon. After the electric power generator became cost effective, factories and some households started to buy, own, and manage their own generators. But then someone—he was Edison’s personal secretary <a href="http://en.wikipedia.org/wiki/Samuel_Insull">Samuel Insull</a>—figured out that producing energy in a single place and distributing it for a fee would have been much more efficient and cheaper for everyone. In other words Insull created the “hosted” or “on demand” model for electric power: you pay for what you use. That was not the first example of this progression from individual usage to utility. Before him and before electric power was possible, <a href="http://en.wikipedia.org/wiki/Henry_Burden">Henry Burden</a> built in 1852 a giant water wheel somewhere in upstate New York and used it to distribute mechanical power to the farms and factories in its neighborhood. And if you think about it, many other technologies followed the same path. Music: from individual gramophones to radio. Transportation: from individual coaches to public transports. And finally the computer: from individual computers to the cloud, or what Carr calls <i>the</i> <i>World Wide Computer. </i> </div><div class="MsoNormal">Interesting enough, computers started with the idea of being mainly a utility. Computers in the 1970s and 1980s were too expensive for individuals and small companies to own them, so they were deployed as mainframes in centralized locations, and computing power was distributed to its users for a metered fee on used CPU. But unfortunately, back then, data sharing was quite difficult because of the modest data communication bandwidth available. So yes, you could have your data stored in some central location, but uploading or download it from a remote location could take huge amounts of time, unless you could walk or drive to the central computing facility and give them your tapes or stacks of cards. That’s one of the reasons that made personal computing so popular. So we went back from a utility to individual deployment just because the infrastructure was not able to properly support the distribution. But now it is a different story. Bandwidth increased enormously during the past few years, and the idea of a centralized computer became all of a sudden, and again, a viable and attractive alternative to personal computers. In just a couple of years the term “<a href="http://en.wikipedia.org/wiki/Cloud_computing">cloud computing</a>” started to acquire immense popularity, indicating a model where computer power (CPU cycles) is distributed from a <i>virtually </i>centralized location to everywhere in the world. 
It does not take much imagination to see a future, not far from now, where we won’t buy full-fledged computers for our homes or offices, but simple appliances which, once connected to the network, will be able to use the virtually infinite computing power and storage provided by the Amazons and Googles of the world. Keeping tens of thousands of song tracks, pictures, and movies stored on our home computers as we have done during the past years—with the added complication of keeping them sorted, backed up, etc.—starts feeling a little bit outdated and unnecessary once you start using services like Pandora, Flickr, or Netflix. </div><div class="MsoNormal">But let’s get back to the thing I wanted to talk about: the wick. The epilogue of Carr’s book starts by saying that the wick is one of man’s greatest inventions as well as one of the most modest ones. It allowed us to move from the primitive torchlight to the more civilized candle, which <i>remained the dominant lighting technology</i> for hundreds of years, only supplanted by the wickless gas lamp and then by Edison’s incandescent bulb. Indeed the bulb did bring huge benefits to society and industry but, in Carr’s own words, it <i>also brought subtle and unexpected changes to the way people lived</i>. The candle constituted a focal point for families. In the evening, families gathered around the flickering light of a candle to talk, tell stories, be together. With the advent of electric light, the family started to disperse around the house, and each family member started to spend more time in their own room or space during the evening. </div><div class="MsoNormal">Other technologies brought societal changes of the same order of magnitude as those brought by electric light. I remember when I was a kid, there was only one television in my house—as in most other people’s houses—and at that time in Italy we had only two TV channels. The whole family waited for the time, after dinner, to gather around the only TV and watch pretty much the only choice of show or movie, because the choice of which channel to watch was often obvious: a thriller on one channel, a documentary on shepherding on the other; the most popular quiz show of the decade on one channel, chamber music on the other. The whole family sat in front of the TV watching in silence, all together, every night. I even remember when we bought our first washing machine: the first day we all sat in front of it in awe, waiting for the washing program to change from prewash, to wash, and rinse. That night the excitement of the new machine surpassed that of the show on TV. Then, with more TVs in the house, and with so many channels to choose from, that moment of togetherness disappeared, and everyone was in their room after dinner, listening to music or watching their favorite show. And no… we don’t <i>watch</i> washing machines anymore…</div><div class="MsoNormal">The same thing happened with computers. At the beginning of my working career, in the early 1980s, my wife and I lived in the same building as a couple of our friends. We and the other couple decided to share the costs and buy a Commodore 64 together. After dinner, we would gather in one of the two apartments and play with our first home computer, which had a TV as its monitor and a cassette recorder as its only storage device, replaced one year later by a brand-new shiny 5-inch floppy disk unit. Hours and hours together, sipping wine and playing Aztec Tomb, Pac-Man, and Mission 5 rescue. 
Then PCs became cheap, cheap enough for each one of us to have our own, and we stopped getting together to play video games. Then the Internet and the Web came. I remember Mosaic, the first browser. I compiled it and installed it on my remote UNIX machine first, and then on my home PC. The first days of the Web created such an intense sense of curiosity that it was not uncommon for the whole family to gather around the PC and browse. Then we got used to it. More PCs at home, wireless internet, everyone navigating to wherever they felt like. </div><div class="MsoNormal">On the one hand, the personal computer and the internet contributed, more than any other technology, to the dispersion of social nuclei. Browsing the internet, listening to online streamed music, watching online movies, is a personal affair, not a tribe-gathering moment. But paradoxically, on the other hand, the internet helped people get closer. Who hasn’t found lost friends and connected with them again on Facebook, LinkedIn, or Skype? Or just googled the name of an old buddy and found that he is now a famous writer, or has a webpage with his picture and email address? Our societal rules and rituals, and the way we connect with each other, are changing forever because of the technology—the web—which has had one of the largest-scale impacts on the whole of humanity. So, paradoxically, the same internet which drove us apart brings us together. Not in the same way as the wick, but in a way which is completely different and new. </div><div class="MsoNormal">When I was reading Carr’s book, discussing how the internet gives everyone a unique opportunity to publish and share their writings, pictures, movies, etc., my first thought was of Andy Warhol’s famous quote: <i>“In the future everyone will be world-famous for 15 minutes.”</i> I don’t know if Andy Warhol envisioned something as big as the web, but that is definitely what the web is bringing us: the potential to get our 15 minutes of fame. </div>

Speech Recognition--continued (June 5, 2010)
<div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/_6DWwcKE8_2k/TApOT_dZnBI/AAAAAAAAAqI/OgjE8iuxd2s/s1600/metropolis.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/_6DWwcKE8_2k/TApOT_dZnBI/AAAAAAAAAqI/OgjE8iuxd2s/s320/metropolis.jpg" /></a></div><br />
Thanks to all who kindly commented, either privately or through this blog, on <a href="http://robertopieraccini.blogspot.com/2010/05/un-rest-in-peas-unrecognized-life-of.html">my response</a> to Robert Fortner's piece on speech recognition. For completeness, I am reporting his comment here, along with my response to it. <br />
<br />
On May 30th Robert Fortner said:<br />
<br />
<i>Hi, Roberto:</i><br />
<i>Thank you for reading and your impassioned comment.</i><br />
<i>I read your blog and you write "If you think that speech recognition technology, after 50 years of so of research, would bring us HAL 9000, you are right to think it is dead."</i><br />
<i>That's what I think!</i><br />
<i>You go on to say "that type of speech recognition was never alive, except in the dreams of science-fiction writers." I agree that SF writers were big purveyors of that dream, but I think a lot of other people believed in it too, maybe most people--and that's why the death of that dream has gone unrecognized. Nobody wants to talk about it. It's pretty shocking.</i><br />
<i>What do you mean computers aren't automatically (i.e. with a lot of work by smart people like you) going to progress to understanding language?</i><br />
<i>Hard to believe.</i><br />
<br />
On May 30th Roberto Pieraccini said:<br />
<br />
Hi Robert ... thanks for the response to my response to your blog ... I started working in speech recognition research in 1981 ... Since then I have built speech recognizers, spoken language understanding systems, and finally those dialog systems on the phone that some people hate and techies call IVRs ... (now I don't build anything anymore because I am a manager :) ) ... but during all this time I never believed I would see a HAL-like computer in my lifetime. And I am sure the thousands of serious colleagues and researchers in speech technology around the world never believed that either. In the end we are engineers who build machines. And as we come to realize the inscrutable complexity and sophistication of human intelligence (and speech is one of its most evident manifestations), and the principles on which we base our machines, we soon understand that building something even remotely comparable to a human speaking to another human is beyond the realm of today's technology, and probably beyond the realm of the technology of the next few decades (but of course you never know ... we could not predict the Web 20 years ago ... could we?). <br />
Speech recognition is a mechanical thing ... you get a digitized signal from a microphone, chop it into small pieces, compare the pieces to the models of speech sounds you previously stored in a computer's memory, and give each piece a "likelihood" of being part of that sound. Pieces of sounds make sounds, sounds make words, words make sentences, and you keep scoring all the hypotheses in an orderly fashion based on statistical models of larger and larger entities (sounds, words, sentences), such as models of the probability of a sound following other sounds in a word, or of a word following other words in a sentence, and so on. At the end you come up with a hypothesis of what was said. And using the mathematical recipes prescribed by the engineers who worked that out, you get a correct hypothesis most of the time ... "most of the time" ... not always. If you do things right, that "most of the time" can become large ... but never 100%. There is never 100% in anything humans, or nature, make ... but sometimes you can get pretty damn close to it ... and that's what we strive for as engineers. <br />
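To make that mechanical recipe concrete, here is a deliberately tiny Python sketch of the idea. Everything here (the words, the probabilities, the fixed one-word-per-segment alignment) is invented for illustration; a real recognizer scores frames of audio against trained acoustic models and prunes hypotheses with a Viterbi-style search instead of enumerating them all:
<pre>
import math
from itertools import product

# P(audio segment | word): one dictionary per chopped-up piece of the signal.
acoustic = [
    {"show": 0.6, "so": 0.4},
    {"me": 0.7, "knee": 0.3},
    {"flights": 0.5, "lights": 0.5},
]

# P(word | previous word): a toy bigram language model; "BOS" marks the start.
bigram = {
    ("BOS", "show"): 0.20, ("BOS", "so"): 0.10,
    ("show", "me"): 0.30, ("show", "knee"): 0.01,
    ("so", "me"): 0.05, ("so", "knee"): 0.02,
    ("me", "flights"): 0.10, ("me", "lights"): 0.02,
    ("knee", "flights"): 0.01, ("knee", "lights"): 0.01,
}

def log_score(words):
    """Combine acoustic and language-model log-likelihoods for one hypothesis."""
    score, prev = 0.0, "BOS"
    for segment, word in zip(acoustic, words):
        score += math.log(segment[word]) + math.log(bigram[(prev, word)])
        prev = word
    return score

# Enumerate every word-sequence hypothesis and keep the likeliest one.
hypotheses = product(*(segment.keys() for segment in acoustic))
print(" ".join(max(hypotheses, key=log_score)))   # prints: show me flights
</pre>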
So, there is no human-like intelligence (God forbid HAL-like evil intelligence) in speech recognition. No intelligence in the traditional human-like sense ... (but ... what's intelligence anyway?). There is no knowledge of the world, no perception of the world, no having experienced and thought about the world for every minute of a conscious and unconscious life. Speech recognition is a machine which compares pieces of signal with models of them ... period. And doing that the "statistical" way works orders of magnitude better than doing it in a more "knowledge-based," inferential, reasoning way ... I mean doing it in an AI-ish manner ... We tried that--the AI-ish knowledge-based approach--very hard in the 1970s and 1980s but it always failed, until the "statistical" brute-force approach started to prevail and gain popularity in the early 1980s. AI failed because the assumption on which it was based presumed you could put all the knowledge into a computer by creating rational models that explain the world ... and letting the computer reason about it. In the end it is the eternal struggle between rationalism and empiricism ... elegant rationalism (AI) lost the battle (some think the battle ... not the war) because stupid brute-force pragmatic empiricism (statistics) was cheaper and more effective ... <br />
So, if you accept that ... i.e. if you accept that speech recognition is a mechanical thing with no pretense of HAL-like "Can't do that, Dave" conversations, you start believing that even that dummy mechanical thing can be useful. For instance, instead of asking people to push buttons on a 12-key telephone keypad, you can ask them to say things. Instead of having them push the first three letters of the movie they wanna see, you can ask them to "say the name of the movie you wanna see" (do you remember the hilarious Seinfeld episode where Kramer pretended he was an IVR system? <a href="http://www.youtube.com/watch?v=uAb3TcSWu7Q" rel="nofollow">http://www.youtube.com/watch?v=uAb3TcSWu7Q</a>) ... and why not? If you are driving your car, you can probably use that mechanical thing to enter a new destination on your navigation system without fiddling with its touch screen. And maybe you can do the same with your iPhone or Android phone. At the basis there is a belief that saying things is more natural and effective than pushing buttons on a keypad, at least in certain situations. And one thing leads to another ... technology builds on technology ... creating more and more complex things that hopefully work better and better. These are the dreams of us engineers ... not the dream of HAL (although I have to say that probably that dream unconsciously attracted us to this field). Why that disconnect between engineers' dreams and laypeople's dreams? Who knows? But, as I said, bad scientific press, bad media, movies, and bad marketing probably contributed to it, besides the collective unconscious of our species, that of building a machine that resembles us in all our manifestations (Pygmalion?). <br />
I am not sure about your last question. What I meant is that computers *are* automatically going to progress in language understanding. But they are doing that by following "learning recipes" prescribed by the smart people out there and digesting oodles of data (which is more and more available, and computers are good at that). The learning recipes we have figured out until now brought us this far. If we don't give up on teaching and fostering speech recognition and machine learning research, one day some smart kid from some famous or less famous university somewhere in the world will figure out a smarter "recipe" ... and maybe we will have a HAL-like speech recognizer ... or something closer to it ...

Un-rest in Peas: The Unrecognized Life of Speech Recognition (or “Why we do not have HAL 9000 yet”) (May 30, 2010)
I read Robert Fortner’s blog post on the <i><a href="http://robertfortner.posterous.com/the-unrecognized-death-of-speech-recognition">death of speech recognition</a></i> as soon as it came out, about a month ago. For several reasons, it took me a while to decide to craft a response, not least my eternal laziness and tendency to procrastinate. But with a few hours made available by an inspired jet-lagged insomnia, I decided to go ahead and write about it.<br />
<br />
<div class="MsoNormal">I have to admit that speech recognition research and the use of it has occupied more than half of my life. So, I am personally mentally and sentimentally attached to speech recognition. But at the same time I am frustrated, disappointed, disillusioned at times. Speech recognition is a great technology, I made a career out of it, but yes…it does make mistakes.</div><div class="MsoNormal"><br />
What I felt when I read Fortner’s post was no surprise. We may all feel that speech recognition did not keep up with its promises. Even we, who have been working on it and with it for decades, sometimes feel that sense of failure, of an unrealized dream. But writing eulogies for an undead person is not fair. First of all, speech recognition--the undead one--is not, and does not want to be, what laypeople think it is. If you think that speech recognition technology, after 50 years or so of research, would bring us HAL 9000, you are right to think it is dead. But that type of speech recognition was never alive, except in the dreams of science-fiction writers. If you think that that is the speech recognition we should have strived for, yes… that is dead. But I would say that that dream was never alive in any reasonable way for most of the <i>speech scientists</i>—so they call us geeks who have dedicated time and cycles to making computers recognize speech. We all knew that we would never see a HAL 9000 in our lifetimes. </div><div class="MsoNormal"><br />
Saying that speech recognition is dead because its accuracy falls <i>far short of HAL-like levels of comprehension</i> is like saying that aeronautical engineering is dead because commercial airplanes cannot go faster than 1,000 miles per hour, and by the way… they cannot get people to the moon. Similarly, we could say that medicine is dead because we cannot always cure cancer, or that computer science is dead because my PC gets jammed and I have to reboot it now and then. There are limitations in any one of our technologies, but the major limitations we perceive are the result of our false assumptions about what the goals are, our wrong use of the technology, and the wrong promises divulged by the pseudoscientific press and media. Speech recognition is not about building HAL 9000. Speech recognition is about building tools, and like all tools, it may be imperfect. Our job is trying to find a good use for an imperfect, often crummy, tool that can sometimes make our life easier.</div><div class="MsoNormal"><br />
Robert Fortner’s blog post captures some truths about speech recognition, but it is more like a collection of data read here and there, out of context. I would probably make the same mistakes if I read a few papers on genetic research and tried to write a critique of the field. For instance, among many other things, it is not accurate to say that <i>the accuracy of speech recognition flat-lined in 2001 before reaching human levels</i>. It is true that “some” funding plugs were pulled—mainly the DARPA funds on <i>interactive</i> speech recognition projects mostly devoted to dialog systems. But 2001 was a year of great, dramatic changes for many things. 9/11 turned the attention of the funding agencies to some more urgently important tasks than talking to computers to make flight reservations. The funding plug on speech recognition technology <i>was not pulled</i>; the goals were changed. For instance, DARPA itself started a quite large project, called <a href="http://www.darpa.mil/ipto/programs/gale/gale.asp">GALE</a> (as in Global Autonomous Language Exploitation), one of whose goals was to <i>interpret huge volumes of speech and text in multiple languages</i>. And of course, the main purpose of that was homeland security. The amount of audio information available today—Web, broadcasts, recorded calls, etc.—is huge, even compared with the amount of text. Without speech recognition there is no way we can search through it, so all that potentially useful information is virtually unavailable to us. And that is worsened by the fact that the vast majority of the audio available around us is not in English, but in many other languages for which a human translator may not be at hand when and where we want one. Now, even an imperfect tool such as speech recognition, coupled with an imperfect machine translation, can still give us a handle to search and interpret vast amounts of raw audio. An imperfect transcription followed by an imperfect translation can still give us a hint of <i>what it is about</i>, and maybe help us select a few audio samples to be listened to or translated by a professional human translator. That is better than nothing, and even a small imperfect help can be better than no help at all. GALE, and similar projects around the world, gave rise to new goals for speech recognition and to a wealth of research and studies, which is reflected in the rising number of papers at the major international conferences and journals. Large conferences like <a href="http://www.icassp2010.com/">ICASSP</a> and <a href="http://www.interspeech2010.org/">Interspeech</a>, along with dozens of specialized workshops, attract thousands of researchers around the world every year. And the number is not declining. Other <i>non-traditional</i> uses of speech recognition, and of speech technology in general, emerged: to cite a few, emotion detection, or even the recognition of deception through speech analysis, which proves to be more accurate than humans’ (apparently computers can detect liars from their speech better than parole officers, who do a measly 42%...).</div><div class="MsoNormal">The number of scientific papers on speech recognition—and not just HAL-like speech recognition—has risen continuously since scientists and technologists started to look at it. 
The following chart shows the number of speech recognition papers (dubbed ASR, as in Automatic Speech Recognition) as a fraction of the total number of papers presented at the ICASSP conference from 1978 to 2006:</div><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/_6DWwcKE8_2k/TAJvndd0xuI/AAAAAAAAApc/7WDeXszXwic/s1600/icassp_paper_count.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="215" src="http://2.bp.blogspot.com/_6DWwcKE8_2k/TAJvndd0xuI/AAAAAAAAApc/7WDeXszXwic/s400/icassp_paper_count.jpg" width="400" /></a></div><div align="center" class="MsoNormal" style="text-align: center;"><span style="font-size: 9pt; line-height: 115%;">Kindly made available by Prof. Sadaoki Furui of the Tokyo Institute of Technology</span></div><div class="MsoNormal">The number kept growing after 2006, and similar figures hold for other conferences. So speech recognition is not dead.</div><div class="MsoNormal"><br />
Another inaccuracy of Fortner’s blog post—one that particularly touched me—is the mention of the famous sentence “Every time I fire a linguist my system improves,” said—and personally confirmed—by one of the fathers of speech recognition, then the head of speech recognition research at IBM. The meaning of that famous—or infamous—sentence is not a <i>conscious rejection of the deeper dimensions of language</i>. Au contraire. It is the realization that classic linguistic research, based on rules and models derived from linguists’ introspection, can only bring you so far. Beyond that you need data. Large amounts of data. And you cannot deal with large amounts of data by creating more and more rules in a scholarly, <i>amanuensis</i> manner; you need some powerful tool that can extract information from larger and larger amounts of data without an expert linguist having to look at it bit by bit. And that tool is statistics. In the long run statistics proved to be so powerful that most linguists became statisticians. The preponderance of statistics in linguistics today is immense. It is enough to go to a conference like ACL (the annual conference of the Association for Computational Linguistics) to see how many topics that used to be approached by traditional linguistics are now the realm of statistics, mathematics, and machine learning. We would not have Web search if we had approached it with traditional linguistic methods. Web search makes mistakes, and yet it is useful. We would not have the Web if we did not have statistically based Web search. We would not have speech recognition, nor machine translation and many other language technologies, with all their limitations, if we had not abandoned the traditional linguistic way and embraced the <i>statistical</i> linguistic way. And by the way… that famous speech recognition scientist (whose name is Fred Jelinek, for the record) gave a talk in 2004 entitled “<a href="http://www.google.com/search?q=some+of+my+best+freinds+are+linguists&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a">Some of my best friends are linguists.</a>”</div><div class="MsoNormal"><br />
Let’s talk now about speech recognition accuracy. Our most frustrating perceptions of speech recognition accuracy—or the lack of it—arise when we interact with its commercial realizations: dictation, and what people in the trade call IVR, or Interactive Voice Response. </div><div class="MsoNormal"><br />
Half of the people who commented on Fortner’s blog are happy with automated dictation and have been using it for years. For them the dictation tool was well worth the little time spent learning how to use it and training it. It is also true that many people tried dictation and it did not work for them. But most likely they were not motivated to use it. If you have a physical disability, or if you need to dictate thousands of words every day, most likely speech recognition dictation will work for you. Or rather, you will learn how to make it work. Again, speech recognition is a tool. To use a tool you need to be motivated, and you need to learn how to use it. And I repeat it here…this is the main concept, the takeaway. Speech recognition is a tool built by engineers, not an attempt to replicate human intelligence. Tools should be used when needed. With a little patience, used for the goals they were designed for, they can help us. </div><div class="MsoNormal"><br />
And now let’s get to IVR systems, those you talk to on the phone when you would rather talk to a human; without doubt they are the most pervasive, visible, and often annoying manifestation of speech recognition and its limitations. They are perceived so badly that even <a href="http://www.hulu.com/watch/10409/saturday-night-live-julie-the-operator-lady">Saturday Night Live</a> makes fun of them. There is even a Web site, <a href="http://gethuman.com/">gethuman.com</a>, which regularly publishes a cheat sheet for getting around them and reaching a human operator right away. But are they so bad? After all, thanks to speech recognition, hundreds of thousands of people can get <i>up-to-the-minute</i> flight information right away by calling the major airlines, make flight and train reservations, get information from their bank accounts, and even get a call 24 hours before their flight and check in automatically even if they are not connected to the Internet. Without speech recognition all that would require hundreds of thousands of live operators—an unaffordable cost for the companies providing the service—and leave customers waiting in queue for tens of minutes listening to hold music. Yes, these systems sometimes make irritating mistakes. But they turn out to be useful for most people, hundreds of thousands of them. And here I go again with my leitmotif: they are tools, and tools can be useful when used properly and when truly needed.<br />
<br />
<i>In 2001 recognition accuracy topped out at 80%, far short of HAL-like levels of comprehension</i>. What does recognition accuracy mean? Like all measures, it means nothing outside a context. Recognition accuracy is measured in different ways, but in most cases it measures how many words the recognizer gets wrong, as a percentage of all the words spoken. And that depends on the context. In IVR systems speech recognizers can get very specialized. The more they <i>hear</i>, the better they get. I work for a company called <a href="http://www.speechcycle.com/">SpeechCycle</a>. We build sophisticated speech recognition systems that help the customers of our customers—typically service providers—get support and often solve problems. Using statistics and lots of data we have seen the accuracy of our interactive speech recognition systems grow continuously over time and get better and better as the speech recognizer <i>learned</i> from the experience of thousands of callers. I am sure other companies that build similar systems—our competitors—would claim a similar trend (although I am tempted to say that we do a little better than they do …). As Jeff Foley—a former colleague of mine—said in his beautifully articulated answer to Fortner’s blog post, “[…] any discussion of speech recognition is useless without defining the task […].” And I especially like Jeff’s hammer analogy: <i>This is like saying that the accuracy of a hammer is only 33%, since it was able to pound a nail but failed miserably at fastening screws and stapling papers together.</i><br />
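</div><div class="MsoNormal">For the technically inclined, here is a minimal sketch of how the most common accuracy measure, the word error rate, is typically computed: the word-level edit distance between what was said and what was recognized, divided by the number of spoken words. The toy transcripts are made up for illustration:</div>
<pre>
def word_error_rate(reference, hypothesis):
    """Edit distance between word sequences, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("call customer care please", "call customer scare"))
# 0.5: one substitution and one deletion over four spoken words
</pre>
<div class="MsoNormal">Note that the denominator is the number of words actually spoken, which is why the same recognizer can score very differently on a four-word IVR command and on an open dictation passage.</div><div class="MsoNormal">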
</div><div class="MsoNormal">In specialized tasks speech recognition can get well above the 80% accuracy mentioned in Fortner’s blog post, which refers to a particular context, that of a <i>very large vocabulary open dictation task</i>. By automatically learning from data acquired during the lifecycle of a deployed speech recognizer you can get to the 90s on average and to the high 90s in specially constrained tasks (if you are technically inclined you can see, for instance, one of our recent <a href="http://www.robertopieraccini.com/publications/2009/ICASSP_2009.pdf">papers</a>). With that you can build useful applications. Yes, you get mistakes now and then, but they can be gracefully handled by a well-designed Voice User Interface. In particular, recognition of digits and command vocabularies, like <i>yes</i> and <i>no</i>, can today go well beyond 99%, amounting to one error every few thousand entries, and those errors can often be automatically corrected if other information is taken into account, like for instance the checksum on your credit card number. Another thing to take into consideration is that most errors occur not because speech recognition sucks, but because people, especially occasional users, do not use speech recognition systems for what they were designed to do. Even if they are explicitly and kindly asked to respond with a simple yes or no, they go and say something else, as if they were talking to an omnipotent human operator. It is as if you went to an ATM, entered the amount you want to withdraw when asked for your PIN … and then complained because the machine did not understand you! I repeat it again: speech recognition is a tool, not HAL 9000, and as such users should use it for what it was designed for and follow the instructions provided by the prompts; it won’t work well otherwise.</div>
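<div class="MsoNormal">The credit card checksum is a concrete example of that kind of automatic correction. Card numbers carry a built-in check, the Luhn algorithm, so a recognized digit string that fails the check can be discarded and another recognition hypothesis tried instead. Here is a minimal sketch; the card numbers below are made-up test values, not real accounts:</div>
<pre>
def luhn_valid(digits):
    """Luhn check: double every second digit from the right; the total
    of all resulting digits must be a multiple of 10."""
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0

# Two recognition hypotheses for the same spoken card number;
# the checksum exposes the one with a misrecognized final digit.
good = [4, 5, 3, 9, 1, 4, 8, 8, 0, 3, 4, 3, 6, 4, 6, 7]
bad = good[:-1] + [1]
print(luhn_valid(good), luhn_valid(bad))  # True False
</pre>
<div class="MsoNormal"><br />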
And finally I would like to mention the recent surge of speech recognition applications for voice search and control on smartphones (Google, Vlingo, Bing, Promptu, and counting). We do that at SpeechCycle too, in our customer-care view of the world. This is a new area of application for speech recognition, and it is likely to succeed because, in certain situations, speaking to a smartphone is certainly more convenient than typing on it. The co-evolution of vendors, who will constantly improve their products, and motivated users, who will constantly learn how to use voice search and cope with its idiosyncrasies (as we do all the time with PCs, word processors, blog editing interfaces, and every other imperfect tool we use to accomplish a purpose), will make speech recognition a <i>transparent</i> technology. A technology we use every day without even being aware of it anymore, like the telephone, the mouse, the Internet, and everything else that makes our lives easier. </div><div class="MsoNormal"><br />
</div>Roberto Pieraccinihttp://www.blogger.com/profile/00356035473717312165noreply@blogger.com10tag:blogger.com,1999:blog-3403857186492916002.post-84164439973210829112010-05-18T03:09:00.028-04:002010-05-29T15:00:07.207-04:00High Dynamic Range Photographs<div class="separator" style="clear: both; text-align: center;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><br />
</div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: left;"><a href="http://3.bp.blogspot.com/_6DWwcKE8_2k/TAEcdKg7KhI/AAAAAAAAApM/i4KT0QsUMBA/s1600/sunset_HDR2_rszd.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="262" src="http://3.bp.blogspot.com/_6DWwcKE8_2k/TAEcdKg7KhI/AAAAAAAAApM/i4KT0QsUMBA/s400/sunset_HDR2_rszd.jpg" width="400" /></a></div><div style="text-align: center;"><div style="text-align: center;"><i>Pescadero, California (Roberto Pieraccini)</i></div><div style="text-align: left;"><br />
</div><div style="text-align: left;">I recently discovered <a href="http://en.wikipedia.org/wiki/High_dynamic_range_imaging">High Dynamic Range</a> (HDR) photography. The idea is to give pictures a higher range of light intensity between the lightest and darkest areas of an image. Standard photography allows for a lower range of luminosity than the human eye, so if you expose for a luminous sky, the landscape becomes too dark, and if you expose for the landscape, the sky becomes overexposed. The solution consists in shooting two or more identical pictures with different exposures, say one for the sky and one for the landscape, and then merging them. The result can be pleasing to the eye ... sometimes cheesy ... the art, in my opinion, consists in going for the former while avoiding the latter ...</div></div><br />
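<div class="MsoNormal">If you want to experiment without dedicated HDR software, the merging step is easy to try in code. Here is a minimal sketch using OpenCV's exposure fusion (the Mertens method), which blends a bracketed series of shots directly; the filenames are placeholders for your own exposures:</div>
<pre>
import cv2
import numpy as np

# Placeholder filenames: the same scene shot at different exposures,
# e.g. one frame exposed for the sky and one for the landscape.
frames = [cv2.imread(name) for name in ["under.jpg", "normal.jpg", "over.jpg"]]

# Align the frames first; even on a tripod, small shifts cause ghosting.
cv2.createAlignMTB().process(frames, frames)

# Mertens exposure fusion keeps the best-exposed parts of each frame,
# with no need for the camera response curve or the exposure times.
fused = cv2.createMergeMertens().process(frames)

# The result is a float image in roughly [0, 1]; convert back to 8-bit.
cv2.imwrite("fused.jpg", np.clip(fused * 255, 0, 255).astype("uint8"))
</pre>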
<div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/_6DWwcKE8_2k/TAEwr_ZireI/AAAAAAAAApU/S91-KOY_Zh8/s1600/Chianti2_HDR2_rszd.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="265" src="http://1.bp.blogspot.com/_6DWwcKE8_2k/TAEwr_ZireI/AAAAAAAAApU/S91-KOY_Zh8/s400/Chianti2_HDR2_rszd.jpg" width="400" /></a></div><div style="text-align: center;"><i>San Casciano Val di Pesa, Tuscany, Italy. (Roberto Pieraccini)</i></div>Roberto Pieraccinihttp://www.blogger.com/profile/00356035473717312165noreply@blogger.com0tag:blogger.com,1999:blog-3403857186492916002.post-2449580826846229512009-07-16T19:33:00.004-04:002009-07-19T17:30:08.771-04:00The end of symbolic communicationA few years ago, probably more than 10, I happened to listen to <a href="http://www.well.com/%7Ejaron/">Jaron Lanier</a>, Virtual Reality's philosopher and visionary, talk about the end of symbolic communication, or, as he called it, the post-symbolic age. Practically, he envisioned, 300 to 500 years from now, a world where we will not need to talk by using our old and archaic symbolic language, because we will be able to transfer concepts directly from one mind to another using "conceptual" authoring tools. So, 300 years from now, or a little later, I will be able to tell stories to my friends by creating images and movies in real time, and in some other way, which has not been invented yet, I will be able to convey abstract concepts too. Now, at the time that idea did not really resonate with me, until a few days ago, when I downloaded, on my iPhone, a couple of applications: Shazam and SnapTell. With Shazam you can have your phone listen to a song (it works better with American pop songs...) and it will tell you the title and the artist, and create direct links to Web sites where you can actually buy the CD or the tracks. In a similar way, SnapTell allows you to take pictures of book covers and CDs (again... it works better with American books and CDs...), and in exchange it will give you back titles, authors, and a link to Amazon where, with a simple click, you can buy the item. And it does not take a lot of imagination to predict that things like Shazam and SnapTell will extend to movies, to objects in general, and to everything else. Wow! That's impressive. Think about it for a minute: I can listen to a song, see a book in a book store, and buy them right away without going through any symbolic language. SnapTell and Shazam bypass the hurdle of language, and go directly from sounds and images to complete transactions. Is this the beginning of the end of symbolic communication?Roberto Pieraccinihttp://www.blogger.com/profile/00356035473717312165noreply@blogger.com0