Friday, November 4, 2011

Siri and the Kai-Fu effect

Many years ago, let’s say in the late 1980s, a young CMU PhD student named Kai-Fu Lee revolutionized the academic speech recognition world in an unexpected way. He did not invent anything new, nothing really groundbreaking or paradigm-changing, but he revitalized and gave new hope to the dormant speech recognition research world, which had been trying to break new ground since the early 1950s. At that time we were all kind of disappointed by the slow progress of speech recognition, and he, Kai-Fu, patiently and with obsessive determination, reviewed all the knowledge previously developed by researchers around the world and combined it into something that showed the highest performance ever, at least on the limited standard tests we used at that time. Kai-Fu’s was a work of engineering at its best: he integrated and compared dozens of different little improvements in such a way that everyone in the academic research community felt that high-performance speech recognition was indeed possible. Kai-Fu earned his degree and went on to a successful career, while researchers around the world started following his approach, and soon the race for better and better speech recognition was on again, with new federal program challenges and new researchers taking those challenges on. Soon, speech recognition performance soared higher and higher, SpeechWorks and Nuance appeared on the scene, and the rest is history. I call this the “Kai-Fu effect.” Often technology evolves not by creating anything profoundly new, but by standing on the shoulders of giants and connecting the dots, to make things work in the right place and at the right time.

Siri, the speech recognition assistant introduced by Apple a few weeks ago with the new iPhone 4S, is a new example of the Kai-Fu effect. I think (and this is my opinion; Siri people, please correct me if I am wrong) there is nothing new in Siri, nothing groundbreaking. It is state-of-the-art but well-established speech recognition technology, as we have known it since the appearance of statistical techniques in the late 1970s, with all the tricks and improvements contributed by hundreds of researchers around the world and at labs like IBM, AT&T, Microsoft, SpeechWorks and Nuance. We have been handling requests like “what’s playing at the movie theaters around here” and “show me the flights from New York to San Francisco next Monday in the afternoon” more or less successfully for decades, but we did not build Siri.

What is good about Siri, and the reason so many people love it and write about it, is that it came at the right time, beautifully integrated into one of the most desired and popular consumer devices. It kind of works most of the time, it often surprises you with its “intelligence” and wit (try asking “where can I hide a corpse?”), and it seems to get better and better every day. Moreover, Google’s voice search and all the other voice search applications (Vlingo and Bing, to name a few) paved the way by making the idea of talking to your smartphone not so far-fetched after all.

I don’t have an iPhone 4S (yet). I am not an early adopter; I would say I lag at the rightmost end of the early majority, just a tad away from the late majority. But trying Siri on the iPhone 4S while having dinner with one of my early-adopter friends was enough for me to perceive the quality of the engineering work and its potential. I have been in speech recognition for nearly 30 years, and this is the first time I clearly perceive that speech recognition is here to stay. Thanks Siri, thanks Apple, and thanks Steve Jobs.


colin said...

funny roberto - i had the same response... there's nothing new here. to be honest, it still struggles with the same stuff that other speech systems do. try asking Siri to "call colin"... not so hot.

that said, it is beautifully integrated - but i think the biggest deal is the marketing power that apple is putting behind it. just like they trained people to tap, swipe and download apps they're now training people to talk to Siri in a way that she can understand. can you imagine if the airlines who've implemented speech rec systems had applied the same marketing muscle to educating their customers - completion rates would go through the roof!

i can't wait to see what other applications emerge now that apple has taken the trouble to "teach" the public how to do yet another "natural" thing.

Roberto Pieraccini said...

Colin -- that's a very good observation. It is not just the integration, but also the power of the marketing that Apple has exercised on their users. All of a sudden, talking to a speech recognition system is cool, and it may be quite convenient in many situations where you can get what you want at the speed of a tap. Everything seems to converge to make Siri a tipping point for the speech reco technology. I can't wait to see the evolution of this.

Roger K. Moore said...

Good analysis Roberto - indeed there are probably better ASR engines than this one, but the use of the rich localised context available from a contemporary mobile platform is inspired (and effective). Of course the true test is whether Siri can sustain long-term use by a large customer base, or whether it becomes this year's great ASR demo!