Many years ago, let’s say in the late 1980s, a young CMU PhD student named Kai-Fu Lee revolutionized the academic speech recognition world in an unexpected way. He did not invent anything new, nothing really groundbreaking or paradigm-changing, but he revitalized and gave new hope to the dormant speech recognition research world, which had been trying to break new ground since the early 1950s. At that time we were all kind of disappointed by the slow progress of speech recognition, and Kai-fu, patiently and with obsessive determination, reviewed all the knowledge previously developed by researchers around the world and combined it into something that showed the highest performance ever, at least on the limited standard tests we used at that time. Kai-fu’s was a work of engineering at its best: he integrated and compared dozens of different little improvements in such a way that everyone in the academic research community felt that high-performance speech recognition was indeed possible. Kai-fu earned his degree and a successful career, while researchers around the world started following his approach, and soon the race for better and better speech recognition was on again, with new federal program challenges and new researchers taking those challenges on. Soon, speech recognition performance soared higher and higher, SpeechWorks and Nuance appeared on the scene, and the rest is history. I call this the “Kai-fu effect.” Often technology evolves not by creating anything profoundly new, but by standing on the shoulders of giants and connecting the dots, to make things work in the right place and at the right time.
Siri, the speech recognition assistant introduced by Apple a few weeks ago with the new iPhone 4S, is a new example of the Kai-fu effect. I think (and this is my opinion; Siri people, please correct me if I am wrong) that there is nothing new in Siri, nothing groundbreaking. It is state-of-the-art but old speech recognition technology, as we have known it since the appearance of statistical techniques in the late 1970s, with all the tricks and improvements brought by hundreds of researchers around the world and at labs like IBM, AT&T, Microsoft, SpeechWorks and Nuance. We have been handling requests like “what’s playing at the movie theaters around here” and “show me the flights from New York to San Francisco next Monday in the afternoon” more or less successfully for decades, but we did not build Siri.
What is good about Siri, and the reason so many people love it and write about it, is that it came at the right time, beautifully integrated into one of the most desired and popular consumer devices. It kind of works most of the time, it often surprises you with its “intelligence” and wit (try asking “where can I hide a corpse?”), and it seems to get better and better every day. Moreover, Google’s voice search and all the other voice search applications (Vlingo and Bing, to name a few) paved the way for it by making the idea of talking to your smartphone not so far-fetched after all.
I don’t have an iPhone 4S (yet). I am not an early adopter; I would say I lag at the rightmost end of the early majority, just a tad away from the late majority. But trying Siri on the iPhone 4S while having dinner with one of my early adopter friends was enough for me to perceive the quality of the engineering work and its potential. I have been in speech recognition for nearly 30 years, and this is the first time I clearly perceive that speech recognition is here to stay. Thanks Siri, thanks Apple, and thanks Steve Jobs.