Twenty years ago, if the radio played a song you liked, you had to put your ear to the speaker and hope the DJ said the title and artist so you could go find their album at the store.
Now, with technologies like SoundHound and Shazam, consumers have gotten used to plucking that information out of thin air and immediately converting it into actionable options: buying the song or album on iTunes, or adding it to a Pandora or Spotify playlist.
This encapsulates what Katie McMahon, VP and GM of SoundHound, means by “speech-to-meaning,” a conversational pattern that makes talking to the Internet of Things natural, contextual and fast. Speech-to-meaning leapfrogs the two-step process of speech-to-text and text-to-meaning that most of today’s voice recognition technologies follow, increasing the accuracy and speed of interactions.
In a recent Matchmakers interview with Karen Webster, McMahon said that speech-to-meaning is one of the features that sets SoundHound apart from tech giants Amazon and Google. The other is the company’s open voice AI platform, Houndify – an offering 13 years in the making, which brands can white-label to voice-enable their products.
Here’s how McMahon says SoundHound is pursuing ubiquity with the Houndify platform – albeit, she contends, more quietly and steadily than its tech giant counterparts.
Most voice recognition tools listen to what a user is saying and spit out a text transcription. Then, a natural language understanding engine looks at the text to determine its meaning and intentions.
Contextual awareness, McMahon told Webster, was always going to be the most challenging part of building out voice technology, and that is what gets lost in this two-step process. Furthermore, an error in step one will carry over into step two, creating inaccuracy and latency in the response.
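To make the carry-over problem concrete, here is a toy sketch in Python. It is not SoundHound's method or any real recognizer's API: the candidate transcripts, their scores, the intent table and the function names are all invented for illustration. It only shows how a pipeline that commits to one transcript before looking at meaning can fail where a combined approach succeeds.

```python
# Toy illustration only: the transcripts, scores and intent table below are
# invented assumptions and do not reflect Houndify or any real system.

# Hypothetical recognizer output: candidate transcripts with acoustic scores.
CANDIDATES = [
    ("play the beetles", 0.51),
    ("play the beatles", 0.49),
]

# Hypothetical table of phrases the understanding step knows how to act on.
KNOWN_INTENTS = {"play the beatles": ("play_music", "The Beatles")}


def understand(text):
    """Map a transcript to an (intent, argument) pair, or None if unknown."""
    return KNOWN_INTENTS.get(text)


def two_step(candidates):
    """Step one commits to the top-scoring transcript; step two interprets it."""
    best_text, _ = max(candidates, key=lambda c: c[1])
    return understand(best_text)


def joint(candidates):
    """Keep only candidates that carry a meaning, then pick the best-scoring one."""
    interpretable = [(text, score) for text, score in candidates if understand(text)]
    if not interpretable:
        return None
    best_text, _ = max(interpretable, key=lambda c: c[1])
    return understand(best_text)


print(two_step(CANDIDATES))  # None
print(joint(CANDIDATES))     # ('play_music', 'The Beatles')
```

In the two-step version, the slightly higher-scoring but meaningless transcript wins step one, so step two has nothing to work with. The combined version lets meaning inform which transcript is chosen.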
She said that’s why SoundHound’s Houndify platform has its own automatic speech recognition program and its own natural language understanding engine that work in tandem with each other. To really understand tech, McMahon said, a company must build it in-house; otherwise it’s licensing a black box from somebody else and cannot fully own the function or experience.
Using SoundHound’s own technology, she said, Houndify can understand and respond to a compound, complex command, such as one might hear in a live conversation between two people – something like, “What is your favorite restaurant that has outdoor seating and is open past 11:00?”
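One way to picture what "understanding" a compound command involves is to treat each clause as a separate constraint that must hold at the same time. The Python sketch below is a hypothetical illustration, not Houndify's implementation: the restaurant data, the field names, and the reading of "11:00" as 11 p.m. are all assumptions, and it ignores places that close after midnight.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Restaurant:
    name: str
    outdoor_seating: bool
    closes_at: time


# Invented sample data for illustration.
RESTAURANTS = [
    Restaurant("Cafe A", True, time(22, 0)),
    Restaurant("Bistro B", True, time(23, 30)),
    Restaurant("Diner C", False, time(23, 59)),
]

# Each clause of the spoken query becomes one constraint.
# Assumption: "open past 11:00" means closing later than 11 p.m.
constraints = [
    lambda r: r.outdoor_seating,          # "has outdoor seating"
    lambda r: r.closes_at > time(23, 0),  # "is open past 11:00"
]

# A restaurant matches only if it satisfies every constraint at once.
matches = [r.name for r in RESTAURANTS if all(check(r) for check in constraints)]
print(matches)  # ['Bistro B']
```

The hard part in practice is deriving those constraints from speech in the first place; the sketch only shows what the result of that understanding might look like.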
Amazon’s Echo, said McMahon, will rightly go down in the history books as the first iconic device to which people spoke with their backs turned, similar to how Apple’s iPhone went down as the first iconic smartphone. But, like the iPhone, the Echo may be rivaled and eventually surpassed by platforms that come after it.
One of those platforms, she contends, is Houndify.
Going to Market
Google Assistant and Amazon Alexa both went to market by inviting lots of players to connect to their platforms and leverage them in new environments. By contrast, SoundHound makes Houndify available for other companies to take and build their own ecosystems around.
Instead of creating a personal assistant, which thrusts the customer into a relationship with the platform and its brand, Houndify offers a capability that can exist inside other brands’ platforms. That creates a very different relationship for consumers, and for the brands that want a voice-initiated way to reach them.
McMahon said that enabling an open platform means that SoundHound remains focused on the success of its partners rather than the proliferation of its own name and platform.
“It is really important not to have an agenda,” McMahon said. She said it’s the difference between helping a partner company build out its own presence and channeling that partner’s customers into a closed ecosystem for eCommerce, or steering them toward search and advertising.
For instance, in helping Hyundai with its in-vehicle voice system, she said it was important for the customer to feel he or she is talking to Hyundai, not SoundHound. Putting the emphasis on SoundHound disintermediates partners from their customers, which McMahon said is never acceptable.
“Our vision is to allow corporations and even small developer shops to stand on the shoulders of this hard work, whether they’re in auto, robotics, consumer electronics or mobile apps,” McMahon said. “The Internet of Things [touches] every vertical, and if companies aren’t thinking about that right now, I’ll wager that company won’t be relevant in five years – just like those who missed out on the mobile transition.”
How Natural Is Natural Speech?
In 200 years, predicts McMahon, people will be interacting with things in their homes, mobility pods and robots. They won’t be calling those technologies “Google” or “Alexa,” she said. Each one will have its own name and its own specialized function. They may not run on Houndify, but they’ll be white-labeled in some form or another.
Within the Houndify platform, there are domains of knowledge, or sub-engines. Users can ask what song is playing, or they can tap the orange button to hum a few bars of the song that’s stuck in their head, or they can go text-based with their queries. McMahon said that multimodal and specialized functions like these will be critical to the future of voice infrastructure.
Today, however, it’s still early days for the voice ecosystem. The technology isn’t quite there to deliver a perfect experience, which results in users dumbing themselves down and dulling their expectations around what they can do with a product.
Like SoundHound’s climb from music recognition to full-on voice ecosystem, getting from today’s simplistic robots to Rosie from “The Jetsons” will necessarily be a gradual process that builds on past innovations – even if it feels like the space is evolving faster every day.