The biggest imagination-capturing advances in generative artificial intelligence (AI) so far have been across text-based interfaces.
Even AI-generated images and videos are having a moment. But humans could speak to one another before they could write, draw or do much else.
So why has the oldest form of interpersonal interaction, the voice, so far failed to find the same success as its successors, when AI was designed to recreate human intelligence from the ground up?
The question arises as Apple, the most valuable company in the world, is reportedly spending millions of dollars a day building out its generative AI capabilities across several product teams. A major focus of the tech giant’s initiative is giving its intelligent voice assistant, Siri, a next-generation upgrade.
And the embedded voice tool sorely needs it. Most voice assistants today, including those from Amazon and Google, still struggle to move beyond a core set of applications like playing music, turning lights on and off, telling their owners the weather or stock prices, and relaying other information directly from a website. Even the promising area of voice-activated connected commerce has yet to be fully tapped by today’s platforms.
In a sign of the challenging times for even the most commercialized voice assistants, Amazon and Google last week (Aug. 30) announced that their respective voice assistants, Alexa and Google Assistant, can now be used simultaneously on the same device for the first time: a new line of JBL smart speakers from Harman.
What will it take for voice assistants to fulfill their promise of establishing an entirely new user interface and engagement ecosystem?
The biggest tech players appear to be betting on the catch-all capabilities of generative AI.
When voice assistants were first launched in the early-to-mid 2010s by four of the world’s five most valuable companies (Apple, Google, Microsoft and Amazon), they were greeted with a level of buzzy excitement similar to what text-based large language models (LLMs) like OpenAI’s ChatGPT are enjoying now.
But voice-activated assistants, while convenient, proved to be of limited utility, and the buzz quickly mellowed. Microsoft began to sunset its own voice assistant, Cortana, in 2019, and fully shut down the app this year as part of a broader transition to its GenAI copilot solution.
That’s why the integration of generative AI into smart voice assistants holds tremendous promise for upgrading these digital companions. In doing so, tech giants can offer enhanced conversational abilities, personalization, multilingual support and greater entertainment value.
Apple’s expanded AI budget is meant to allow Siri users to automate tasks involving multiple steps, such as asking Siri to find a stored address and then text it to a contact.
And with Siri already deployed across billions of iPhones, Apple is incentivized to make its assistant as valuable and integrated into end-users’ lives as possible.
Amazon and Google are seeking to do likewise with their smart home devices and other connected products.
PYMNTS Intelligence has found that consumers may still be hesitant about the reliability and safety of voice technology.
But as the AI-powered tools get ‘smarter’ and more available, those views could change, and the tech may become a more regular tool in everyday life.
In the report “How Consumers Want to Live in the Voice Economy,” PYMNTS Intelligence finds that over half of consumers (54%) would prefer voice technology in the future because it is faster than typing or using a touchscreen.
Enticingly, generative AI can handle multiple modes of communication simultaneously. This means that voice assistants can not only understand spoken language but also process text input and even images or gestures.
Nearly 6 in 10 consumers (58%) would use voice technology for the ability to complete tasks faster, more easily and more efficiently, and many believe that it will be less than five years until voice recognition technology is advanced enough to make speaking to voice assistants comparable to speaking with actual humans.