Apple researchers have unveiled an artificial intelligence (AI) system capable of interpreting ambiguous references and contextual cues. The system could revolutionize voice assistant interactions and potentially reshape the commerce landscape.
The system, called ReALM (Reference Resolution As Language Modeling), recasts the complex task of understanding screen-based visual references as a language modeling problem handled by large language models. It is one of a growing number of efforts to enhance AI voice communications that could boost commercial applications.
“On the one hand, if we have better, faster customer experience, there’s a lot of chatbots that just make customers angry,” AI researcher Dan Faggella, who is not affiliated with Apple, told PYMNTS. “But if in the future, we have AI systems that can helpfully and politely tackle the questions that are really quick and simple to tackle and can improve customer experience, it is quite likely to translate to loyalty and sales.”
The voice technology sector is on the rise. A PYMNTS study found notable consumer interest in voice technology: more than half of consumers (54%) expect to use it more in the future because of its speed. Additionally, 27% have interacted with voice-activated devices in the past year, and 22% of Gen Z consumers are open to spending more than $10 each month for a premium voice assistant service.
Conversely, a PYMNTS report on U.S. consumers found skepticism about whether voice AI in fast-food establishments can match human service. Only a small fraction (8%) believe voice assistants currently match human capabilities, and just 16% expect that parity within the next two years. The majority are either bracing for a longer wait or doubt that voice AI will ever be as reliable and intelligent as a human.
According to the company’s research paper published on the open-access platform arXiv, Apple’s breakthrough in natural language understanding lies in its ability to seamlessly handle pronouns and implied references in conversation. Such references have been a significant challenge for digital assistants, which struggle to process audio cues and visual context.
Apple’s ReALM project tackles this by treating reference resolution as a language modeling task, the researchers wrote. This technique allows the system to understand and respond to mentions of visual elements on a screen, integrating this skill smoothly into conversations.
The core of ReALM is an innovation that converts a screen’s visual layout into structured text, the researchers said. It identifies and locates on-screen elements, then translates these visual signals into a textual representation that captures the screen’s content and arrangement. With language model training tailored to reference resolution, Apple’s approach outperforms traditional methods, including those using OpenAI’s GPT-4.
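To make the idea of encoding a screen as text more concrete, here is a minimal, hypothetical sketch in Python. The ScreenElement fields, the row-grouping tolerance, and the encode_screen function are illustrative assumptions, not Apple’s actual ReALM encoding; the sketch only shows the general principle of ordering on-screen elements so a language model can read the layout as plain text.

```python
# Illustrative sketch only: a simplified way to turn detected on-screen
# elements into a flat textual representation that an LLM could consume.
# The element fields (text, x, y) and the row-grouping tolerance are
# assumptions for demonstration, not Apple's actual implementation.

from dataclasses import dataclass
from typing import List


@dataclass
class ScreenElement:
    text: str   # visible label of the UI element
    x: float    # left edge of the element's bounding box
    y: float    # top edge of the element's bounding box


def encode_screen(elements: List[ScreenElement], row_tolerance: float = 10.0) -> str:
    """Sort elements top-to-bottom, group elements that share a visual row,
    then emit each row left-to-right so the layout survives as plain text."""
    ordered = sorted(elements, key=lambda e: (e.y, e.x))
    rows: List[List[ScreenElement]] = []
    for element in ordered:
        if rows and abs(element.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(element)   # close enough vertically: same line
        else:
            rows.append([element])     # start a new line
    lines = [
        "\t".join(e.text for e in sorted(row, key=lambda e: e.x))
        for row in rows
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    screen = [
        ScreenElement("Contact Us", 40, 12),
        ScreenElement("Call 555-0142", 40, 80),
        ScreenElement("Email support@example.com", 200, 82),
    ]
    print(encode_screen(screen))
    # A request such as "call that number" could then be resolved against
    # the "Call 555-0142" span in this textual screen dump.
```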
Apple’s new solution could solve the context problem for voice communications. Daniel Ziv, vice president, Experience Management and Analytics, GTM Strategy at Verint Systems, told PYMNTS that understanding context is critical.
Spoken conversations typically contain pauses, filler words such as “um,” and other distractions that can obscure context. Humans also draw on a great deal of background information from outside the conversation itself to fully understand context. Together, these factors make it difficult for AI to separate meaningful words and context from the noise in a conversation.
“Today, generative AI has become much better at understanding context than previous AI models,” he said. “Generative AI can effectively summarize and then identify key issues within voice conversations. Based on the extensive training, generative AI can also use additional information outside of the conversation to fill in the relevant context. This sometimes can cause hallucinations, but models are getting better.”
The biggest drawback of communicating with AI through voice is AI’s inability to be empathetic, Nikola Mrkšić, CEO and co-founder of PolyAI, an AI conversation platform for enterprise, told PYMNTS. He noted that AI struggles to replicate human empathy and emotional intelligence, which can make interactions feel cold and impersonal, especially when dealing with complex or emotional topics.
“If someone crying calls an AI-powered customer service line, the AI will treat them exactly the same as any other caller because that’s what it’s programmed to do,” he added. “Additionally, as with all technology, there are security risks associated with unsecured voice AI. Those implementing voice AI must be wholly cognizant of the technology’s limitations and recognize the likely need for appropriate safeguards.”
Apple is in talks with Google to incorporate the latter’s AI engine into the iPhone, a move that could have a major impact on the AI industry, Bloomberg News reported March 18.
Sources familiar with the matter have revealed that Apple is negotiating to license Google’s Gemini AI models to enhance new iPhone software features scheduled for release this year. Additionally, Apple has recently engaged in discussions with OpenAI and considered using its AI model.
The potential deal would provide Gemini access to billions of users, but it may also indicate that Apple is lagging in its AI development, as noted in the Bloomberg report. Furthermore, a partnership between the two tech giants could attract increased scrutiny from antitrust regulators.
Last year, PYMNTS reported on Apple’s more subdued approach to AI compared to its counterparts, Google and Microsoft, despite the company’s enthusiasm for the technology. CEO Tim Cook has stated that AI and machine learning are “virtually embedded in every product,” but the company is implementing AI in a “very thoughtful manner.”