PYMNTS MonitorEdge May 2024

OpenAI, Google Double Down on Visuals With Multimodal AI

In the cutthroat world of artificial intelligence, tech behemoths are betting big on a new frontier: multimodal AI.

As the shine of text-based chatbots dims, companies are gambling that the future belongs to AI assistants capable of seeing, hearing and conversing with users more naturally and intuitively. The battle for AI dominance has taken on a new dimension, with “multimodal” emerging as the latest buzzword.

“The usefulness of multimodal AI comes from its ability to simultaneously process and analyze diverse types of data, such as text, images, audio and video,” ComplyControl Chief AI Officer Mikhail Dunaev told PYMNTS.

This multifaceted approach mirrors human cognition, amplifying the AI’s versatility and enabling it to tackle a broader range of tasks with human-like proficiency.

Tech Titans Enter the Fray

OpenAI fired the opening salvo May 13 with the unveiling of GPT-4o, the “o” a nod to the model’s “omni” capabilities across text, audio and vision. In a demo eerily reminiscent of the sci-fi film “Her,” ChatGPT analyzed a math problem through a phone camera while an OpenAI staff member verbally asked for guidance. The seamless integration of video and audio processing, now rolling out to ChatGPT Plus users, marked a leap forward.
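For developers, combining an image and a spoken-style question in a single request follows the pattern OpenAI has published for its chat-completions endpoint. The sketch below only builds the request payload (it is not sent, and the image URL is a placeholder); actually calling the model would require the official `openai` client and an API key.

```python
# Illustrative sketch: constructing a multimodal request in the shape of
# OpenAI's chat-completions format, where one user message carries both
# text and an image. The URL below is a placeholder, not a real asset.

def build_multimodal_request(question: str, image_url: str) -> dict:
    """Bundle a text question and an image into one user message."""
    return {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

request = build_multimodal_request(
    "Walk me through solving this equation, step by step.",
    "https://example.com/math-problem.jpg",  # placeholder image URL
)
```

The key point the demo illustrated is that both modalities travel in the same message, so the model can reason over them jointly rather than handling the image and the question as separate calls.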

Not to be outdone, Google responded swiftly. At its I/O developer conference, the company unveiled Project Astra, a next-generation AI assistant. Demonstrated on a smartphone app and smart glasses, Astra delivers on the promise Google DeepMind Co-founder and CEO Demis Hassabis made for Gemini last December.

Astra responded to spoken commands, identifying objects and scenes via the devices’ cameras and engaging in natural language conversations. It identified a computer speaker, recognized a London neighborhood through an office window, read and analyzed code from a screen, composed a limerick about pencils, and even remembered where a pair of glasses were left.

Jure Leskovec, AI expert and co-founder of Kumo AI, told PYMNTS that multimodal AI plays a critical role in solving practical problems.

“Multimodal AI is useful because many real-world problems are multimodal, and AI needs to reason across multiple data points that come in different data modalities to make the correct decision, prediction or inference,” he said.

He cited medical diagnosis as a prime example, where accurate assessment requires reasoning over multimodal data such as medical imaging, EKG, electronic health records and clinical notes.

The implications for commerce are far-reaching and profound. Raghu Ravinutala, CEO and co-founder, told PYMNTS that multimodal AI can eliminate friction between a user’s request and the AI’s output, streamlining the process for all parties involved.

“Multimodality opens pathways to targeting a wider, more diverse customer base,” Ravinutala said, noting that it facilitates dynamic engagement across voice and imagery.

The Multimodal Revolution Begins

Experts say multimodal AI enhances productivity within companies by empowering employees to communicate ideas effectively through various media. This leads to swifter problem-solving, improved decision-making and streamlined information exchange. Additionally, it revolutionizes training and onboarding processes by enabling active engagement with multimedia content, accelerating the learning curve for new hires.

Wonderslide CEO Renat Abyasov told PYMNTS that multimodal systems are already being deployed in eCommerce to augment search capabilities. Because these systems understand both text and images, they are invaluable for indexing and interpreting data, saving companies time and money. For users, that translates to a more seamless search experience on major online marketplaces like Amazon or Facebook Marketplace.
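One way such a search system can work is to score each product on both its text and its image understanding, then blend the two. The sketch below is a toy illustration, not any vendor’s actual pipeline: real systems would use learned embeddings (a CLIP-style model mapping text and images into a shared vector space), whereas here simple word overlap stands in for both, with the image represented by a caption a vision model might produce.

```python
# Toy sketch of multimodal product search: rank items by a weighted blend
# of text similarity (query vs. title) and "image" similarity (query vs.
# an image caption standing in for a real image embedding).

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets -- a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def rank_products(query, catalog, text_weight=0.6, image_weight=0.4):
    """Return product titles sorted by blended text+image relevance."""
    scored = []
    for product in catalog:
        text_score = similarity(query, product["title"])
        image_score = similarity(query, product["image_caption"])
        blended = text_weight * text_score + image_weight * image_score
        scored.append((blended, product["title"]))
    return [title for _, title in sorted(scored, reverse=True)]

catalog = [
    {"title": "red canvas sneakers",
     "image_caption": "red shoes on white background"},
    {"title": "blue denim jacket",
     "image_caption": "blue jacket on hanger"},
]
```

The design point is that evidence from both modalities contributes to one ranking, so a product whose image matches the query can surface even when its title alone would not.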

Personalization and recommendation, core use cases in commerce, stand to benefit from multimodal AI.

“Having multimodal AI is crucial to understanding personal style and preferences,” Leskovec said.

In the realm of fashion, for instance, colors, materials and brands are all important, but so too are the look and feel of the product.

As the multimodal arms race intensifies, the tech world watches with bated breath to see who will emerge victorious. Will OpenAI’s head start give it an insurmountable lead, or will Google’s vast resources and expertise propel it to the forefront? One thing is sure: When multimodal AI is executed flawlessly, it offers a tantalizing glimpse into a future where science fiction becomes reality, transforming how we interact with technology and conduct business.

For all PYMNTS AI coverage, subscribe to the daily AI Newsletter.