Three Titans Unleashed Multimodal AI
Updated: Oct 5
Last year marked the debut of prominent generative AI models, including text-to-image models like Midjourney and DALL·E and text-to-text chatbots like ChatGPT. Their emergence spurred an AI craze that continues unabated. This year, Google, having noted its previous missteps and despite already possessing a potent chatbot in LaMDA, countered by announcing Bard (an offshoot of LaMDA). This new offering accepts image inputs and was unveiled at Google I/O 2023 on May 10, followed by a detailed release on the company's blog two months later. Meanwhile, OpenAI, which had been discreetly beta-testing its own variant, commenced a limited public launch of GPT-4V(ision), integrated with ChatGPT, at the month's start. Meta (formerly Facebook), not to be left behind, announced a comprehensive AI plan centered around its open-source LLaMA, boasting avatars with image-reading capabilities. With these strides by the three AI behemoths, it's safe to posit that 2023 heralds the dawn of the "multimodal" AI era.
Definitions That Matter
The notion of multimodal AI isn't new within the machine learning realm. Yet defining it has proved challenging, as highlighted by the 2021 paper "What is Multimodality?". Simple text-to-text generative AI, dubbed "unimodal", is easily grasped without an intricate definition. But the waters muddy when AI systems are meant to process input from three modalities (text, image, and voice) and to produce outputs across those same modes.
The possible pairings of input and output modalities break down as follows. For 1-to-1 mapping, there are 9 possible pairs, considering permutations. For 1-to-2 and 2-to-1 mapping, there are 18 pairs, taking into account both permutations and combinations. 1-to-3 and 3-to-1 mapping yield 9 pairs. In the case of 2-to-2 mapping, there are 3 possible pairs. For 2-to-3 and 3-to-2 mapping, there are 36 pairs, considering combinations and permutations. Lastly, 3-to-3 mapping results in 1 pair. Together, these add up to 76 possible pairs of input and output modalities.
Image from [source]
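As a cross-check on the tally above, here is a minimal sketch that enumerates the pairings by brute force. It assumes the simplest convention, which is not necessarily the one behind the figures quoted above: each side of a mapping is an unordered, non-empty set drawn from the three modalities. The names `MODALITIES`, `nonempty_subsets`, and `count_mappings` are illustrative only. Under this assumption the total comes to 49 (7 input sets times 7 output sets) rather than 76; the gap comes from the order-sensitive counting applied to some mappings in the tally above, whose exact convention is not spelled out.

```python
from collections import Counter
from itertools import combinations

# Assumption: the three modalities named in the article.
MODALITIES = ("text", "image", "voice")


def nonempty_subsets(items):
    """Return every unordered, non-empty subset of items."""
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


def count_mappings(modalities=MODALITIES):
    """Count input/output pairings, keyed by (input size, output size)."""
    subsets = nonempty_subsets(modalities)
    counts = Counter()
    for inputs in subsets:
        for outputs in subsets:
            counts[(len(inputs), len(outputs))] += 1
    return counts


counts = count_mappings()
total = sum(counts.values())

for (n_in, n_out), n in sorted(counts.items()):
    print(f"{n_in}-to-{n_out}: {n}")
print("total:", total)
```

Counting ordered selections instead would mean swapping `combinations` for `itertools.permutations`, which inflates every mapping that involves two or more modalities on either side.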
Merely assessing the input-output modality of AI models doesn't cut it. While text may appear straightforward, both images and voices carry complexity of a far greater magnitude. An image could carry symbolic significance, while a voice may convey emotional nuance or even tempo. And when one considers speech, or delves deeper into music and non-verbal sounds, the complexity becomes even more apparent.
The paper in question goes beyond input-output modes and proposes a task-oriented approach. It defines a multimodal machine learning task as one "where inputs or outputs are represented differently or comprise distinct atomic units of information." From a machine-centric perspective, modality pertains to the specific mechanisms by which information is encoded. A human-centric angle, on the other hand, emphasizes the channels through which information is communicated to humans.
Yet this human-centric definition fails to bridge the gulf between the sensory capabilities of humans and those of machines. It remains nebulous about how diverse human perceptual signals translate into specific inputs digestible by a machine learning system. And while human biology and psychology undeniably offer valuable insights for machine learning research, the discipline should cast its net wider, beyond human paradigms and perhaps to other organisms.
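The machine-centric side of the task-oriented definition discussed above can be made concrete with a small sketch. This is only one possible reading, not the paper's own formalization: it assumes a task counts as multimodal whenever its inputs and outputs, taken together, use more than one encoding. The `Signal` and `Task` types, the `is_multimodal` helper, and the encoding labels ("tokens", "pixels") are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Signal:
    """One input or output of a task (hypothetical type for illustration)."""
    name: str
    encoding: str  # how the machine represents it, e.g. "tokens" or "pixels"


@dataclass(frozen=True)
class Task:
    """A machine learning task described by its inputs and outputs."""
    name: str
    inputs: Tuple[Signal, ...]
    outputs: Tuple[Signal, ...]


def is_multimodal(task: Task) -> bool:
    """Assumed reading: multimodal if more than one encoding is involved."""
    encodings = {signal.encoding for signal in task.inputs + task.outputs}
    return len(encodings) > 1


translation = Task(
    "translation",
    inputs=(Signal("source sentence", "tokens"),),
    outputs=(Signal("target sentence", "tokens"),),
)
captioning = Task(
    "image captioning",
    inputs=(Signal("photo", "pixels"),),
    outputs=(Signal("caption", "tokens"),),
)
vqa = Task(
    "visual question answering",
    inputs=(Signal("photo", "pixels"), Signal("question", "tokens")),
    outputs=(Signal("answer", "tokens"),),
)

for task in (translation, captioning, vqa):
    print(task.name, "->", is_multimodal(task))
```

Under this reading the label attaches to the task rather than to the model, so text-to-text translation stays unimodal while captioning and visual question answering do not.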
While academic definitions provide precision, they can be too intricate for public consumption. It may be more pragmatic to lean on a definition that, though a touch nebulous, captures the essence succinctly for business and marketing purposes. OpenAI's GPT-4V(ision) system card offers just that, defining multimodal LLMs as tools that "enhance the impact of language-only systems with innovative interfaces and capabilities, allowing them to tackle new tasks and offer unique user experiences."
Mixed results from both GPT-4V(ision) (on the left) and Bard (on the right)
This OpenAI document reveals that the organization wrapped up the training of GPT-4V(ision) and GPT-4 in 2022, and by March 2023, it started rolling out early access to the system. One intriguing collaboration involved "Be My Eyes", culminating in the creation of "Be My AI" (a clever auditory nod to 'eye') aimed at offering visual aid to the visually impaired. From March to early August 2023, almost 200 visually impaired beta testers collaborated to refine the safety and usability of this tool. By September, the beta test pool had swelled to 16,000 users, clocking in an average of 25,000 description requests daily.
Microsoft's extensive evaluation, detailed in a hefty 166-page paper titled "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)", yielded mixed results. The team highlighted instances of striking performance but also noted misinterpretations, such as the miscounting of items within images. Though Google hasn't released a comparable system card, an external scholarly paper, "How Good is Google Bard’s Visual Understanding? An Empirical Study on Open Challenges", reported similarly mixed results for Bard.
Although multimodal LLMs have yet to achieve full-scale public deployment, we put their capabilities under the microscope across three platforms: 1) Google Bard, 2) ChatGPT's Advanced Data Analysis, and 3) Microsoft's Bing. Tasks such as text-to-image cannot simply lean on datasets like Visual Question Answering (VQA), as image-to-text scenarios might. Relying solely on VQA doesn't suffice; other datasets are crucial for training a model to grasp the subtle interpretations of multimodal input.
From reports by independent testers examining the true prowess of GPT-4V(ision) to OpenAI's own candid admissions in its system card, it's evident that certain limitations persist. Our own experiments suggest that Microsoft's Bing delivers the most refined answers. Bard, while adept at pinpointing the right caption, struggles with deeper contextual understanding, a realm where Bing shines. ChatGPT's Advanced Data Analysis, on the other hand, occasionally ventures into overly intricate details, often overshooting what's strictly necessary after a few iterations.
In the vast landscape of artificial intelligence, multimodal LLMs represent both our next frontier and our current limitation. The challenge isn't simply in teaching AI to process different forms of data but in enabling it to interpret and comprehend multimodal sensory data akin to human cognition. Our experiments reveal that while strides have been made, there's a chasm separating current capabilities from the human-like comprehension that many fear. This vast gulf, for now, places the daunting specter of truly sentient AI firmly in the realm of the distant future. Yet, it's this very gap that underscores the enormous potential waiting to be unlocked.
The applications of a finely tuned multimodal AI are as expansive as they are transformative. Envision computer vision glasses, reminiscent of Google Glass, supercharged with such AI capabilities: these could revolutionize Augmented Reality, offer unparalleled immersive visual experiences, or become a pivotal aid for the visually impaired. Further afield, in geopolitics, a perfected multimodal AI could radically reshape intelligence gathering and analysis. From interpreting HUMINT and SIGINT via nuanced voice analysis to decoding GEOINT from satellite imagery, the potential is staggering. One could even foresee real-time interpretation of multilingual diplomatic events, the tracking of illicit goods through integrated surveillance, or the detection of environmental changes through simultaneous audio-visual data. In essence, while we remain in the early chapters of the multimodal AI narrative, the unfolding story promises to be one of profound global impact.