Research Decoded/Gemini Team, Google (2023)

Gemini: Native Multimodal Models

Team, G., et al. (2023). Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.


In late 2023, Google introduced 'Gemini,' a family of models designed from the ground up to be 'natively multimodal.' While previous 'multimodal' models often consisted of separate vision and language components that were bolted together after training, Gemini was trained simultaneously across text, images, audio, video, and code. This allowed the model to reason across different types of information with a fluidity that mimics human perception. It was a shift from modular multimodality to a single, integrated architecture that treats all data types as first-class citizens.

The Native Multimodal Backbone

Gemini transitioned multimodal AI from modular, "bolted-on" encoders to a natively integrated architecture that treats text, images, audio, and video as first-class citizens of a single transformer-based decoder. By interleaving visual patches, audio samples, and text tokens into a unified sequence from the very start of training, the model enables cross-modal self-attention across every layer of its multi-billion parameter backbone. This structural integration allows for a fluidity of reasoning—such as explaining a physics diagram or interpreting the nuances of a live video—that had been impossible for systems that forced diverse data through a linguistic or visual bottleneck. It proved that true multimodal intelligence is not a collection of specialized senses, but an emergent property of a single, coherent reasoning engine that can attend to the raw complexity of the world in its original, multi-dimensional form.
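The idea of a unified sequence can be sketched in a few lines. The snippet below is a minimal illustration, not Gemini's actual architecture: it assumes hypothetical per-modality embedders have already mapped text tokens, image patches, and audio frames into one shared embedding width, then runs a single self-attention step over the concatenated sequence so that every position, regardless of modality, attends to every other.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared embedding width for every modality (illustrative)

# Stand-ins for per-modality embedders: each maps raw inputs into the
# same d-dimensional space so they can share one sequence.
text_tokens   = rng.normal(size=(5, d))  # 5 text tokens
image_patches = rng.normal(size=(4, d))  # 4 visual patches
audio_frames  = rng.normal(size=(3, d))  # 3 audio frames

# Native multimodality: interleave everything into ONE sequence.
seq = np.concatenate([text_tokens, image_patches, audio_frames], axis=0)

def self_attention(x):
    """Single-head self-attention: every position (any modality)
    attends to every other position in the same sequence."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

out, attn = self_attention(seq)

# Positions 0-4 are text, 5-8 image, 9-11 audio; a text position
# carries nonzero attention mass on the image patches.
print(attn[0, 5:9].sum() > 0)
```

Because the attention weights are computed over the whole mixed sequence, cross-modal interaction needs no special fusion module; it falls out of the ordinary self-attention arithmetic, which is the structural point the paragraph above makes.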

The Tokenization of Reality

The key to this integration lies in Gemini's tokenization process. For visual data, the model does not rely on a fixed-resolution encoder; instead, it uses a variable-resolution approach that preserves an image's aspect ratio and fine-grained detail. Audio is ingested at 16kHz and converted into a sequence of tokens by a learned audio encoder, while video is treated as a series of image frames interleaved with precise timestamps. This ensures that temporal dynamics and spatial relationships survive as first-class citizens in the model's context. By representing a glass breaking in a video and the word 'break' in a sentence as the same kind of unit, the model can reason across domains with a fluidity that was previously impossible. It suggested that the most effective architectures are those that can ingest the raw complexity of the world without forcing it through a linguistic bottleneck.
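The three ingestion paths above can be made concrete with a toy sketch. Everything here is assumed for illustration: the patch size, the 100 ms audio framing, and the `<ts>` timestamp marker are invented parameters, not values from the Gemini paper; only the 16 kHz sample rate comes from the source.

```python
import numpy as np

PATCH = 4             # patch side in pixels (illustrative)
AUDIO_SR = 16_000     # Gemini ingests audio at 16 kHz
FRAME_SAMPLES = 1600  # 100 ms of audio per token (assumed granularity)

def image_to_patches(img):
    """Variable-resolution patching: tile whatever H x W arrives,
    keeping the native aspect ratio instead of resizing to a square."""
    h, w = img.shape[:2]
    patches = []
    for i in range(0, h - h % PATCH, PATCH):
        for j in range(0, w - w % PATCH, PATCH):
            patches.append(img[i:i + PATCH, j:j + PATCH].ravel())
    return np.stack(patches)

def audio_to_frames(wave):
    """Chunk a 16 kHz waveform into fixed-length frames; a learned
    encoder would then embed each frame as one token."""
    n = len(wave) // FRAME_SAMPLES
    return wave[:n * FRAME_SAMPLES].reshape(n, FRAME_SAMPLES)

def video_to_stream(frames, fps):
    """Interleave per-frame patch tokens with explicit timestamps so
    temporal order survives inside the flat token sequence."""
    stream = []
    for k, frame in enumerate(frames):
        stream.append(("<ts>", k / fps))  # timestamp marker
        stream.extend(("img", p) for p in image_to_patches(frame))
    return stream

img = np.zeros((8, 12))          # non-square image: aspect ratio preserved
wave = np.zeros(AUDIO_SR)        # one second of audio
video = [np.zeros((8, 8))] * 3   # three video frames

print(len(image_to_patches(img)))        # 2 x 3 = 6 patches
print(len(audio_to_frames(wave)))        # 10 frames of 100 ms
print(video_to_stream(video, fps=1)[0])  # ('<ts>', 0.0)
```

The design point is that all three functions emit entries for one flat stream, so downstream attention never needs to know which sense a token came from.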

Scaling on TPUv4 and TPUv5e

The training of Gemini required a massive leap in infrastructure, utilizing Google's custom TPUv4 and TPUv5e accelerators across multiple data centers. To handle the scale of trillions of tokens and billions of parameters, the researchers implemented a combination of model, data, and pipeline parallelism, ensuring that the workload was distributed with minimal communication overhead. A critical technical challenge was the management of hardware reliability; at this scale, silent data corruption and chip failures are inevitable. Google developed automated recovery systems that could detect a failing unit and restore the training state to a healthy subset of the fleet within minutes. This level of 'infrastructure as code' allowed the training process to remain stable over months of continuous operation. It proved that the success of a foundation model is as much a feat of systems engineering as it is a breakthrough in machine learning logic.
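The recovery loop described above can be caricatured in a short simulation. This is a toy model under invented assumptions (failure probability, checkpoint cadence, worker counts), not Google's system: it only shows the control-flow idea of evicting failed units and rolling back to the last good checkpoint instead of aborting the run.

```python
import random

class TrainingFleet:
    """Toy model of automated failure recovery at scale: on a detected
    chip failure, restore the last checkpoint onto the surviving
    workers and continue, rather than restarting the whole job."""

    def __init__(self, n_workers, fail_prob=0.001, seed=0):
        self.workers = list(range(n_workers))
        self.fail_prob = fail_prob          # per-worker, per-step (assumed)
        self.rng = random.Random(seed)
        self.checkpoint_step = 0
        self.step = 0

    def run_step(self):
        self.step += 1
        failed = [w for w in self.workers if self.rng.random() < self.fail_prob]
        if failed:
            # Evict failed units and roll back to the last checkpoint.
            self.workers = [w for w in self.workers if w not in failed]
            self.step = self.checkpoint_step
            return f"recovered on {len(self.workers)} workers"
        if self.step % 10 == 0:
            self.checkpoint_step = self.step  # periodic checkpoint
        return "ok"

fleet = TrainingFleet(n_workers=64)
for _ in range(100):
    fleet.run_step()

# Invariant: training never runs ahead of a state it cannot restore.
print(fleet.step >= fleet.checkpoint_step)
print(len(fleet.workers) <= 64)
```

The useful invariant is the one printed at the end: progress is only ever lost back to the most recent checkpoint, which is what keeps a months-long run stable in the presence of inevitable hardware failures.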

Reasoning Beyond Text

The driving thesis behind Gemini was that native multimodality leads to superior performance on complex tasks requiring both visual and logical reasoning. Gemini Ultra became the first model to outperform human experts on the MMLU benchmark, but its real breakthrough was in cross-modal reasoning: the ability to look at a chart, understand the underlying data, and then write code to reproduce it. This revealed that a model's 'intelligence' is amplified when it can see the world through multiple lenses at once. It suggested that the future of AI lies not in building better chatbots but in building systems that can perceive and act in the world as holistically as humans do, and it raises the question of whether the next leap in intelligence will come from even larger models or from a deeper integration of sensory inputs that more closely mimics the human experience.
