Discover Flamingo, DeepMind’s visual language model

Among computer vision models, the most common training recipe is to pre-train the model on a huge amount of multimodal data from the internet and then fine-tune it for the task at hand. Fine-tuning is resource-intensive: it requires annotating thousands of task-specific data points and tuning the model on them. That changed recently, when multimodal vision-language models began to be trained with a contrastive objective. These models can be applied zero-shot to entirely new tasks, eliminating the need for fine-tuning.

Examples of Flamingo inputs and outputs. Source: research paper

These models, however, only produce a similarity score between a text and an image, so they are useful only in a limited set of cases, such as classification, where a fixed set of possible outcomes is given beforehand. They cannot generate language, which rules out tasks such as captioning and visual question answering. Visually conditioned language generation models have also been tried, but without much success. Last week, DeepMind introduced a visual language model called Flamingo that uses few-shot learning on a range of open-ended vision and language tasks, prompted with only a handful of examples. The research team tested the model on a range of multimodal tasks.
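The similarity scoring that contrastive models rely on can be sketched in a few lines. This is a minimal illustration, not code from the paper: the embeddings are made-up three-dimensional vectors, and the function names `cosine_similarity` and `zero_shot_classify` are hypothetical. A real model would produce the embeddings with its image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, label_embeddings):
    """Pick the label whose text embedding is most similar to the image."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings, invented for illustration only.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a flamingo": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.1, 0.2, 1.0],
}

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

The sketch also shows the limitation described above: the model can only rank labels it is handed in advance, and it has no way to produce a caption or an answer of its own.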

Build and train

Examples of interleaved text and visuals. Source: research paper

Flamingo was trained on text interleaved with videos and images, which lets it handle a wide range of tasks. The model can work on both open-ended tasks such as captioning and visual question answering, which require text generation, and closed-ended tasks such as classification, which involve selecting the best category or answer from a given set. Most importantly, Flamingo adapts to new tasks through few-shot learning: annotated pairs of visuals and text are interleaved and fed to the model as a prompt, without any change to the model’s weights.
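A few-shot prompt of this kind can be sketched as a simple interleaving of image placeholders and text. The helper `build_few_shot_prompt`, the `<image>` marker string, and the file names below are assumptions made for illustration; the exact prompt format used by the model may differ.

```python
def build_few_shot_prompt(support_examples, query_image, query_prefix="",
                          image_token="<image>"):
    """Interleave (image, text) support pairs and end with the query image.

    Returns the prompt text plus the list of images, where images[i]
    corresponds to the i-th occurrence of image_token in the text.
    """
    parts = []
    images = []
    for image, text in support_examples:
        images.append(image)
        parts.append(f"{image_token}{text}")
    # The query image comes last; the model is expected to continue the text.
    images.append(query_image)
    parts.append(f"{image_token}{query_prefix}")
    return " ".join(parts), images

# Hypothetical file names standing in for real images.
support = [
    ("chinchilla.jpg", "This is a chinchilla."),
    ("shiba.jpg", "This is a shiba."),
]
prompt, images = build_few_shot_prompt(support, "flamingo.jpg",
                                       query_prefix="This is")
print(prompt)
```

Nothing in this procedure touches the model itself: changing the task means changing only the support examples placed in the prompt.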

Flamingo was built on pre-trained models to avoid spending the compute needed to train a model from scratch. For the vision part, the researchers pre-trained a vision encoder with a contrastive text-image approach similar to CLIP. This encoder extracts semantic spatial features describing attributes that a visual query might ask about: color, shape, position, and nature. For the language part, the team used an existing autoregressive language model trained on a large and rich text corpus.

Flamingo model architecture. Source: research paper

The large amount of information stored in the language model’s weights gives Flamingo its strong language generation ability. The two pre-trained models are then connected via two learnable components. The weights of both pre-trained models are frozen so that their original capabilities are preserved. First, the Perceiver Resampler receives spatio-temporal features from the Vision Encoder, obtained from images or video, and outputs a fixed-size set of visual tokens. Next, these visual tokens condition the frozen language model through cross-attention layers interleaved between the pre-trained language model layers. This gives the language model a new way to absorb visual information for the next-token prediction task.
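The two connecting components can be illustrated with a heavily simplified sketch. It assumes single-head dot-product attention with no learned projections, tiny hand-written vectors, and hypothetical function names, so it shows the data flow rather than the real architecture. The tanh gate starting at zero follows the Flamingo paper's description of its gated cross-attention layers; it is not mentioned in the text above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, memory):
    """Each query returns a softmax-weighted average of the memory vectors."""
    dim = len(memory[0])
    outputs = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(logits)
        outputs.append([sum(w * m[i] for w, m in zip(weights, memory))
                        for i in range(dim)])
    return outputs

def perceiver_resample(visual_features, latent_queries):
    """Map any number of visual features to len(latent_queries) tokens."""
    return attend(latent_queries, visual_features)

def gated_cross_attention(text_hidden, visual_tokens, gate):
    """Add visual information to text states, scaled by tanh(gate).

    With gate = 0 the output equals the input, so the frozen language
    model behaves exactly as before until the gate is trained.
    """
    attended = attend(text_hidden, visual_tokens)
    g = math.tanh(gate)
    return [[h + g * a for h, a in zip(h_row, a_row)]
            for h_row, a_row in zip(text_hidden, attended)]

latents = [[1.0, 0.0], [0.0, 1.0]]                 # two learned queries (toy values)
image_features = [[1.0, 2.0], [0.5, 0.1], [2.0, 1.0]]
video_features = image_features * 4                # more features, e.g. several frames

image_tokens = perceiver_resample(image_features, latents)
video_tokens = perceiver_resample(video_features, latents)

text_hidden = [[0.5, -0.5], [0.1, 0.2], [0.3, 0.3]]
unchanged = gated_cross_attention(text_hidden, image_tokens, gate=0.0)
conditioned = gated_cross_attention(text_hidden, image_tokens, gate=1.0)
```

Both the 3-feature image and the 12-feature video come out as two visual tokens, which is what lets the language model see a fixed-size visual input regardless of how much visual data went in.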

To adapt the model to a new task, the prompt alternates visual inputs and text responses and ends with a final test image or video. Given this prompt, either output text can be sampled or the probability of each of a fixed set of completions can be evaluated. Flamingo’s ability to handle interleaved text and visuals makes it a natural fit for in-context few-shot learning, similar to GPT-3, which also used few-shot text prompting. Chinchilla, the 70-billion-parameter large language model recently released by DeepMind, was used as the base model for the largest Flamingo model.
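Evaluating a fixed set of completions amounts to summing the model's log-probabilities for each candidate's tokens and keeping the highest-scoring one. The sketch below assumes a hypothetical `next_token_log_prob(context, token)` callable in place of a real model; the toy probability table is invented for illustration.

```python
import math

def score_completion(prompt_tokens, completion_tokens, next_token_log_prob):
    """Sum log P(token | context) over the tokens of one completion."""
    context = list(prompt_tokens)
    total = 0.0
    for token in completion_tokens:
        total += next_token_log_prob(context, token)
        context.append(token)  # each token conditions the next one
    return total

def pick_completion(prompt_tokens, candidates, next_token_log_prob):
    """Return the candidate completion with the highest log-probability."""
    scores = {
        " ".join(candidate): score_completion(prompt_tokens, candidate,
                                              next_token_log_prob)
        for candidate in candidates
    }
    best = max(scores, key=scores.get)
    return best, scores

def toy_log_prob(context, token):
    """Stand-in for a model: fixed, made-up next-token probabilities."""
    table = {"flamingo": 0.7, "dog": 0.2, "car": 0.1}
    return math.log(table.get(token, 0.01))

prompt_tokens = ["<image>", "This", "is", "a"]
candidates = [["flamingo"], ["dog"], ["car"]]
best, completion_scores = pick_completion(prompt_tokens, candidates, toy_log_prob)
print(best)
```

The alternative mentioned above, sampling output text, would call the same model repeatedly to draw one token at a time rather than scoring tokens that are supplied in advance.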


Three Flamingo models were built: a 3-billion-parameter model on top of a 1.4-billion-parameter frozen language model, a 9-billion-parameter model on top of a 7-billion-parameter frozen language model, and an 80-billion-parameter model on top of the frozen 70-billion-parameter Chinchilla model. The researchers tested Flamingo on 16 tasks, where it outperformed previous few-shot learning approaches, even with as few as four examples given per task.

Flamingo engaging in a multimodal conversation and passing the Stroop test. Source: DeepMind Blog

In addition to testing Flamingo on existing benchmarks, the researchers qualitatively compared its captioning performance on images related to gender and skin color. Captions generated by Flamingo were also run through Google’s Perspective API to assess their toxicity. The study also presented qualitative examples of interesting interactive capabilities, such as the ability to “chat” with the model and ask questions about arbitrary information in input images and videos. Flamingo was found to be flexible and could potentially serve as a bridge between large language models and visual representations, a step toward general-purpose visual understanding.
