Published on November 16, 2023, 4:12 pm
Google and Google DeepMind have introduced Mirasol, a small AI model that answers questions about videos and sets new benchmark records. The difficulty in understanding video lies in integrating information from different sources such as video, audio, and text. Current AI systems struggle both with the diversity of these data streams and with the sheer volume of data.
In their study, researchers from Google and Google DeepMind present an approach that improves multimodal understanding of long-form video. The Mirasol AI model aims to address two key challenges. First, video and audio are synchronized in time, but they are not aligned with accompanying text such as titles and descriptions. Second, video and audio generate very large amounts of data, which strains the model’s capacity.
Mirasol is built from combiners and autoregressive transformer models. One model component handles the time-synchronized video and audio signals and divides the video into segments. A transformer then processes each segment and learns the relationships between the segments. In parallel, a second transformer processes the contextual text. The two components exchange information about their respective inputs.
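As a rough illustration of the segmentation step described above, the following sketch splits a time-aligned sequence of video and audio frames into consecutive segments. The function name `split_into_segments`, the string placeholders for frames, and the segment size are assumptions made for illustration; they are not taken from the researchers' implementation.

```python
from typing import List, Sequence, Tuple


def split_into_segments(
    video: Sequence, audio: Sequence, frames_per_segment: int
) -> List[Tuple[list, list]]:
    """Split time-aligned video and audio sequences into consecutive segments.

    Hypothetical helper: assumes both sequences hold one entry per time step.
    """
    if len(video) != len(audio):
        raise ValueError("video and audio must be time-aligned (same length)")
    if frames_per_segment <= 0:
        raise ValueError("frames_per_segment must be positive")

    segments = []
    for start in range(0, len(video), frames_per_segment):
        end = start + frames_per_segment
        # Each segment keeps the matching video and audio slices together.
        segments.append((list(video[start:end]), list(audio[start:end])))
    return segments


# Toy input: 128 time steps, represented by placeholder strings.
video_frames = [f"v{t}" for t in range(128)]
audio_frames = [f"a{t}" for t in range(128)]

segments = split_into_segments(video_frames, audio_frames, frames_per_segment=16)
print(len(segments))  # 8 segments of 16 frames each
```

In a real system each segment would hold feature tensors rather than strings, and a transformer would then model the sequence of segments.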
In the video-audio component, a transformation module called the Combiner extracts a joint representation from each segment and compresses the data through dimensionality reduction. The current model, which has 3 billion parameters, can process videos of 128 to 512 frames, with each segment consisting of 4 to 64 frames. By contrast, other, larger models can usually handle only 32 to 64 frames for an entire video and rely primarily on text-based transformers extended with additional modalities.
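To make the idea of compressing a segment more concrete, here is a minimal sketch in which the video and audio features of one segment are reduced to a small, fixed number of vectors. It uses plain mean pooling as a stand-in. The Combiner described in the article is a learned module, so the function `combine_segment`, the pooling scheme, and the toy feature values are all assumptions that only illustrate the reduction in sequence length.

```python
from typing import List

Vector = List[float]


def combine_segment(
    video_feats: List[Vector], audio_feats: List[Vector], num_latents: int
) -> List[Vector]:
    """Compress one segment's features into `num_latents` vectors.

    Stand-in for a learned Combiner (assumption): mean pooling over
    contiguous groups of the concatenated video and audio features.
    """
    feats = video_feats + audio_feats  # one joint sequence per segment
    if not 0 < num_latents <= len(feats):
        raise ValueError("num_latents must be between 1 and the number of features")

    dim = len(feats[0])
    latents = []
    for i in range(num_latents):
        # Split the joint sequence into num_latents nearly equal groups.
        start = i * len(feats) // num_latents
        end = (i + 1) * len(feats) // num_latents
        group = feats[start:end]
        latents.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return latents


# Toy segment: 16 video and 16 audio feature vectors of dimension 2.
video_feats = [[float(t), 1.0] for t in range(16)]
audio_feats = [[float(t), -1.0] for t in range(16)]

latents = combine_segment(video_feats, audio_feats, num_latents=4)
print(len(video_feats) + len(audio_feats), "->", len(latents))  # 32 -> 4
```

The point of the sketch is the shape change: however many frames a segment contains, the downstream transformer only sees `num_latents` vectors per segment, which is what keeps long videos tractable.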
In tests, Mirasol3B surpassed the previous best results on video question answering benchmarks while being significantly smaller than competing models. It can also process longer videos. A variant of the Combiner that incorporates memory reduced the required computing power by a further 18 percent.
In the future, Mirasol-like models could be used by chatbots such as YouTube’s recently launched AI assistant to answer questions about videos accurately, and could improve features such as automatic categorization and video chapter marking.