Google Announces Gemini Omni Models That Can Convert Any Input into Video
Gemini Omni is a New Family of AI Models That Promise Cross-Modal Generation, but Details Are Sketchy
According to Engadget, Google recently announced Gemini Omni, which appears to be a new family of AI models intended to create content in a variety of formats. The release centers on video generation. However, the company provided few technical specifics about what the models can do beyond creating video, and it gave no information on when they will become available.
Ambitious Claims of Cross-Modal Generation Capabilities
With Gemini Omni, Google is moving away from single-format models (e.g., text-to-text or image-to-image) and toward more flexible content creation. The new models are instead claimed to translate between multiple formats. Specifically, according to Engadget’s reporting, Gemini Omni models will be capable of creating video outputs; however, it is currently unknown whether other formats (text, images, etc.) will also be accepted as input, or whether the models will be able to produce output in every format.
Google appears to be using the "Anything From Any Input" tagline to promote a universal translation tool that accepts a wide array of inputs (video, text, images, etc.) and produces a similarly wide array of output formats. Without more detail about the models’ actual implementation, however, it is unclear how these cross-modal capabilities compare with competitors’ tools or with Google’s own prior Gemini releases.

