Google announces Gemini Omni models for generating video from any input

Ambitious Claims of Cross-Modal Generation Capabilities

With the Gemini Omni models, Google is moving away from single-format models (e.g., text-to-text or image-to-image) and towards more flexible content creation. The new models are instead claimed to translate between multiple formats. According to reporting by Engadget, Gemini Omni models will be able to generate video output; however, it is not yet known whether other formats (text, images, etc.) will also be accepted as input, or whether the models will be able to produce output in every format.

Google appears to be using the "Anything From Any Input" tagline to promote a universal translation tool that accepts a wide array of inputs (video, text, images, etc.) and produces a similarly wide array of output formats. Without more detail about how the models are actually implemented, however, it is unclear how these cross-modal capabilities compare with those of competitors’ tools or with Google’s own prior Gemini releases.
