AI2 Drops Molmo2: Open Video Models That Actually Know Where Things Are
The Allen Institute releases vision-language models with pixel-perfect grounding capabilities, trained on 9 million videos without using proprietary model outputs—and they're beating Gemini at its own game.

Point at any object in a video frame and ask "what happens to this next?" Molmo2 will track it through time and space with precision that, according to AI2's benchmarks, surpasses Google's Gemini 3 Pro on grounding tasks. The 8-billion parameter model achieves 38.4% F1 on video pointing compared to Gemini 1.5 Pro's 20.0%, while the entire training dataset—nine new collections covering captioning, question-answering, and tracking—is being released alongside the weights.
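To make the point-and-ask workflow concrete, here is a minimal inference sketch in the style of the original Molmo Hugging Face release. The repo ID, the prompt wording, and the exact processor calls are assumptions, not confirmed details of the Molmo2 release; the official model card may expose a different, video-native interface.

```python
# Minimal sketch: single-frame pointing with a Molmo-style checkpoint.
# "allenai/Molmo2-8B" is a placeholder repo ID; the exact processor and
# generation calls may differ in the released remote code.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "allenai/Molmo2-8B"  # placeholder: check the official model card
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, trust_remote_code=True, device_map="auto"
)

frame = Image.open("frame_0042.jpg")  # one frame sampled from the video
inputs = processor(
    images=[frame], text="Point to the red backpack.", return_tensors="pt"
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

output = model.generate(**inputs, max_new_tokens=128)
# Molmo-style models answer pointing prompts with pixel coordinates,
# e.g. <point x="61.5" y="40.4">red backpack</point>.
print(processor.tokenizer.decode(output[0], skip_special_tokens=True))
```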
AI2 appears to have closed a gap that has long plagued open-source vision models: grounding. While proprietary systems from OpenAI and Anthropic keep their training data locked away, AI2 built its datasets from scratch and avoided distilling outputs from closed models, a practice that has become standard in the open-source community but raises legal and quality concerns.
"The model achieves frontier-class performance with fewer parameters," notes The Robot Report, highlighting how the 8B version outperforms AI2's previous 72B Molmo despite being nine times smaller. The efficiency story extends to training data: Molmo2 used 9.19 million videos compared to the 72.5 million that competitors typically require, according to WaveSpeedAI's analysis.
The technical architecture, detailed in the arXiv paper, introduces several optimizations including efficient packing, message-tree encoding, and bi-directional attention. The methodological innovation may matter more. By creating seven new video datasets that avoid proprietary model outputs, AI2 sidesteps both potential legal issues and the quality degradation that comes from training on synthetic data.
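As a generic illustration of what sequence packing buys (this is not AI2's implementation), the sketch below concatenates several variable-length clips into one batch row and builds a block-diagonal attention mask, so tokens attend freely, bi-directionally, within their own clip but never across clip boundaries.

```python
# Generic sequence-packing illustration, not AI2's code: pack variable-length
# sequences into one row and build a block-diagonal mask so packed clips
# cannot attend to one another.
import numpy as np

def pack_with_block_mask(seq_lengths):
    total = sum(seq_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for length in seq_lengths:
        end = start + length
        mask[start:end, start:end] = True  # full bi-directional attention within a clip
        start = end
    return mask

# Three clips of 4, 2, and 3 tokens packed into one 9-token row.
print(pack_with_block_mask([4, 2, 3]).astype(int))
```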
Three models comprise the family: a 4B parameter version for edge deployment, the flagship 8B built on Qwen3 architecture, and a 7B variant using AI2's own Olmo base. Each targets different deployment scenarios. The 4B handles content moderation and cataloging efficiently, while the 8B tackles complex multi-frame reasoning tasks.
WaveSpeedAI has already integrated the models into their API, emphasizing low-latency inference for real-time applications. Fireworks AI hosts the 8B variant with optimized GPU deployment for enterprises. The rapid platform adoption suggests pent-up demand for open alternatives to proprietary vision APIs.
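Both hosts advertise OpenAI-compatible endpoints, so a hosted call might look like the sketch below. The Fireworks base URL follows its documented pattern, but the model identifier and the convention of sending sampled video frames as images are assumptions to verify against each provider's docs.

```python
# Hosted-inference sketch against an OpenAI-compatible endpoint.
# The model ID and frames-as-images convention are assumptions; check the
# provider's documentation for the actual Molmo2 deployment details.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_API_KEY",
)

def data_url(path):
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

frames = [data_url(f"frame_{i:04d}.jpg") for i in (0, 30, 60)]  # sampled video frames
content = [{"type": "image_url", "image_url": {"url": u}} for u in frames]
content.append({"type": "text", "text": "Track the cyclist across these frames."})

resp = client.chat.completions.create(
    model="accounts/fireworks/models/molmo2-8b",  # placeholder model ID
    messages=[{"role": "user", "content": content}],
)
print(resp.choices[0].message.content)
```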
The grounding capabilities represent a specific technical advance: spatio-temporal localization. Users can select pixels in one frame and track them across time, or point to an object and ask questions about its trajectory. On video tracking benchmarks, Molmo2 achieves a 56.2 J&F score, establishing what the alphaXiv discussion calls a new state of the art for open models.
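For readers unfamiliar with the video-segmentation convention behind that number, J&F averages a region term (the Jaccard index, or IoU, between predicted and ground-truth masks) and a boundary F-measure. The sketch below computes a simplified per-frame version, using a dilation tolerance in place of the official DAVIS contour-matching procedure.

```python
# Simplified per-frame J&F: region Jaccard (IoU) plus a boundary F-measure
# with a dilation tolerance, approximating the DAVIS protocol.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def jaccard(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def boundary(mask):
    # boundary pixels = mask minus its erosion
    return np.logical_and(mask, ~binary_erosion(mask))

def boundary_f(pred, gt, tol=2):
    bp, bg = boundary(pred), boundary(gt)
    precision = np.logical_and(bp, binary_dilation(bg, iterations=tol)).sum() / max(bp.sum(), 1)
    recall = np.logical_and(bg, binary_dilation(bp, iterations=tol)).sum() / max(bg.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def j_and_f(pred, gt):
    # pred, gt: boolean masks of the same shape for one frame
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt))
```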
Questions remain about generalization. The benchmarks where Molmo2 excels—pointing, counting, tracking—are narrow compared to the broad capabilities claimed by frontier labs. The model's performance on creative tasks, complex reasoning, or adversarial inputs hasn't been extensively documented.
AI2's emphasis on transparency extends beyond releasing weights. They've published the complete training recipe, data collection methodology, and evaluation protocols. This level of openness contrasts sharply with the black-box approach of commercial providers, though it also means competitors can immediately replicate and extend the work.
The timing matters. As video generation models proliferate and synthetic content floods platforms, the ability to precisely identify and track objects becomes critical infrastructure. Content moderation, fact-checking, and attribution all require models that can ground claims in specific visual evidence.
Video creators gain pixel-level annotation tools with no API dependencies. Enterprises can deploy grounding capabilities on-premise, avoiding data leakage. Researchers receive complete reproducibility through data, weights, and training code. The 4B model enables real-time video analysis on consumer hardware, and open datasets provide training material for specialized domain adaptation.
The real test comes next: whether the community can build on these foundations faster than proprietary labs can extend their lead. AI2 has open-sourced the picks and shovels for video understanding. What gets built with them will determine if this marks a turning point or merely narrows the gap.