AI2 Drops Molmo2: Open Video Models That Actually Know Where Things Are

January 19, 2026 | By Megaton AI

The Allen Institute releases vision-language models with pixel-perfect grounding capabilities, trained on 9 million videos without using proprietary model outputs—and they're beating Gemini at its own game.

Point at any object in a video frame and ask "what happens to this next?" Molmo2 will track it through time and space with precision that, according to AI2's benchmarks, surpasses Google's Gemini 3 Pro on grounding tasks. The 8-billion parameter model achieves 38.4% F1 on video pointing compared to Gemini 1.5 Pro's 20.0%, while the entire training dataset—nine new collections covering captioning, question-answering, and tracking—is being released alongside the weights.
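The pointing F1 in these numbers scores a model's predicted points against human-annotated ones. As an illustration only (this is not AI2's evaluation code; the 10-pixel tolerance and greedy nearest-first matching are assumptions for the sketch), a minimal scorer might look like:

```python
from math import hypot

def pointing_f1(pred, gold, tol=10.0):
    """F1 for point prediction: a predicted (x, y) point counts as a
    true positive if it lies within `tol` pixels of a not-yet-matched
    ground-truth point, matched greedily by distance."""
    unmatched = list(gold)
    tp = 0
    for px, py in pred:
        best_i, best_d = None, tol
        for i, (gx, gy) in enumerate(unmatched):
            d = hypot(px - gx, py - gy)
            if d <= best_d:
                best_i, best_d = i, d
        if best_i is not None:
            unmatched.pop(best_i)  # each gold point is matched at most once
            tp += 1
    if not pred or not gold or tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Two of three predicted points fall within tolerance of the two targets:
score = pointing_f1([(10, 10), (52, 48), (200, 200)], [(12, 11), (50, 50)])
```

Here precision is 2/3 and recall is 1, giving F1 = 0.8; a real benchmark harness would aggregate such scores over many frames and prompts.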

AI2 appears to have closed a gap that has long plagued open-source vision models: grounding. While proprietary systems from OpenAI and Anthropic keep their training data locked away, AI2 built its datasets from scratch, avoiding the distillation of outputs from closed models, a practice that has become standard in the open-source community but raises legal and quality concerns.

"The model achieves frontier-class performance with fewer parameters," notes The Robot Report, highlighting how the 8B version outperforms AI2's previous 72B Molmo despite being nine times smaller. The efficiency story extends to training data: Molmo2 used 9.19 million videos compared to the 72.5 million that competitors typically require, according to WaveSpeedAI's analysis.

The technical architecture, detailed in the arXiv paper, introduces several optimizations including efficient packing, message-tree encoding, and bi-directional attention. The methodological innovation may matter more. By creating seven new video datasets that avoid proprietary model outputs, AI2 sidesteps both potential legal issues and the quality degradation that comes from training on synthetic data.

Three models comprise the family: a 4B parameter version for edge deployment, the flagship 8B built on Qwen3 architecture, and a 7B variant using AI2's own Olmo base. Each targets different deployment scenarios. The 4B handles content moderation and cataloging efficiently, while the 8B tackles complex multi-frame reasoning tasks.

WaveSpeedAI has already integrated the models into their API, emphasizing low-latency inference for real-time applications. Fireworks AI hosts the 8B variant with optimized GPU deployment for enterprises. The rapid platform adoption suggests pent-up demand for open alternatives to proprietary vision APIs.

The grounding capabilities represent a specific technical advance: spatio-temporal localization. Users can select pixels in one frame and track them across time, or point to an object and ask questions about its trajectory. On video tracking benchmarks, Molmo2 achieves 56.2 J&F score, establishing what alphaXiv discussion calls a new state-of-the-art for open models.
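J&F is the standard video-object-segmentation metric popularized by the DAVIS benchmark: the mean of region similarity J (mask intersection-over-union per frame) and boundary accuracy F. A minimal sketch of the J half, representing masks as sets of pixel coordinates (the boundary F term is omitted for brevity):

```python
def jaccard(pred_mask, gold_mask):
    """Region similarity J: intersection-over-union of two pixel sets."""
    pred, gold = set(pred_mask), set(gold_mask)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 1.0

def mean_j(pred_frames, gold_frames):
    """Average J over the frames of a video, pairing predicted and
    ground-truth masks frame by frame."""
    scores = [jaccard(p, g) for p, g in zip(pred_frames, gold_frames)]
    return sum(scores) / len(scores)

# Masks overlap on 2 of 4 pixels in the union, so J = 0.5 for this frame:
j = jaccard({(0, 0), (0, 1), (1, 0)}, {(0, 0), (0, 1), (1, 1)})
```

Reported J&F numbers like the 56.2 above average J and F over every frame of every evaluation video.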

Questions remain about generalization. The benchmarks where Molmo2 excels—pointing, counting, tracking—are narrow compared to the broad capabilities claimed by frontier labs. The model's performance on creative tasks, complex reasoning, or adversarial inputs hasn't been extensively documented.

AI2's emphasis on transparency extends beyond releasing weights. They've published the complete training recipe, data collection methodology, and evaluation protocols. This level of openness contrasts sharply with the black-box approach of commercial providers, though it also means competitors can immediately replicate and extend the work.

The timing matters. As video generation models proliferate and synthetic content floods platforms, the ability to precisely identify and track objects becomes critical infrastructure. Content moderation, fact-checking, and attribution all require models that can ground claims in specific visual evidence.

Video creators gain pixel-level annotation tools with no API dependencies. Enterprises can deploy grounding capabilities on-premise, avoiding data leakage. Researchers receive complete reproducibility through data, weights, and training code. The 4B model enables real-time video analysis on consumer hardware, and open datasets provide training material for specialized domain adaptation.

The real test comes next: whether the community can build on these foundations faster than proprietary labs can extend their lead. AI2 has open-sourced the picks and shovels for video understanding. What gets built with them will determine if this marks a turning point or merely narrows the gap.

© 2026 Megaton, Inc. All Rights Reserved.