Should we predict pixels for world models in robotics?
Or will JEPAs eventually win?
World models for robotics
Within the robotics community, there seems to be a general consensus floating in the air: generalist policies of the future will be built on a “world modeling” recipe rather than the VLM-backbone approach that has dominated thus far.
The argument goes as follows. VLMs are not explicitly trained to predict the future, and are thus unreliable at the kinds of geometric, spatial, and physical reasoning needed to predict fine-grained consequences of actions. In contrast, a world model allows a robot to “imagine” the future in order to plan, e.g., by (1) generating a video of imagined success and using an inverse dynamics model to infer the requisite actions [1, 2, 3], or (2) directly optimizing plans against an action-conditioned world model [4, 5].
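As a concrete illustration of option (2), here is a minimal sketch of sampling-based planning against an action-conditioned world model; `world_model.predict` and `goal_score` are hypothetical placeholders rather than interfaces from the cited papers.

```python
# Minimal sketch of planning by "imagining" futures with an action-conditioned
# world model (random shooting; CEM would iteratively refine the sampling).
# `world_model` and `goal_score` are hypothetical stand-ins.
import numpy as np

def plan_next_action(world_model, goal_score, current_obs,
                     horizon=16, n_samples=256, action_dim=7):
    # Sample candidate action sequences.
    candidates = np.random.uniform(-1.0, 1.0, size=(n_samples, horizon, action_dim))
    scores = []
    for actions in candidates:
        imagined_future = world_model.predict(current_obs, actions)  # imagined rollout
        scores.append(goal_score(imagined_future))                   # how close to the goal?
    best = candidates[int(np.argmax(scores))]
    return best[0]  # execute the first action, then replan (MPC-style)
```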
Robotics in early 2026: world model = video model
As of early 2026, world modeling in robotics is dominated by video world models, i.e., generative models that are trained to predict future video frames conditioned on text (and potentially actions). These build on the tremendous progress we have seen in video modeling: diffusion-based architectures trained on internet-scale data can produce surprisingly photorealistic videos with complex physical interactions; see, e.g., videos from Veo, Cosmos, and Wan.
In just the past year, we have seen video models fine-tuned on robotics data capable of performing policy evaluation [6, 7, 8], data generation [9], and inference-time plan generation [1, 2, 3, 5]; see [10] for a recent survey. For me personally, working on Veo for policy evaluation [6] provided a huge update — seeing video model “simulations” like the one below convinced me that video models were finally poised to be useful for robotics.
Challenges with video models
Despite the promising results, all video models for robotics currently suffer from the same set of hallucinations: objects that duplicate, appear from nowhere (see example below from our Veo work), disappear, or morph either spontaneously or when re-appearing after an occlusion.
Moreover, long-horizon generation is a major challenge: current video models in robotics struggle to produce high-quality generations beyond 20-30 seconds.
Latent world models: don’t predict pixels
Intuitively, video modeling seems like an unnecessarily challenging task for a world model. Predicting pixel-level details of the motion of leaves in the background or the precise facial features of the person who will appear in my office door is surely unnecessary. Instead, we can construct a latent world model that only predicts some features of the environment. Specifically, by predicting the predictable features, we can focus representation power on things that matter rather than nuisances such as the precise appearance of an object under a particular set of lighting conditions.
This argument has been made quite compellingly by Yann LeCun for a number of years (see, e.g., his talk at our Princeton Robotics symposium). His groups at Meta and NYU have developed various forms of joint-embedding predictive architectures (JEPAs) [11], which learn latent representations of observations that can predict representations of other (e.g., future) observations. V-JEPA2 [4] demonstrates how useful video features can emerge from this kind of self-supervised learning. In addition, [4] shows how an action-conditioned version of the model can enable robot planning by optimizing action sequences at inference time.
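To make the idea concrete, here is a highly simplified JEPA-style training step in PyTorch: an online encoder embeds the current observation, a predictor maps that embedding to a predicted embedding of a future observation, and the target embedding comes from an EMA copy of the encoder (one common way to avoid collapse). This is an illustrative sketch under those assumptions, not the actual V-JEPA2 recipe.

```python
# Simplified JEPA-style update: predict the representation of a future
# observation in latent space, with an EMA target encoder. Illustrative only.
import torch
import torch.nn.functional as F

def jepa_step(encoder, target_encoder, predictor, optimizer, obs_t, obs_future, ema=0.996):
    z_t = encoder(obs_t)                       # embedding of current observation
    with torch.no_grad():
        z_future = target_encoder(obs_future)  # target embedding (no gradient)
    pred = predictor(z_t)                      # predict the future embedding
    loss = F.mse_loss(pred, z_future)          # loss lives in representation space, not pixels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Slowly track the online encoder with the target encoder (EMA update).
    with torch.no_grad():
        for p, p_target in zip(encoder.parameters(), target_encoder.parameters()):
            p_target.mul_(ema).add_(p, alpha=1.0 - ema)
    return loss.item()
```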
The argument for latent world models is particularly compelling for long-horizon tasks. Predicting pixel-level details of how the world will evolve in the next 10-20 seconds seems plausible, but doing this at the scale of minutes or hours seems both exceedingly challenging and unnecessary.
Why video models will win in the short term
Before working with video models, I was convinced by the arguments for latent world models presented above; there is a clear appeal to the minimalist approach of only predicting salient features of the world. However, I want to make the case that there are significant technical and practical benefits to video modeling that should not be underestimated. These were not obvious to me a year ago, and I’m hoping that spelling them out here will be helpful for others.
Conceptual simplicity. The task in video modeling is unambiguous: predict future frames. Similar to next-token prediction for LLMs, a clear supervisory signal can lead to good features for downstream tasks along with emergent capabilities such as object segmentation, video editing, and visual reasoning [12]. This is in stark contrast to JEPAs: the task of predicting the predictable is not fully specified and leads to representation collapse if implemented naively (the easiest way to construct a predictable embedding is to make it a constant).
Clear evaluation metrics. Hill-climbing on video models is straightforward. There are standard metrics (e.g., LPIPS or FID) that can be used to evaluate the quality of video generation. This is in contrast to JEPAs, where the loss function that one optimizes doesn’t necessarily correlate with downstream performance (although see the recent LeJepa paper [13] for some promising signs).
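For instance, computing a perceptual distance between generated and ground-truth frames takes a few lines with the `lpips` package (a common choice, though benchmarks differ in exactly which metrics they report):

```python
# Perceptual similarity (LPIPS) between a generated frame and a reference frame.
# Inputs are NCHW tensors scaled to [-1, 1]; placeholders are used here for brevity.
import torch
import lpips

metric = lpips.LPIPS(net="alex")                  # AlexNet-backed perceptual distance
generated = torch.rand(1, 3, 256, 256) * 2 - 1    # placeholder generated frame
reference = torch.rand(1, 3, 256, 256) * 2 - 1    # placeholder ground-truth frame
distance = metric(generated, reference)           # lower = perceptually closer
print(distance.item())
```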
Inference-time scaling with verifiers. Video models allow VLMs to be used directly as verifiers. By generating multiple videos and scoring them with a VLM, we can filter out unrealistic or low-quality generations. This provides a simple recipe for inference-time scaling [14].
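A best-of-N sketch of this recipe, with hypothetical `video_model.generate` and `vlm_score` functions standing in for whatever generator and verifier one uses:

```python
# Inference-time scaling via a verifier: sample N rollouts, keep the one the
# VLM scores highest. `video_model.generate` and `vlm_score` are hypothetical.
def best_of_n(video_model, vlm_score, prompt, first_frame, n=8):
    candidates = [video_model.generate(prompt, first_frame) for _ in range(n)]
    scores = [vlm_score(video, prompt) for video in candidates]  # e.g., realism + task success
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]
```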
Video models enable policy evaluation. Video models can serve as full-blown simulators for robot policies. In order to perform closed-loop rollouts, the output of the simulator must be matched to the input of the policy. For visuomotor control, this necessitates the generation of complete images (unless the policy is forced to ingest inputs in the latent space of a latent world model).
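A sketch of such a closed-loop rollout, assuming a hypothetical action-conditioned video world model with a `step` interface; note that the policy consumes ordinary image frames, which is exactly why pixel generation is needed:

```python
# Closed-loop policy evaluation with a video model acting as the simulator.
# `policy` and `video_world_model.step` are hypothetical interfaces.
def evaluate_policy(policy, video_world_model, initial_frame, task_prompt, steps=100):
    frame = initial_frame
    frames = [frame]
    for _ in range(steps):
        action = policy(frame, task_prompt)            # visuomotor policy acts on pixels
        frame = video_world_model.step(frame, action)  # model generates the next frame
        frames.append(frame)
    return frames  # score the rollout afterwards, e.g., with a success detector
```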
Video models compose naturally with image editors. Video models can take edited frames as input. As we showed in our Veo work, this provides a simple recipe for policy evaluation in out-of-distribution scenarios. Real-world observations can be edited (e.g., to introduce novel objects or backgrounds) and used to condition policy rollouts. One could imagine an analogous strategy for data generation in OOD scenarios (as in DreamGen).
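Composing the two is then a one-line change on top of the rollout sketch above: edit the real initial observation, then roll out from the edited frame (again with a hypothetical `image_editor.edit`):

```python
# Out-of-distribution evaluation: edit the real first frame, then roll out.
def ood_policy_eval(policy, video_world_model, image_editor, real_frame,
                    edit_prompt, task_prompt):
    edited_frame = image_editor.edit(real_frame, edit_prompt)  # e.g., "add clutter to the table"
    return evaluate_policy(policy, video_world_model, edited_frame, task_prompt)
```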
Huge commercial incentives. The main argument for video models is actually a non-technical one. There is massive commercial pressure to develop good video generation models. From social media apps to movie making, video models will develop at a rapid pace independent of robotics. We have seen this movie play out time and time again — from depth cameras for gaming, to IMUs for smartphones, to LLMs for NLP — technologies developed for independent commercial reasons have revolutionized robotics. The same may well be true for video models.
In the near term (2-3 years), I expect that video models will continue as the dominant force in world modeling for robotics. Especially for short-horizon manipulation tasks — which remain the north star for much of robotics today — I expect that the benefits above will outweigh the potential benefits of latent world models.
Will JEPAs win in the long term?
The main unresolved technical question with JEPAs for robotics is this: are the “predictable features” learned with a JEPA the same as useful features for robotics? Features that are predictable are not by themselves useful; we can always predict the feature that maps any image to a constant. However, there is a strong existence proof in the form of DINO [15], which is arguably the biggest success of JEPA-style self-supervised learning. DINO features have led to state-of-the-art results in a broad range of downstream vision tasks such as segmentation, depth prediction, and object detection. Whether similar benefits can be achieved for world modeling in robotics remains an open question.
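As a sense of how easy it is to put these features to work, DINOv2 backbones can be pulled from torch.hub and used frozen (input sizes should be multiples of the 14-pixel patch size, with normalization as in the DINOv2 repo):

```python
# Extract frozen DINOv2 features for a downstream robotics/vision task.
import torch

dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dinov2.eval()
image = torch.rand(1, 3, 224, 224)   # placeholder; use a properly normalized real image
with torch.no_grad():
    features = dinov2(image)         # global (CLS-token) embedding, shape (1, 384)
print(features.shape)
```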
In order to out-compete video models, JEPAs will also need to overcome some of the commercial pressures I highlighted above. However, it seems plausible that LeCun’s AMI startup will amass sufficient resources to overcome this hurdle and demonstrate the power of JEPAs for world modeling.
If I had to place a bet in the ~5 year timeline, I would bet on JEPAs being a critical component of robotics world models. I suspect that time horizons beyond a few seconds will truly start to matter in robotics once we make progress on basic manipulation skills, and the benefits of JEPAs for planning should become apparent then.
However, for the reasons highlighted in the previous section, I don’t expect JEPAs to be a one-to-one replacement for video models. First, JEPAs and video models can work together. Indeed, we are already seeing work that combines the two, e.g., by using a latent world model to make inference-time improvements to a video model [16]. Moreover, the representation learning objective of JEPAs can be applied alongside the video reconstruction objective. Finally, for use cases like policy evaluation, there are very clear benefits to video generation (e.g., using image editing to generate scene variations).
Regardless of how things play out, it’s a really exciting time with multiple bets being placed by different entities, and fundamental open questions that need to be resolved.
Acknowledgements
I’m grateful to Thomas Kipf, Kaustubh Sridhar, Harris Chan, and members of the IRoM Lab at Princeton for helpful conversations on this topic.