The Immersion Gap
These past few weeks, world models have been front and center. For the first time, much of what people had dreamed world models could one day deliver is becoming a concrete, tangible reality. As people create and explore these worlds, there is a lot to marvel at, but also a lot still to look forward to. In a sense, this is the GPT-1 moment of world models. So then, what is the GPT-2 moment? Or the GPT-3 moment?
We believe that the future of world models is already clear, but the distance between where current models are and where they need to go is very large. We need to move from navigating simple rooms and small environments to interacting with characters, objects, and tools in expansive and dynamic worlds. The interface needed to interact with these worlds must evolve from a keyboard and arrow keys to include all input interfaces: a mouse, a controller, a VR headset, AR glasses, our hands, and eventually our whole bodies.
In the coming years, these milestones will need to be met to cross what we have termed “The Immersion Gap.” The end-state of this technology is a simulation of the world indistinguishable from reality. To fully achieve this, a loop must be established and made fast, smooth, and consistent: the player acts, the world responds, the world evolves, and the player acts again.
Today, buzz around world models is almost entirely driven by visual quality, even when it comes at the cost of responsiveness, control, and iteration speed. As a subjective measure of quality, this made sense in an era of image diffusion, audio diffusion, and even video diffusion. However, to close the immersion gap, one must instead focus on interactivity.
In the race to maximize visual fidelity, interactivity has suffered massively. We’ve arrived in an era where a full second of latency (i.e., 1000ms ping) is considered playable, 4fps is considered real-time, and a rack of eight $50,000 GPUs is considered accessible. We’ve ended up here because of the path world model development has taken.
The Road to World Model Development
In the early days, models were trained from scratch on restricted domains: AI DOOM, AI Minecraft, or AI CoD. This produced highly interactive models that were great at creating faithful and fun experiences in their respective games. When it came time to create more general models, almost everyone fell back to fine-tuning video models to shoe-horn in simplified controls. Since you can no longer build your own architecture from the ground up, prioritizing controllability and speed, you inherit the shortcomings of the video model you start with. Primarily:
- You are limited by the model's original pre-training corpus. This prevents you from having highly dynamic environments, e.g., ones with NPCs, fighting, quests, functional UI elements, or other common game features. This detracts from the actual fun of the downstream experience.
- You are locked into the model's temporal autoencoder (generally 4x compression), which means the diffusion model you are fine-tuning only takes controls once every 4th frame. For a 24fps video, this means you are taking controls at an effective framerate of 6fps. This makes motion jarring and almost entirely blocks high-frequency input methods like a mouse, joystick, or VR headset.
- The upper bound on optimizations is in optimizing the pretrained model, which is generally not designed with interactiveness, speed, or accessibility in mind. Dependencies, inefficiencies, and limitations are inherited.
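The control-rate arithmetic in the list above can be sketched in a few lines. This is purely illustrative; the function name is ours, not from any world model codebase:

```python
# Sketch: effective control rate when a temporal autoencoder compresses
# several frames into one latent. Controls can only be injected once per
# latent, i.e. once every `temporal_compression` frames.

def effective_control_fps(video_fps: float, temporal_compression: int) -> float:
    return video_fps / temporal_compression

# A 24fps video behind a 4x temporal autoencoder accepts input at 6fps.
print(effective_control_fps(24, 4))   # 6.0
# With no temporal compression (1x), a 60fps model can react on every frame.
print(effective_control_fps(60, 1))   # 60.0
```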
This approach prevents users from reaching the level of interactivity expected from modern video games and is a major step back from the world model demos of late 2024. In some cases, world models drop user input outside of text prompts entirely. In others, user input is relegated to simple camera movement and panning with the arrow keys. Concessions are made due to the pretrained model. We get walking simulators, video editors, and tech demos.
Project Genie
Last week, we launched Waypoint-1. Our goal was to take world models away from purely impressive to look at and push them back toward being fun, interactive, and immersive. The launch of Project Genie, a week later, reinforced our belief and evolved our team’s thinking.
Project Genie is an incredible preview of what it feels like when you begin closing the immersion gap. The design and interface neatly complement the model's compelling visuals. As such, it has finally shown the world what world models can do!
At the same time, Genie showcases the realities of building at an enormous scale. Models this size are necessarily large, but they’re closed, prohibitively expensive to access, and run on swarms of proprietary TPUs. This isn’t a failure of execution, but rather a consequence of the concrete limits of the environment they’re built in. While the demonstrations showcase craftsmanship at the highest end, they’re not optimized for rapid integration, iteration, or wide experimentation. Imagine the kinds of experiences users could have if this technology were widely available; if it could be run locally, modified freely, iterated on quickly, and explored without cost or gatekeeping.
In case you haven’t heard of us, we’re Overworld, the local world model company! Our goal from the beginning has been to simulate anything on everything. We don’t believe world models need to be relegated to streaming, that latency is a necessary evil, or that fun should be sacrificed for a prettier picture. By pretraining our own models from scratch, we can ensure that dependencies are light, that there is sufficient room for optimization, and that the model is extremely reactive to user input. We’ve explained why this is important; now let’s explain how we did it!
Compression: Trading Off Quality and Speed
The first step towards creating a latent diffusion model is the autoencoder that powers it. Since we’re adamant about avoiding temporal compression to prevent input lag, we instead focused on creating an autoencoder that reduces the token count per frame as much as possible. Our older blog posts discuss our research into autoencoder design extensively. By training the Waypoint-1 VAE on video game footage directly, we were able to faithfully represent HUD elements and text and preserve the 16:9 aspect ratio common to games. By following a more modern model design for the ResNet backbone, we were able to improve throughput significantly, with a final distilled autoencoder that runs at such a high framerate that its cost is practically negligible compared to the diffusion model itself.
While a deeper-compression autoencoder can drastically reduce token count, we found there was a balance to be struck between increasing compression and keeping the latents “diffusable” for the world model itself. In the end, we settled on a latent size of 32x32 and used Muon and an exceptionally large base autoencoder to push the reconstruction quality as far as possible. We skipped any GAN objectives to prevent flickering and temporal artifacts.
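To make the trade-off concrete, here is the per-frame token budget that spatial-only compression implies. This is a back-of-the-envelope sketch assuming the 32x32 latent mentioned above maps to one token per latent position; the function names are ours:

```python
# Sketch: token budget for a spatial-only (no temporal compression)
# autoencoder, assuming one token per position of the 32x32 latent.

def tokens_per_frame(latent_h: int, latent_w: int) -> int:
    return latent_h * latent_w

def tokens_per_second(latent_h: int, latent_w: int, fps: int) -> int:
    # Without temporal compression, every frame produces a full latent.
    return tokens_per_frame(latent_h, latent_w) * fps

print(tokens_per_frame(32, 32))       # 1024 tokens per frame
print(tokens_per_second(32, 32, 60))  # 61440 tokens per second at 60fps
```

This is why compression matters so much: every extra latent pixel is paid for on every single frame, sixty times a second.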
World Modelling At The Edge
Flagship video models sit in the 50-100B parameter range. Even at a low framerate, the context lengths of these models require inference to be handled on clusters of enterprise GPUs like B200s. To train a small, short-context model at high framerates, we instead vary the effective framerate across layers. Most layers are local, meaning that frames can only attend to other nearby frames in short windows. These layers learn high-framerate patterns: fine-grained control and movements. To learn patterns on a longer time horizon, we have dilated global attention layers, which attend to past frames in “hops”. For example, the 64th frame sees the 56th frame, the 48th frame, the 40th frame, and so on. These layers effectively learn patterns at a lower framerate. Since attention is extremely sparse, the compute required for attention drops drastically, moving the bulk of computation to the MLP layers.
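The two attention patterns above can be sketched by listing which past frames a given position may attend to. The hop size of 8 is inferred from the 64 → 56 → 48 example, and the local window size is an illustrative assumption:

```python
# Sketch of the sparse attention patterns described above. Hop and window
# sizes are illustrative assumptions, not Waypoint-1's actual configuration.

def dilated_indices(t: int, hop: int = 8) -> list[int]:
    """Past frames a dilated global layer at position t attends to (in hops)."""
    return list(range(t % hop, t + 1, hop))

def local_indices(t: int, window: int = 4) -> list[int]:
    """Past frames a local causal layer at position t attends to."""
    return list(range(max(0, t - window + 1), t + 1))

print(dilated_indices(64))  # [0, 8, 16, 24, 32, 40, 48, 56, 64]
print(local_indices(64))    # [61, 62, 63, 64]
```

At position 64, dense causal attention would touch 65 frames; between them, these two patterns touch about a dozen, which is where the compute savings come from.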
Additionally, we completely remove everything but the timestep from our modulation layers (a.k.a. AdaLN, popularized by DiT). Since the sampler's timesteps are known ahead of inference, this allows us to cache all AdaLN outputs, so none of the matrix multiplications inside these layers ever need to be repeated during inference.
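The caching trick can be sketched as follows. The embedding, projection, and sampler timesteps here are toy stand-ins, not Waypoint-1's actual shapes or schedule; the point is only that a timestep-only modulation input makes the whole AdaLN computation a lookup:

```python
# Sketch of timestep-only AdaLN caching. Shapes, the embedding, and the
# timestep schedule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
D = 8                                 # toy hidden size
W = rng.standard_normal((D, 3 * D))   # AdaLN projection -> (shift, scale, gate)

def timestep_embedding(t: float) -> np.ndarray:
    # Toy sinusoidal embedding of the diffusion timestep.
    freqs = np.arange(D // 2)
    return np.concatenate([np.sin(t * 2.0**freqs), np.cos(t * 2.0**freqs)])

def adaln_params(t: float) -> np.ndarray:
    # The matmul we want to avoid repeating at inference time.
    return timestep_embedding(t) @ W

# Because a few-step sampler visits a fixed set of timesteps, every AdaLN
# output can be computed once up front and looked up thereafter.
SAMPLER_TIMESTEPS = (1.0, 0.5, 0.25, 0.0)
ADALN_CACHE = {t: adaln_params(t) for t in SAMPLER_TIMESTEPS}

assert np.allclose(ADALN_CACHE[0.5], adaln_params(0.5))  # cache == fresh compute
```

If the modulation also depended on, say, a text embedding, none of this would be cacheable, which is why stripping the inputs down to the timestep matters.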
As an interesting aside, our earlier demo model from last year was significantly weaker than Waypoint-1, with standard frame-causal attention patterns and basic distillation. Waypoint-1 Small has 2x the parameter count and 4x the tokens per frame, yet runs at about the same framerate thanks to sparse attention patterns and additional caching optimizations. We will go into more depth on this in the WP-1 full technical report, which is coming soon!
From Video Generators To World Generators
With the above, we have a causal video model; it is not yet a world model. An interesting thing to note is that our model diffuses one frame at a time from a 60fps video. This makes it extremely vulnerable to accumulating errors. After just a second of generation, it must perform 60 diffusion sampling loops. In contrast, the self-forcing literature we draw from denoises frames in latent chunks corresponding to 12 real frames, so it only performs 2 diffusion sampling loops in a second. Unsurprisingly, with 30x the number of forward steps per second, we decohere pretty quickly (typically in less than 100 milliseconds).
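The gap in sampling-loop counts described above is simple to quantify. A sketch, using the figures from the text (60fps frame-by-frame generation versus 12-frame chunks of 24fps video):

```python
# Sketch: diffusion sampling loops per generated second of video.

def loops_per_second(fps: int, frames_per_loop: int) -> int:
    return fps // frames_per_loop

ours = loops_per_second(60, 1)       # one frame per sampling loop -> 60
chunked = loops_per_second(24, 12)   # 12-frame chunks -> 2

print(ours, chunked, ours // chunked)  # 60 2 30
```

Every one of those loops is a chance for small errors to compound, which is why the frame-by-frame model decoheres so much faster than the chunked baseline.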
The first part of distillation is Distribution Matching Distillation (DMD), following the standard CausVid recipe. In DMD, you create a few-step student model with the help of a teacher and a critic. The teacher is frozen, while the student and critic both start as copies of it. To simplify heavily: the student generates, the teacher denoises in its own way, and the critic denoises in the same way as the student. The difference between these denoiser outputs gives a gradient that tells us how “off” the student is from what the teacher would have predicted. By fitting the critic to the student's outputs and improving the student along this gradient, we bring it in line with the teacher. Since the teacher's outputs can be augmented with classifier-free guidance, this also lets us distill a strong guidance signal into the student, improving control following. DMD can be very unstable, and in our case, due to the extreme load on our generator (30x the sampling steps), we had to take extra measures to mitigate instability. To this end, we trained the critic for 10x as many steps as the student and kept all models in full FP32 precision, which required a rather complicated FSDP setup for WP1 Medium.
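The core DMD signal can be illustrated with toy stand-ins. The three "denoisers" below are deliberately trivial linear functions, not real networks; the sketch only shows where the gradient comes from, namely the disagreement between teacher and critic on the student's own samples:

```python
# Toy sketch of the DMD gradient direction. All three "models" are
# illustrative stand-ins, not the CausVid implementation.
import numpy as np

rng = np.random.default_rng(0)

def student_generate(z: np.ndarray) -> np.ndarray:
    return 0.9 * z + 0.1          # few-step student sample (toy)

def teacher_denoise(x: np.ndarray, t: float) -> np.ndarray:
    return x - 0.2 * t            # frozen teacher's prediction (toy)

def critic_denoise(x: np.ndarray, t: float) -> np.ndarray:
    return x - 0.25 * t           # critic fit to the student's distribution (toy)

z = rng.standard_normal(4)
t = 0.5
x = student_generate(z)

# If teacher and critic agree on the student's samples, this gradient
# vanishes and the student already matches the teacher's distribution.
grad = critic_denoise(x, t) - teacher_denoise(x, t)
print(grad)  # here a constant offset, since the toy denoisers differ by 0.05*t
```

In practice the student is updated with this gradient while the critic is continually re-fit to the student's evolving outputs, which is the moving-target dynamic that makes DMD unstable.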
The second part of distillation is self-forcing. To enable KV caching and long, consistent rollouts, one must address the fact that predicting new frames from historical generated frames is a very different task from predicting new frames from historical ground-truth frames. Small errors in generations can build up and explode extremely quickly. To resolve this, we use Self Forcing Plus Plus (SFPP). Without SFPP, you would simply generate some frames with the student, then pass them to the teacher/critic for the DMD loss calculation. In SFPP, you instead roll out the student so that it generates many consecutive frames. As the rollout runs forward, the student is fed more and more of its own mistakes, until quality is entirely destroyed. So you let it roll out, then take the downstream generated frames, which were themselves conditioned on preceding generated frames, and pass those to the teacher/critic for the DMD computation. With SFPP, we were able to keep WP 1.1 stable for significantly longer periods.
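The rollout structure can be sketched as follows. The one-line "generator" is a toy that deliberately drifts off its conditioning, standing in for accumulating student error; rollout length and the downstream slice are illustrative choices:

```python
# Sketch of a self-forcing rollout: the student is conditioned on its OWN
# generations, and the loss targets downstream frames that have absorbed
# accumulated error. Purely illustrative, not the SFPP implementation.

def student_step(history: list[float]) -> float:
    # Toy one-frame generator that drifts slightly off its conditioning.
    return (history[-1] if history else 0.0) * 1.01 + 0.01

def self_forcing_rollout(n_frames: int, context: list[float]) -> list[float]:
    frames = list(context)
    for _ in range(n_frames):
        frames.append(student_step(frames))  # fed its own previous outputs
    return frames[len(context):]

rollout = self_forcing_rollout(32, context=[0.0])

# Only the later frames, which sit on top of accumulated student error,
# would be handed to the teacher/critic for the DMD computation.
downstream = rollout[-8:]
print(len(rollout), len(downstream))  # 32 8
```

Training on those downstream frames is what teaches the student to recover from its own mistakes rather than only from clean ground-truth context.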
The Next Step
For us, Project Genie reinforced why closing the immersion gap requires more than visual quality alone. It also sharpened the question that led us to create Overworld in the first place: what could people create and experience if world-building technology were widely available, easier to modify, and faster to iterate on?
Next week, we’re releasing Waypoint-1.1 Small, an incremental upgrade to last week’s Waypoint-1 Small release. This model requires only 16GB of VRAM, runs on modern consumer hardware, and is easy to build and mod on top of.
While the model is a modest 2 orders of magnitude smaller in parameter count than Genie, we are rapidly iterating to improve quality and accessibility, to create a model that anyone can run, regardless of hardware constraints!
If you want to explore local-first world models, try our current Waypoint-1 version and keep an eye out for Waypoint-1.1 next week: https://over.world/
