In our last post, we covered prompting, sanitization, and guardrails. We explained how those systems evolved into a flexible pipeline supporting creativity and safety. This week, we’re zooming out.
Safety at Overworld isn’t one filter or classifier. It’s a set of decisions that began on Day 0 of model development and continues through training, deployment, hosting, and community feedback.
Over the past year, we’ve been building dataset scanning pipelines, prompt sanitization layers, licensing decisions, and terms of service for open models and hosted experiences like overworld.stream. Safety is not a problem to solve and then deprioritize; it’s a continuous cycle of iteration.
In this post, we’ll walk through how safety discussions began at Overworld’s founding, the technical and organizational challenges involved, what we’ve implemented so far in Waypoint 1, Waypoint 1.5, and Overworld.stream, how licensing and product design influence safety, and what we plan to improve next.
Our overarching goal is to build systems that are creatively empowering and safe enough to operate responsibly.
Safety at Overworld: From Day 0
Throughout the past year, safety has been an ongoing discussion. From the earliest conversations about the model, our team was asking:
- What exactly are we training?
- Who is it for?
- How do we collect the data to train the model?
- How do we evaluate the data?
- How do we deliver the final model?
- When should models be open-sourced versus hosted, and what guardrails should exist at generation time?
These discussions aren’t simple, and we didn’t initially agree on all the answers, but they’ve been an integral part of our building process since the beginning.
The systems we’re creating at Overworld sit at the frontier of generative AI. From text-to-text generation to prompted images to video, and now to real-time interactive world models, the field has evolved quickly. We’re proud to be at the forefront, but world models introduce new considerations.
Rather than generating a single static output, world model systems generate persistent environments at high frame rates and respond continuously to user input. Safety systems must therefore operate differently than they do in traditional text or image models. World models greatly expand what users can create with a single prompt, allowing them to imagine environments, iterate rapidly, and communicate ideas visually with minimal friction. Alongside this exciting creative potential comes risk.
Most generations using Overworld so far are creative, fun, or just silly. However, as with any creative system, some users will attempt to produce illegal or inappropriate content. Initially, we implemented a prompting pipeline that sanitizes user input before it reaches the model. We described this in our previous blog post. But safety doesn’t begin or end at the prompt.
As gamers, creators, and researchers ourselves, we are constantly reevaluating the tension between creative freedom and the potential for harmful content. Both realities matter. Creative tools should empower expression. At the same time, these systems must operate within legal and platform constraints.
In the remainder of this post, we’ll dig into the systems we have built so far, the tradeoffs we’ve encountered, and the areas we intend to improve.
The Challenges of Safety
Safety comes with several categories of cost:
- Creative: excessive filtering can reduce creative expression
- Financial: filters, classifiers, and audits require infrastructure and compute
- Development time: safety systems compete with feature development
- Social: definitions of “safe” differ across communities
Alongside these challenges comes a fundamental uncertainty about what counts as “unsafe.” Within the Overworld team, we’ve defined unsafe systems in practical terms.
A system is unsafe if it is:
- Released prematurely without testing or red-teaming
- Deployed without clear terms of service or licensing
- Missing filtering or classification layers
- Capable of generating clearly illegal or age-inappropriate outputs
After defining what constitutes an unsafe system, engineering decisions follow. These decisions can affect nearly every layer of the stack, including training data, model architecture, product design, prompt handling, generation-time analysis, output evaluation, reporting systems, sharing and distribution, and open-source release strategy.
Further, safety interacts with the question of open-source versus hosted models. Hosted models enable more direct safeguards and monitoring. Meanwhile, open models empower researchers and developers but provide fewer direct controls once released. Community demand for open releases is often strong, while internal teams typically want additional safeguards. Balancing these two approaches isn’t something we take lightly, and it will remain part of the safety process as we continue building Overworld.
These safety decisions also have to be made within the constraints of building quickly as a startup. The reality is that speed matters, iteration cycles are short, and resources are limited. Early internal versions of the model prioritized end-to-end pipeline functionality, from curated data to training, filtered output, and hosted testing. Safety systems have continued evolving alongside that process.
These tradeoffs shape how we approach safety in practice. Instead of relying on a single safeguard, Overworld has been building safety systems across multiple layers of the stack.
Safety Measures We’ve Implemented So Far
Safety at Overworld operates across dataset filtering, model training decisions, runtime safeguards, licensing, and terms of service. Together, these layers form a pipeline of safeguards rather than relying on a single mechanism.
Training data is the first step in safety decision-making. We built a video-scanning pipeline to analyze our pretraining corpus before training. This pipeline processed more than 350,000 video clips using a 30-core SkyPilot CPU cluster. It runs two classifiers sequentially: OpenNSFW2 (ResNet-50) for explicit content and CLIP (ViT-B/32), queried with prompts targeting potential depictions of minors across human, cartoon, and stylized formats.
Each processed video produces a filter.json file that records flagged frames by category and determines whether clips should be fully excluded. This approach enables frame-level filtering rather than blunt clip-level removal. Flagged clips above confidence thresholds are downloaded and manually spot-checked before final calibration of filtering thresholds, resulting in a dataset with auditable metadata across the entire corpus.
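To make the aggregation step concrete, here is a minimal sketch of how per-frame classifier scores can be rolled up into a filter.json-style record. The thresholds, field names, and exclusion rule are illustrative stand-ins, not our actual schema or calibrated values; in practice the thresholds are finalized only after manual spot-checking of flagged clips.

```python
import json

# Illustrative thresholds; real values are calibrated after manual spot-checks.
NSFW_THRESHOLD = 0.85       # explicit-content score (e.g. from OpenNSFW2)
MINOR_THRESHOLD = 0.30      # CLIP similarity against minor-depiction prompts
MAX_FLAGGED_FRACTION = 0.10 # exclude the whole clip above this fraction

def build_filter_record(clip_id, frame_scores):
    """Aggregate per-frame scores into a filter.json-style record.

    frame_scores: list of dicts like {"frame": i, "nsfw": p1, "minor": p2}.
    Records flagged frames by category, enabling frame-level filtering,
    and decides whether the clip should be fully excluded.
    """
    flagged = {"nsfw": [], "minor": []}
    for s in frame_scores:
        if s["nsfw"] >= NSFW_THRESHOLD:
            flagged["nsfw"].append(s["frame"])
        if s["minor"] >= MINOR_THRESHOLD:
            flagged["minor"].append(s["frame"])
    n_flagged = len(set(flagged["nsfw"]) | set(flagged["minor"]))
    exclude = n_flagged / max(len(frame_scores), 1) > MAX_FLAGGED_FRACTION
    return {"clip_id": clip_id, "flagged_frames": flagged, "exclude_clip": exclude}

record = build_filter_record(
    "clip_0001",
    [{"frame": 0, "nsfw": 0.02, "minor": 0.05},
     {"frame": 1, "nsfw": 0.91, "minor": 0.04},
     {"frame": 2, "nsfw": 0.10, "minor": 0.02}],
)
print(json.dumps(record, indent=2))
```

Because each record carries per-frame flags rather than a single clip-level verdict, auditing and threshold recalibration can happen without re-running the classifiers over the corpus.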
These dataset safeguards operate before training begins. Once the model deploys, a second layer of safety systems takes over at runtime.
Runtime Considerations for Overworld.stream
Safety considerations for Overworld.stream operate along two tracks: what enters the model at runtime and what data the model was trained on. The runtime layer centers around a prompt sanitization pipeline where user prompts pass through an intermediate system before reaching the video diffusion model. This layer removes or transforms prompts containing IP references, celebrity likeness, brands or trademarks, explicit sexual content, depictions of minors, and other disallowed content categories.
The earliest version returned a structured JSON output describing removed concepts alongside sanitized prompts. While this approach was useful for debugging, it made the system unnecessarily complex.
The current system is simpler, returning only a sanitized prompt. It rewrites user intent into a safe environmental description suitable for generation and also integrates fal.ai’s Juggernaut model for seed image generation, with its built-in safety checker enabled as an additional filtering layer. All inference traffic also goes through a FastAPI proxy to ensure API keys remain server-side and prevent direct client-side access.
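The interface of the simplified layer can be sketched as a function that takes a raw user prompt and returns only a sanitized prompt. The actual system rewrites intent with a model; the keyword substitution below is a deliberately naive stand-in (with made-up terms and replacements) just to show the input/output shape.

```python
# Naive stand-in for the sanitization layer's interface. The real system
# rewrites user intent with a model; this sketch only swaps illustrative
# disallowed terms for neutral environmental descriptions.
DISALLOWED = {
    "mario": "a cheerful plumber character",   # IP reference (illustrative)
    "times square": "a neon-lit city square",  # trademark-heavy location
}

def sanitize_prompt(prompt: str) -> str:
    """Return a sanitized prompt only -- no structured JSON sidecar."""
    out = prompt.lower()
    for term, replacement in DISALLOWED.items():
        out = out.replace(term, replacement)
    return out

print(sanitize_prompt("Mario running through Times Square"))
# → "a cheerful plumber character running through a neon-lit city square"
```

Dropping the structured JSON sidecar means downstream code handles exactly one thing, a prompt string, which keeps the hot path simple at the cost of less built-in debugging visibility.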
Licensing, Terms of Service, and Product Design
Licensing, model cards, and product terms play important roles in safety. Overworld was founded by researchers from EleutherAI, and we strongly support open research and experimentation. Each release of our model includes an Apache-licensed model, a larger non-commercial research model, and our GPL-licensed inference library, WorldEngine. These decisions support open experimentation and discourage closed, unattributed derivatives.
Every release also includes a model card that documents capabilities and known risks. We encourage anyone training derivative models on Waypoint to publish model cards as well.
Terms of Service apply to our hosted services, such as Overworld.stream and Biome, as well as to partners hosting Waypoint models.
Future Work and Evaluation
The efforts behind our safety systems are ongoing. Our current work includes:
- Expanded moderation dashboards for Overworld.stream
- Generation-time safety classifiers
- Post-generation analysis of outputs
- Selective content obfuscation (e.g., blurring problematic frames rather than discarding entire generations)
- Real-time review pipelines that combine automation and human oversight
- Partnerships with third-party safety services
- Independent code audits of streaming infrastructure
- Research collaborations
- Bounty programs for jailbreaks and safety issues
Some details of these systems will remain undisclosed for practical reasons, but we plan to share more as they mature. Ensuring safety is a holistic process that includes technical safeguards and human evaluation to understand how models behave in the real world. Our future evaluation work includes:
- Structured red-teaming before major releases
- Surveys and qualitative feedback from users
- Monitoring attempted jailbreaks or prompt exploits
- Analysis of content distributions in generated outputs
- Confidential reporting channels for problematic generations
Each of these items helps ensure our models are operating safely enough in practice, not just in theory.
Get Involved
As a small team, we recognize that our internal perspectives are limited. Safety in generative systems is a multidisciplinary problem that requires input from engineers, artists, researchers, red-teamers, and community members.
If you’re interested in exploring these challenges with us, try the model, experiment with the tools, and share feedback with the community. We’d love to hear from you and see what you create in Overworld.
Your involvement can help build the next generation of world models and help expand the systems that keep them safe enough to deploy. Join the discussion, try the latest releases, or apply to work with us.
