
Whitepaper · v0.1 · April 2026

Learned simulators for evaluating general robot policies

Fern is building scalable evaluation and reinforcement-learning environments for robotics. We train high-fidelity, action-conditioned world models from real robot data, so any policy can be benchmarked and improved without ever running on physical hardware.

The problem with evaluating robot policies

General robot policies are improving fast — but the way we measure them hasn’t. The dominant evaluation methodology is still “put it on a real robot, run a few rollouts, eyeball the results.” That has three problems.

  • It doesn’t scale. Each evaluation costs operator time, hardware wear, and resets between trials. Comparing two checkpoints meaningfully takes hours, not seconds.
  • It isn’t reproducible. Lighting, object placements, and contact dynamics drift between sessions, so two research groups can’t compare numbers directly.
  • It isn’t safe to optimize against. Reinforcement learning needs millions of rollouts. Doing those on hardware is slow, expensive, and dangerous.

The same problems were solved for game-playing agents and language models with simulators and benchmark suites. Robotics doesn’t have either yet — not because nobody tried, but because hand-built physics simulators don’t cover the full visual and contact distribution of the real world. Sim2real is hard for a reason.

A simulator learned end-to-end from real data

Instead of writing a simulator, we learn one. Given a starting image and a stream of actions, our model rolls out future frames that match what would have happened on the real robot. The physics isn’t hand-coded — every contact, every shadow, every cable and gripper finger is something the model has seen during training.

  • High-fidelity images. The simulator outputs 256×256 RGB frames at the same effective rate as the real teleop trajectories that trained it.
  • Action-conditioned. Drive it with the same end-effector commands you’d send a real bimanual setup; the rollout responds to your input frame-by-frame.
  • No real robot needed. Once the model is trained, every researcher gets the same physics — no shipping hardware, no calibration drift, no operator queue.
  • Apples-to-apples comparison. Two policies, one identical environment, one identical starting state. We plan to host these benchmarks publicly so the field can move from anecdotal to quantitative comparisons.
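In code, this apples-to-apples evaluation might look like the following minimal sketch. The `world_model.step` interface, the policy callables, and `score_fn` are hypothetical stand-ins for illustration, not Fern's actual API:

```python
def rollout(world_model, policy, start_frame, horizon=200):
    """Roll a policy out inside a learned, action-conditioned simulator.

    `world_model.step(frame, action)` is a hypothetical interface: it
    predicts the next frame given the current frame and an action.
    """
    frame = start_frame
    frames, actions = [frame], []
    for _ in range(horizon):
        action = policy(frame)                    # policy maps observation -> action
        frame = world_model.step(frame, action)   # learned physics, one step
        frames.append(frame)
        actions.append(action)
    return frames, actions

def compare(world_model, policies, start_frame, score_fn, horizon=200):
    """Apples-to-apples: every policy sees the identical start state and physics."""
    return {
        name: score_fn(*rollout(world_model, policy, start_frame, horizon))
        for name, policy in policies.items()
    }
```

Because every policy is rolled out from the same starting frame inside the same frozen model, score differences reflect the policies, not run-to-run drift in the environment.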

What it looks like

Two clips from the validation set. Left half is the real robot recording; right half is the model’s rollout from the same starting frame and the same action stream — never seen during training.

Ground truth vs. generated · bimanual rope · validation episode br_0000

The clip above is from a proof-of-concept model we trained on the bimanual teleop dataset from Wang et al.’s Interactive World Simulator project. We took inspiration from that work but used a different, more scalable architecture: a single diffusion-forcing transformer handling all four task families through one unified action head at 256×256 RGB. The two streams stay aligned through the full 20-second episode.
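The alignment between the two streams can be made quantitative with a per-frame error metric. A generic sketch using PSNR between ground-truth and generated frames follows; this is illustrative only, not the metric reported here:

```python
import numpy as np

def psnr(ref, gen, max_val=255.0):
    """Peak signal-to-noise ratio between two RGB frames (higher is better)."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

def per_frame_psnr(ground_truth, generated):
    """PSNR for each aligned frame pair in two rollouts (lists of HxWx3 arrays)."""
    return [psnr(r, g) for r, g in zip(ground_truth, generated)]
```

Plotting this curve over the episode shows whether a rollout stays faithful or slowly drifts from the real recording.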

Since then we’ve scaled that architecture to a 16-DoF action space and four synchronized camera views — which required novel techniques for keeping the views physically consistent with each other. To our knowledge, the clip below is the first published world model that handles a production-scale bimanual setup end-to-end in a single network:

  • 16-DoF action conditioning. Joint-space commands, not a reduced end-effector parameterization.
  • Four synchronized camera views. Left wrist, right wrist, chest, and waist — generated jointly and consistent across views frame by frame.
  • First-party data. Collected by Fern on an OpenArm — the open-source 16-DoF bimanual robot from Anvil — not a retrofit of a public benchmark.

Ground truth vs. generated · OpenArm · 4-view validation episode

All four views stay coherent with each other across the episode — gripper poses, cloth geometry, and scene background line up across cameras the way physics requires.

Try it yourself

The simulator below is the same model running live on a single cloud GPU. Click Start the demo, then steer the bimanual setup with the keyboard. Every frame you see is generated on demand from the actions you send — there’s no recorded video being replayed.

This demo is a proof of concept. It’s pinned to a single multi-task checkpoint trained on a public open dataset, served just to prove the architecture and the live-driving loop work end-to-end in a browser.

Note: the GPU is shared. If the canvas takes a moment to come live, it’s likely already in use by someone else.

Click the canvas, then drive the rope.

What’s next

What we’re actively working on:

  • Architecture R&D. We’re iterating on the model architecture to shrink the sim-to-real gap, extend long-horizon stability, lower per-frame latency, and sharpen contact physics.
  • First-party data collection. We’re recording our own bimanual teleoperation in-house on a growing fleet of robots — expanding the base model’s coverage of grippers, contact regimes, and scene diversity well beyond what any single open dataset provides.
  • Custom world models for customers. Most robotics companies already have terabytes of teleoperation data sitting in cold storage from training their own policies. We re-purpose that data to fit a world model to their specific embodiments and tasks, so they can evaluate and RL-train their checkpoints against their own physics, not a generic one.
  • Public benchmarks. A growing catalog of evaluation tasks — manipulation, navigation, mobile manipulation — hosted on this site. Submit a policy, get a leaderboard placement, see exactly where it succeeds and fails.
  • RL environments. The same world models exposed as Gym-style environments for offline + online RL. Train against learned physics, deploy to real robots without ever burning hardware time on bad policies.
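A Gym-style interface over a learned world model could be sketched roughly as follows. The `world_model.step` predictor and `reward_fn` are hypothetical placeholders, not a released API:

```python
class LearnedSimEnv:
    """Minimal Gym-style environment backed by a learned world model.

    Follows the classic reset()/step() convention;
    `world_model.step(frame, action)` is a hypothetical one-frame predictor.
    """

    def __init__(self, world_model, start_frame, reward_fn, horizon=200):
        self.world_model = world_model
        self.start_frame = start_frame
        self.reward_fn = reward_fn    # maps an observation to a scalar reward
        self.horizon = horizon
        self._frame = None
        self._t = 0

    def reset(self):
        self._frame = self.start_frame
        self._t = 0
        return self._frame

    def step(self, action):
        self._frame = self.world_model.step(self._frame, action)  # learned physics
        self._t += 1
        reward = self.reward_fn(self._frame)
        done = self._t >= self.horizon
        return self._frame, reward, done, {}
```

Any off-the-shelf RL loop that speaks the reset/step protocol can then train against the learned physics without touching hardware.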

We’re building this for three kinds of teams. Policy developers who want their checkpoints evaluated end-to-end on a managed cloud platform, without standing up their own robot fleet. RL researchers who want Gym-style environments backed by learned physics, so training runs don’t need real hardware in the loop. And robotics companies who want a custom world model fitted to their own embodiments and existing teleop data — so evaluation and RL training happen in their physics, not a generic one. If any of those describe you, reach out at founders@fern.bot.