
Whitepaper · v0.1 · April 2026

Learned simulators for evaluating general robot policies

Fern is building scalable evaluation and reinforcement-learning environments for robotics. We train high-fidelity, action-conditioned world models from real robot data, so any policy can be benchmarked and improved without ever running on physical hardware.

The problem with evaluating robot policies

General robot policies are improving fast — but the way we measure them hasn’t. The dominant evaluation methodology is still “put it on a real robot, run a few rollouts, eyeball the results.” That has three problems.

  • It doesn’t scale. Each evaluation costs operator time, hardware wear, and resets between trials. Comparing two checkpoints meaningfully takes hours, not seconds.
  • It isn’t reproducible. Lighting, object placements, and contact dynamics drift between sessions, so two research groups can’t compare numbers directly.
  • It isn’t safe to optimize against. Reinforcement learning needs millions of rollouts. Doing those on hardware is slow, expensive, and dangerous.

The same problems were solved for game-playing agents and language models with simulators and benchmark suites. Robotics doesn’t have either yet — not because nobody tried, but because hand-built physics simulators don’t cover the full visual and contact distribution of the real world. Sim2real is hard for a reason.

A simulator learned end-to-end from real data

Instead of writing a simulator, we learn one. Given a starting image and a stream of actions, our model rolls out future frames that match what would have happened on the real robot. The physics isn’t hand-coded — every contact, every shadow, every cable and gripper finger is something the model has seen during training.

  • High-fidelity images. The simulator outputs 256×256 RGB frames at the same effective rate as the real teleop trajectories that trained it.
  • Action-conditioned. Drive it with the same end-effector commands you’d send a real bimanual setup; the rollout responds to your input frame-by-frame.
  • No real robot needed. Once the model is trained, every researcher gets the same physics — no shipping hardware, no calibration drift, no operator queue.
  • Apples-to-apples comparison. Two policies, one identical environment, one identical starting state. We plan to host these benchmarks publicly so the field can move from anecdotal to quantitative comparisons.
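In code, this apples-to-apples evaluation might look like the following minimal sketch. The `world_model.step` interface, the policy callables, and `score_fn` are hypothetical stand-ins for illustration, not Fern's actual API:

```python
def rollout(world_model, policy, start_frame, horizon=200):
    """Roll a policy out inside a learned, action-conditioned simulator.

    `world_model.step(frame, action)` is a hypothetical interface: it
    predicts the next frame given the current frame and an action.
    """
    frame = start_frame
    frames, actions = [frame], []
    for _ in range(horizon):
        action = policy(frame)                    # policy maps observation -> action
        frame = world_model.step(frame, action)   # learned physics, one step
        frames.append(frame)
        actions.append(action)
    return frames, actions

def compare(world_model, policies, start_frame, score_fn, horizon=200):
    """Apples-to-apples: every policy sees the identical start state and physics."""
    return {
        name: score_fn(*rollout(world_model, policy, start_frame, horizon))
        for name, policy in policies.items()
    }
```

Because every policy is rolled out from the same starting frame inside the same frozen model, score differences reflect the policies, not run-to-run drift in the environment.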

What it looks like

Two clips from the validation set. Left half is the real robot recording; right half is the model’s rollout from the same starting frame and the same action stream — never seen during training.

Ground truth vs. generated · bimanual rope · validation episode br_0000

The clip above is from a proof-of-concept model we trained on the bimanual teleop dataset from Wang et al.’s Interactive World Simulator project. We took inspiration from that work but used a different, more scalable architecture: a single diffusion-forcing transformer handling all four task families through one unified action head at 256×256 RGB. The two streams stay aligned through the full 20-second episode.
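The alignment between the two streams can be made quantitative with a per-frame error metric. A generic sketch using PSNR between ground-truth and generated frames follows; this is illustrative only, not the metric reported here:

```python
import numpy as np

def psnr(ref, gen, max_val=255.0):
    """Peak signal-to-noise ratio between two RGB frames (higher is better)."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

def per_frame_psnr(ground_truth, generated):
    """PSNR for each aligned frame pair in two rollouts (lists of HxWx3 arrays)."""
    return [psnr(r, g) for r, g in zip(ground_truth, generated)]
```

Plotting this curve over the episode shows whether a rollout stays faithful or slowly drifts from the real recording.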

Since then we’ve scaled that architecture to a 16-DoF action space and four synchronized camera views — which required novel techniques for keeping the views physically consistent with each other. To our knowledge, the clip below is the first published world model that handles a production-scale bimanual setup end-to-end in a single network:

  • 16-DoF action conditioning. Joint-space commands, not a reduced end-effector parameterization.
  • Four synchronized camera views. Left wrist, right wrist, chest, and waist — generated jointly and consistent across views frame by frame.
  • First-party data. Collected by Fern on an OpenArm — the open-source 16-DoF bimanual robot from Anvil — not a retrofit of a public benchmark.

Ground truth vs. generated · OpenArm · 4-view validation episode

All four views stay coherent with each other across the episode — gripper poses, cloth geometry, and scene background line up across cameras the way physics requires.

Try it yourself

The simulator below is the same model running live on a single cloud GPU. Click Start the demo, then steer the bimanual setup with the keyboard. Every frame you see is generated on demand from the actions you send — there’s no recorded video being replayed.

This demo is a proof of concept. It’s pinned to a single multi-task checkpoint trained on a public open dataset, served just to prove the architecture and the live-driving loop work end-to-end in a browser.

Note: the GPU is shared. If the canvas takes a moment to come live, it’s likely already in use by someone else.

Click the canvas, then drive the rope.

What’s next

What we’re actively working on:

  • Architecture R&D. We’re iterating on the model architecture to shrink the sim-to-real gap, extend long-horizon stability, lower per-frame latency, and sharpen contact physics.
  • First-party data collection. We’re recording our own bimanual teleoperation in-house on a growing fleet of robots — expanding the base model’s coverage of grippers, contact regimes, and scene diversity well beyond what any single open dataset provides.
  • Custom world models for customers. Most robotics companies already have terabytes of teleoperation data sitting in cold storage from training their own policies. We re-purpose that data to fit a world model to their specific embodiments and tasks, so they can evaluate and RL-train their checkpoints against their own physics, not a generic one.
  • Public benchmarks. A growing catalog of evaluation tasks — manipulation, navigation, mobile manipulation — hosted on this site. Submit a policy, get a leaderboard placement, see exactly where it succeeds and fails.
  • RL environments. The same world models exposed as Gym-style environments for offline + online RL. Train against learned physics, deploy to real robots without ever burning hardware time on bad policies.
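A Gym-style interface over a learned world model could be sketched roughly as follows. The `world_model.step` predictor and `reward_fn` are hypothetical placeholders, not a released API:

```python
class LearnedSimEnv:
    """Minimal Gym-style environment backed by a learned world model.

    Follows the classic reset()/step() convention;
    `world_model.step(frame, action)` is a hypothetical one-frame predictor.
    """

    def __init__(self, world_model, start_frame, reward_fn, horizon=200):
        self.world_model = world_model
        self.start_frame = start_frame
        self.reward_fn = reward_fn    # maps an observation to a scalar reward
        self.horizon = horizon
        self._frame = None
        self._t = 0

    def reset(self):
        self._frame = self.start_frame
        self._t = 0
        return self._frame

    def step(self, action):
        self._frame = self.world_model.step(self._frame, action)  # learned physics
        self._t += 1
        reward = self.reward_fn(self._frame)
        done = self._t >= self.horizon
        return self._frame, reward, done, {}
```

Any off-the-shelf RL loop that speaks the reset/step protocol can then train against the learned physics without touching hardware.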

We’re building this for three kinds of teams. Policy developers who want their checkpoints evaluated end-to-end on a managed cloud platform, without standing up their own robot fleet. RL researchers who want Gym-style environments backed by learned physics, so training runs don’t need real hardware in the loop. And robotics companies who want a custom world model fitted to their own embodiments and existing teleop data — so evaluation and RL training happen in their physics, not a generic one. If any of those describe you, reach out at founders@fern.bot.