EVE: A Generator-Verifier System for Generative Policies

Visuomotor policies based on generative models such as diffusion and flow-matching have shown strong performance in robotics applications but degrade under distribution shift and show limited recovery capability without costly finetuning. In the language modeling domain, test-time compute scaling has revolutionized the reasoning capabilities of modern LLMs by enabling refinement of candidate solutions. These methods typically use foundation models as zero-shot verification modules that score candidate solutions. We hypothesize that generative policies can similarly benefit from additional inference-time compute that employs zero-shot VLM-based verifiers in a generation-verification framework. To this end, we introduce EVE: a modular generator-verifier interaction framework that boosts the performance of pretrained generative policies at test time, with no additional training. EVE wraps a frozen base policy with multiple zero-shot, VLM-based verifier agents. Each verifier proposes refinements to the base policy's candidate actions, while an action incorporator uses classifier guidance to fuse the aggregated verifier feedback into action denoising. We study design choices for the generator-verifier information interface across a system of verifiers with distinct capabilities. Across diverse simulated and real robotic tasks and embodiments, EVE consistently improves success rates without additional policy or verifier training. Through extensive ablations, we isolate the contributions of verifier capabilities and action incorporator strategies, offering practical guidelines for building scalable, modular generator-verifier systems for embodied control.

Rather than finetuning or retraining a base policy with expensive in-domain recovery data, EVE (Embodied Verifier Ensembles) leverages frozen, frontier vision-language models (VLMs) as zero-shot verifiers in a unified generator-verifier architecture. At each intervention point, EVE queries multiple verifiers with distinct action-feedback strategies and fuses their corrections into the base VLA through a classifier-guidance-based action incorporator. The framework is organized around four components:

Base Policy Candidate Generation. Given an instruction, observation, and proprioceptive state, the base VLA samples K diverse candidate action trajectories by denoising from independent noise samples.
Verifier Agents. Each module in a collection of verifiers provides structured, distinct action feedback. Generator-Agnostic verifiers operate only on sensor observations and select from predefined recovery action primitives (the Primitive steerer). Generator-Conditioned verifiers additionally consume a representation of the candidate trajectories and select among them (the Pivot steerer, built on PIVOT-style prompting). Verifier outputs are aggregated into a common trajectory representation using a weighted average.
Action Incorporator. Instead of overriding or naively averaging actions, EVE uses a guided diffusion process that steers denoising toward verifier-consistent behavior while preserving the base VLA's learned action prior. A guidance coefficient controls how much verifier feedback is injected at each reverse-diffusion step.
Intervention Detection. To avoid costly VLM calls at every step, EVE invokes verifiers only at automatically detected intervention points — whenever the cumulative Maximum Mean Discrepancy (MMD) of the sampled action distribution exceeds a threshold, flagging erratic or out-of-distribution behavior.
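
The intervention trigger described in the last component can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the page does not say which two distributions the MMD compares, so the sketch assumes it is computed between the candidate actions sampled at consecutive steps, uses an RBF kernel, and resets the running total after each intervention. The names `InterventionDetector` and `sample_candidates`, the bandwidth, and the threshold value are all hypothetical.

```python
import math
import random


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two action vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    def mean_kernel(ps, qs):
        total = sum(rbf_kernel(p, q, bandwidth) for p in ps for q in qs)
        return total / (len(ps) * len(qs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


class InterventionDetector:
    """Accumulates MMD between consecutive candidate sets and fires past a threshold.

    Assumption: the reference set is the previous step's candidates. The
    source only states that the cumulative MMD is compared to a threshold.
    """

    def __init__(self, threshold, bandwidth=1.0):
        self.threshold = threshold
        self.bandwidth = bandwidth
        self.previous = None
        self.cumulative = 0.0

    def update(self, candidates):
        """Return True when the verifiers should be queried at this step."""
        if self.previous is not None:
            self.cumulative += mmd_squared(self.previous, candidates, self.bandwidth)
        self.previous = candidates
        if self.cumulative > self.threshold:
            self.cumulative = 0.0  # assumed: reset after an intervention
            return True
        return False


def sample_candidates(rng, mean, spread, k=16, dim=2):
    """Stand-in for K candidate actions sampled from the base policy."""
    return [[rng.gauss(mean, spread) for _ in range(dim)] for _ in range(k)]


rng = random.Random(0)
detector = InterventionDetector(threshold=0.5)
# Five steps with a stable action distribution, then one abrupt shift.
stable = [detector.update(sample_candidates(rng, 0.0, 0.1)) for _ in range(5)]
shifted = detector.update(sample_candidates(rng, 3.0, 0.1))
```

In this toy run the stable steps accumulate almost no discrepancy, so no verifier call is made, while the abrupt shift in the sampled actions pushes the running total past the threshold and triggers an intervention.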

The result is an inference-time action refinement algorithm that produces semantically grounded recovery feedback and interpolates it with the base VLA action distribution, requiring no finetuning of policy weights.
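
To make the interpolation concrete, here is a toy sketch of guidance-steered denoising. It is not the EVE action incorporator itself: the base policy is replaced by a stand-in step that contracts toward a fixed prior mode, and the verifier signal is assumed to be a Gaussian likelihood centred on the aggregated verifier target, so its log-gradient is simply the offset to that target. The function names, step size, and guidance value are illustrative, and the guidance scale here is not comparable to the coefficients reported in the ablations.

```python
import random


def toy_denoise_step(x, prior_mean, rate=0.2):
    """Stand-in for one reverse step of the frozen base policy."""
    return [xi + rate * (mi - xi) for xi, mi in zip(x, prior_mean)]


def verifier_gradient(x, target):
    """Gradient of 0.5 * ||x - target||^2, the assumed verifier cost."""
    return [xi - ti for xi, ti in zip(x, target)]


def guided_denoise(noise, prior_mean, target, guidance, steps=50, step_size=0.02):
    """Run the toy reverse process, nudging every step down the verifier cost."""
    x = list(noise)
    for _ in range(steps):
        x = toy_denoise_step(x, prior_mean)
        grad = verifier_gradient(x, target)
        x = [xi - guidance * step_size * gi for xi, gi in zip(x, grad)]
    return x


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2)]
prior_mean = [0.0, 0.0]        # where the base policy alone would end up
verifier_target = [-1.0, 0.0]  # hypothetical aggregated "move left" feedback

unguided = guided_denoise(noise, prior_mean, verifier_target, guidance=0.0)
guided = guided_denoise(noise, prior_mean, verifier_target, guidance=5.0)
```

With zero guidance the sample settles on the base policy's prior mode. With a positive coefficient it settles between the prior mode and the verifier target, which is the behaviour the paragraph above describes: verifier feedback shifts the action without replacing the learned prior.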

On the SimplerEnv real-to-sim benchmark, EVE with training-free, zero-shot verifiers outperforms state-of-the-art finetuned verifiers. EVE-Ensemble attains the best total average success rate of 72.2, improving over the base policy π₀ (67.1) and over trained-verifier baselines RoboMonkey (68.1, trained on 20M synthetic preferences) and V-GPS (27.3, trained on 175K demos) — while using zero training data. The largest gains concentrate on harder, lower-success tasks such as Put Apple in Drawer (29.9 → 40.3) and Stack Blocks (56.2 → 63.2).

EVE generalizes to new embodiments and diverse tasks where trained-verifier baselines (RoboMonkey, V-GPS) are not applicable, since they require extensive in-domain training. On the ManiSkill-HAB long-horizon mobile manipulation suite (mean over 1008 rollouts), EVE consistently improves a base diffusion policy. On RoboTwin 2.0 with an Aloha AgileX dual-arm embodiment, EVE improves the π_0.5 VLA.

We validate EVE on a real Franka Emika Panda arm. We finetune the π_0.5 VLA on 50 demonstrations of placing two blocks into a bin, and evaluate on 5 tasks (10 trials each): 1 in-distribution task, 2 OOD-Prompt tasks (pick blocks in a specific order), and 1 OOD-Object task (pick coffee pod). EVE matches RoboMonkey in-distribution and achieves the best performance across all OOD settings via semantically informed, verifier-guided corrections, while RoboMonkey degrades sharply in the OOD-Prompt setting.

EVE with a smaller but newer Qwen3-VL-8B backbone still outperforms RoboMonkey, suggesting gains scale with stronger VLM backbones. Although EVE has higher per-step latency, its MMD trigger invokes the verifier only when the action distribution deviates significantly — and it verifies at the action-chunk level — yielding a lower average rollout time than RoboMonkey, which scores actions at every timestep.
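
The amortization argument can be written out symbolically. The notation below is introduced here for illustration and does not come from the paper: N is the number of timesteps in a rollout, N_int the number of MMD-triggered interventions, t_ver the cost of one EVE verifier call, t_score the cost of scoring actions at one timestep, and B the base-policy cost, which is assumed for simplicity to be the same for both methods.

```latex
\begin{align}
T_{\text{EVE}}  &= B + N_{\text{int}}\, t_{\text{ver}}, \\
T_{\text{step}} &= B + N\, t_{\text{score}}, \\
T_{\text{EVE}} < T_{\text{step}}
  &\iff \frac{N_{\text{int}}}{N} < \frac{t_{\text{score}}}{t_{\text{ver}}}.
\end{align}
```

Under these assumptions, a verifier call that is much slower than a per-step score still gives a shorter rollout as long as interventions are rare enough relative to the number of timesteps.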

We ablate the core components of EVE on the SetTable-Place task from ManiSkill-HAB, reporting the delta in success rate (%) between steered and unsteered runs. (a) Verifier model scaling: larger VLM verifiers steer more effectively, with Qwen-2.5-VL-72B consistently beating 7B and 32B. (b) Action incorporator design: guided diffusion outperforms both verifier override and naive averaging, integrating just enough verifier feedback to prevent failure while staying near the base policy's action distribution. (c) Guidance ratio: performance peaks at a guidance coefficient of 10, with sharp drops at neighboring values.

On a failed Place Apple in Drawer rollout, the base policy π₀ only partially opens the drawer, causing the apple to fall out. EVE detects an MMD spike and intervenes: the Primitive steerer proposes a "Nudge Left" action toward the handle, while the Pivot steerer selects trajectories that maximize outward pulling of the drawer. The averaged feedback guides denoising so the drawer opens fully and the apple is placed successfully — showing how complementary verifier guidance recovers a failed trajectory.

On a failed Move Near rollout, the base policy executes a faulty grip and the can tips over. EVE detects an MMD spike during the approach phase. Here the Primitive steerer proposes a corrective "Nudge Left", while the Pivot steerer incorrectly selects an unhelpful trajectory. Because EVE aggregates feedback across verifiers, the averaged signal still yields a small leftward correction that achieves a stable grasp — illustrating EVE's robustness to imperfect verifier feedback.
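
The robustness effect in this example follows directly from the weighted-average aggregation. The sketch below uses made-up numbers, not values from the rollout: trajectories are short lists of hypothetical (dx, dy) end-effector deltas with negative dx meaning "left", and the equal weights are an assumption.

```python
def aggregate_feedback(proposals, weights):
    """Weighted average of verifier trajectories that share one shape."""
    total = sum(weights)
    horizon = len(proposals[0])
    dim = len(proposals[0][0])
    return [
        [
            sum(w * traj[t][d] for w, traj in zip(weights, proposals)) / total
            for d in range(dim)
        ]
        for t in range(horizon)
    ]


# Hypothetical 3-step trajectories of (dx, dy) deltas; negative dx is "left".
primitive_nudge_left = [[-0.02, 0.0], [-0.02, 0.0], [-0.02, 0.0]]
pivot_unhelpful = [[0.01, 0.0], [0.01, 0.0], [0.01, 0.0]]

target = aggregate_feedback(
    [primitive_nudge_left, pivot_unhelpful], weights=[1.0, 1.0]
)
```

Even though one verifier points the wrong way, the averaged target keeps a smaller leftward component at every step, so a single poor selection dilutes the correction rather than reversing it.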

BibTeX

@article{ali2025eve,
  title={EVE: A Generator-Verifier System for Generative Policies},
  author={Ali, Yusuf and Patlin, Gryphon and Kothuri, Karthik and Coholich, Jeremiah and Irshad, Muhammad Zubair and Liang, Wuwei and Kira, Zsolt},
  journal={arXiv preprint arXiv:2512.21430},
  year={2025},
  url={https://arxiv.org/abs/2512.21430}
}
