EVE: A Generator-Verifier System for Generative Policies
Under Review
Lightning Talk at ICRA Semantics for Reliable Robot Autonomy Workshop, 2026
Poster Presentation at ICRA Data to Decisions: VLA Pipelines for Real Robots Workshop, 2026
EVE: A generator-verifier interaction framework for generative embodied policies. Action feedback from an ensemble of zero-shot VLM-based verifier agents is incorporated into base-policy denoising through guided diffusion.
Visuomotor policies based on generative models such as diffusion and flow matching have shown strong performance for robotics applications but degrade under distribution shifts, demonstrating limited recovery capabilities without costly finetuning. In the language modeling domain, test-time compute scaling has revolutionized the reasoning capabilities of modern LLMs by enabling candidate solution refinement. These methods typically leverage foundation models as verification modules in a zero-shot manner to score candidate solutions. We hypothesize that generative policies can similarly benefit from additional inference-time compute that employs zero-shot VLM-based verifiers in a generation-verification framework. To this end, we introduce EVE: a modular, generator-verifier interaction framework that boosts the performance of pretrained generative policies at test time, with no additional training. EVE wraps a frozen base policy with multiple zero-shot, VLM-based verifier agents. Each verifier proposes action refinements to the base policy's candidate actions, while an action incorporator uses classifier guidance to fuse aggregated verifier feedback into action denoising. We study design choices for generator-verifier information interfacing across a system of verifiers with distinct capabilities. Across diverse simulated and real robotic tasks and embodiments, EVE consistently improves success rates without additional policy or verifier training. Through extensive ablations, we isolate the contribution of verifier capabilities and action incorporator strategies, offering practical guidelines for building scalable, modular generator-verifier systems for embodied control.
Rather than finetuning or retraining a base policy with expensive in-domain recovery data, EVE (Embodied Verifier Ensembles) leverages frozen, frontier vision-language models (VLMs) as zero-shot verifiers in a unified generator-verifier architecture. At each intervention point, EVE queries multiple verifiers with distinct action-feedback strategies and fuses their corrections into the base VLA through a classifier-guidance-based action incorporator. The framework is organized around four components:
The result is an inference-time action refinement algorithm that produces semantically grounded recovery feedback and interpolates it with the base VLA action distribution, requiring no finetuning of policy weights.
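The control flow described above (decide whether to intervene, query several verifiers, aggregate their feedback, fuse it into the base action) can be sketched roughly as follows. This is a minimal illustration rather than the EVE implementation: every name here (`refine_chunk`, `nudge_left`, `keep_as_is`, `blend`) is hypothetical, the verifiers are toy functions instead of VLM calls, and the 50/50 blend is only a placeholder for the guidance-based incorporator.

```python
from typing import Callable, Sequence

import numpy as np

# Hypothetical types for illustration; none of these names come from EVE itself.
ActionChunk = np.ndarray  # shape (horizon, action_dim)
Verifier = Callable[[ActionChunk], ActionChunk]


def refine_chunk(
    candidate: ActionChunk,
    verifiers: Sequence[Verifier],
    should_intervene: Callable[[ActionChunk], bool],
    incorporate: Callable[[ActionChunk, ActionChunk], ActionChunk],
) -> ActionChunk:
    """Return the candidate unchanged, or a verifier-refined version of it."""
    if not should_intervene(candidate):
        return candidate  # no intervention: the frozen base policy acts as usual
    # Each verifier proposes its own refinement of the same candidate chunk.
    proposals = np.stack([verify(candidate) for verify in verifiers])
    # Aggregate by simple averaging (an assumption; other schemes are possible).
    feedback = proposals.mean(axis=0)
    return incorporate(candidate, feedback)


# Toy stand-ins so the sketch runs end to end.
def nudge_left(chunk: ActionChunk) -> ActionChunk:
    shifted = chunk.copy()
    shifted[:, 0] -= 0.2  # move the first action dimension "left"
    return shifted


def keep_as_is(chunk: ActionChunk) -> ActionChunk:
    return chunk.copy()


def blend(base: ActionChunk, feedback: ActionChunk) -> ActionChunk:
    # Placeholder for the guidance-based incorporator: a plain 50/50 blend.
    return 0.5 * base + 0.5 * feedback


candidate = np.zeros((4, 2))
refined = refine_chunk(candidate, [nudge_left, keep_as_is], lambda c: True, blend)
skipped = refine_chunk(candidate, [nudge_left, keep_as_is], lambda c: False, blend)
```

Passing the trigger, verifiers, and incorporator as plain callables mirrors the modular framing in the text: each piece can be swapped without touching the frozen base policy.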
On the SimplerEnv real-to-sim benchmark, EVE with training-free, zero-shot verifiers outperforms state-of-the-art finetuned verifiers. EVE-Ensemble attains the best total average success rate of 72.2, improving over the base policy π0 (67.1) and over trained-verifier baselines RoboMonkey (68.1, trained on 20M synthetic preferences) and V-GPS (27.3, trained on 175K demos) — while using zero training data. The largest gains concentrate on harder, lower-success tasks such as Put Apple in Drawer (29.9 → 40.3) and Stack Blocks (56.2 → 63.2).
EVE with training-free, zero-shot verifiers outperforms state-of-the-art finetuned verifiers on SimplerEnv. Bold is best, underline is second-best.
EVE generalizes to new embodiments and diverse tasks where trained-verifier baselines (RoboMonkey, V-GPS) are not applicable, since they require extensive in-domain training. On the ManiSkill-HAB long-horizon mobile manipulation suite (mean over 1008 rollouts), EVE consistently improves a base diffusion policy. On RoboTwin 2.0 with an Aloha AgileX dual-arm embodiment, EVE improves the π0.5 VLA.
(a) ManiSkill-HAB mobile manipulation. (b) RoboTwin 2.0 bimanual manipulation (π0.5 VLA).
We validate EVE on a real Franka Emika Panda arm. We finetune the π0.5 VLA on 50 demonstrations of placing two blocks into a bin, and evaluate on 5 tasks (10 trials each): 1 in-distribution task, 2 OOD-Prompt tasks (pick blocks in a specific order), and 1 OOD-Object task (pick coffee pod). EVE matches RoboMonkey in-distribution and achieves the best performance across all OOD settings via semantically informed, verifier-guided corrections, while RoboMonkey degrades sharply in the OOD-Prompt setting.
Success rate (out of 10 trials) on the real Franka robot.
Sample rollouts: verifier intervention and base VLA rollout continuation after EVE intervention.
EVE with a smaller but newer Qwen3-VL-8B backbone still outperforms RoboMonkey, suggesting gains scale with stronger VLM backbones. Although EVE has higher per-step latency, its MMD trigger invokes the verifier only when the action distribution deviates significantly — and it verifies at the action-chunk level — yielding a lower average rollout time than RoboMonkey, which scores actions at every timestep.
EVE with smaller but newer VLM verifier backbones outperforms RoboMonkey on SimplerEnv, with a lower average rollout time despite higher per-step latency.
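To make the MMD trigger concrete, here is a minimal sketch of a maximum mean discrepancy check between two sets of sampled actions. The text does not state which two distributions EVE compares, so the `reference` and `current` sample sets, the RBF kernel, the bandwidth, and the threshold are all assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(x: np.ndarray, y: np.ndarray, bandwidth: float) -> np.ndarray:
    """Gaussian kernel matrix between rows of x (n, d) and rows of y (m, d)."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clamp tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(x: np.ndarray, y: np.ndarray, bandwidth: float = 3.0) -> float:
    """Biased estimate of the squared MMD between two sample sets."""
    return float(
        rbf_kernel(x, x, bandwidth).mean()
        + rbf_kernel(y, y, bandwidth).mean()
        - 2.0 * rbf_kernel(x, y, bandwidth).mean()
    )


def should_verify(
    reference: np.ndarray,
    current: np.ndarray,
    threshold: float = 0.1,  # hypothetical value, not taken from the paper
    bandwidth: float = 3.0,  # hypothetical value, not taken from the paper
) -> bool:
    """Invoke the verifiers only when the two sample sets differ enough."""
    return mmd_squared(reference, current, bandwidth) > threshold


# Toy 7-dimensional action samples: two from the same distribution, one shifted.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 7))
nominal = rng.normal(size=(200, 7))
shifted = rng.normal(loc=2.0, size=(200, 7))

trigger_on_nominal = should_verify(reference, nominal)
trigger_on_shifted = should_verify(reference, shifted)
```

Because the expensive verifier call sits behind this check, steps where the samples look nominal cost only the kernel computation, which is consistent with the lower average rollout time reported above.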
We ablate the core components of EVE on the SetTable-Place task from ManiSkill-HAB, reporting the delta in success rate (%) between steered and unsteered runs. (a) Verifier model scaling: larger VLM verifiers steer more effectively, with Qwen-2.5-VL-72B consistently beating 7B and 32B. (b) Action incorporator design: guided diffusion outperforms both verifier override and naive averaging, integrating just enough verifier feedback to prevent failure while staying near the base policy's action distribution. (c) Guidance ratio: performance peaks at a guidance coefficient of 10, with sharp drops at neighboring values.
(a) Verifier model scaling.
(b) Action incorporator design.
(c) Guidance ratio.
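The three incorporator strategies in ablation (b) can be contrasted on a toy problem. The sketch below is illustrative only: the denoiser is a stand-in that contracts toward a fixed base-policy mode, the guidance term is the gradient of an assumed quadratic log-likelihood centered on the verifier's action, and `step_size` is an arbitrary toy value. The coefficient 10.0 merely echoes the ablation and carries no quantitative meaning in this setting.

```python
import numpy as np


def override(base_action: np.ndarray, verifier_action: np.ndarray) -> np.ndarray:
    """Discard the base action and execute the verifier's action."""
    return verifier_action.copy()


def naive_average(base_action: np.ndarray, verifier_action: np.ndarray) -> np.ndarray:
    """Average the two actions after denoising has finished."""
    return 0.5 * (base_action + verifier_action)


def guided_denoise(denoise_step, noisy_action, verifier_action,
                   guidance_coef, num_steps, step_size=0.01):
    """Add a verifier-directed gradient after every base denoising step."""
    action = noisy_action.copy()
    for k in range(num_steps):
        action = denoise_step(action, k)
        # Gradient of -0.5 * ||action - verifier_action||^2 w.r.t. action.
        grad = verifier_action - action
        action = action + step_size * guidance_coef * grad
    return action


base_mode = np.array([0.0, 0.0])        # where the toy base policy wants to go
verifier_action = np.array([1.0, 0.0])  # where the toy verifier wants to go


def toy_denoise_step(action, k):
    # Stand-in denoiser: move halfway toward the base mode at every step.
    return 0.5 * action + 0.5 * base_mode


rng = np.random.default_rng(0)
noisy = rng.normal(size=2)

unguided = guided_denoise(toy_denoise_step, noisy, verifier_action, 0.0, 20)
guided = guided_denoise(toy_denoise_step, noisy, verifier_action, 10.0, 20)
averaged = naive_average(unguided, verifier_action)
overridden = override(unguided, verifier_action)
```

In this toy setup the guided result moves toward the verifier's action while remaining closer to the base mode than either the average or the override, which is the qualitative behavior the ablation attributes to guided diffusion.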
On a failed Place Apple in Drawer rollout, the base policy π0 only partially opens the drawer, causing the apple to fall out. EVE detects an MMD spike and intervenes: the Primitive steerer proposes a "Nudge Left" action toward the handle, while the Pivot steerer selects trajectories that maximize outward pulling of the drawer. The averaged feedback guides denoising so the drawer opens fully and the apple is placed successfully — showing how complementary verifier guidance recovers a failed trajectory.
On a failed Move Near rollout, the base policy executes a faulty grip and the can tips over. EVE detects an MMD spike during the approach phase. Here the Primitive steerer proposes a corrective "Nudge Left", while the Pivot steerer incorrectly selects an unhelpful trajectory. Because EVE aggregates feedback across verifiers, the averaged signal still yields a small leftward correction that achieves a stable grasp — illustrating EVE's robustness to imperfect verifier feedback.
@article{ali2025eve,
title={EVE: A Generator-Verifier System for Generative Policies},
author={Ali, Yusuf and Patlin, Gryphon and Kothuri, Karthik and Coholich, Jeremiah and Irshad, Muhammad Zubair and Liang, Wuwei and Kira, Zsolt},
journal={arXiv preprint arXiv:2512.21430},
year={2025},
url={https://arxiv.org/abs/2512.21430}
}