MIT proposes VPO: Vectorized rewards replace scalars to maintain diversity in LLM test-time search

On May 21, researchers including Ryan Bahlous-Boldi, a Ph.D. student at MIT CSAIL, posted a paper on arXiv (arXiv:2605.22817) introducing Vector Policy Optimization (VPO). Mainstream LLM post-training methods such as GRPO collapse all reward signals into a single scalar before optimization. This lowers the entropy of the model’s output distribution and limits the diversity of its solutions, so models perform poorly when inference requires searching for the best answer among multiple candidates (as measured by pass@k or best-of-k). VPO’s core insight is that rewards in practice already have vector structure: in code generation, each test case yields its own pass/fail outcome, and in other settings there may be multiple user preference models. By randomly scalarizing these vectors and training jointly under different reward weight distributions, VPO pushes generated candidates to specialize toward distinct regions of the reward space, boosting diversity while maintaining output quality. VPO can replace GRPO’s advantage estimator directly, at relatively low implementation cost. On LiveCodeBench, VPO outperforms the GRPO baseline on pass@k, and it maintains greater diversity in reward space across various task domains. Soheil Feizi, a professor at the University of Maryland, commented on Twitter that the scalar view of rewards is “inherently lossy,” noting that VPO, alongside methods like GEPA, points toward redefining “rewards” as structured feedback objects.
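To make the contrast concrete, here is a minimal Python sketch of the general idea of random scalarization inside a GRPO-style group-relative advantage. It is not the paper's estimator: the Dirichlet weight distribution, the single weight draw per group, and every function name below are assumptions chosen for illustration.

```python
import random
from statistics import fmean, pstdev

def group_advantages(scalars, eps=1e-8):
    """GRPO-style group-relative advantage: standardize rewards within one group."""
    mu, sigma = fmean(scalars), pstdev(scalars)
    return [(s - mu) / (sigma + eps) for s in scalars]

def sample_simplex_weights(dim, rng, alpha=1.0):
    """Draw weights from a symmetric Dirichlet(alpha) via normalized Gamma draws.
    The choice of a Dirichlet is an assumption, not taken from the paper."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def random_scalarized_advantages(reward_vectors, rng, alpha=1.0):
    """Hypothetical helper: scalarize each reward vector with one random
    weight vector, then compute group-relative advantages on the result."""
    w = sample_simplex_weights(len(reward_vectors[0]), rng, alpha)
    scalars = [sum(wj * rj for wj, rj in zip(w, r)) for r in reward_vectors]
    return group_advantages(scalars), w

# Four candidates scored on two test cases (1 = pass, 0 = fail).
rewards = [[1, 0], [0, 1], [1, 1], [0, 0]]

# Fixed scalar reward (mean pass rate): candidates 0 and 1 are indistinguishable.
fixed = group_advantages([fmean(r) for r in rewards])

# Random scalarization: this draw of weights favors one test over the other,
# so the two partial solutions receive different advantages.
rng = random.Random(0)
adv, w = random_scalarized_advantages(rewards, rng)
print("fixed :", fixed)
print("random:", adv, "weights:", w)
```

With a fixed average, the two partial solutions that pass different tests get the same advantage, so nothing rewards the policy for covering both. Under a random weighting, each draw favors one of them, which is one plausible reading of how training across many weightings could encourage candidates to specialize.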

arXiv | X (@RyanBoldi)