On May 21, researchers including Ryan Bahlous-Boldi, a Ph.D. student at MIT CSAIL, published a paper on arXiv (arXiv:2605.22817) introducing the Vector Policy Optimization (VPO) algorithm. Current mainstream paradigms for post-training LLMs, such as GRPO, compress all reward signals into a single scalar up front. This leaves the model's output distribution with low entropy and limited solution diversity, so models perform poorly when inference involves searching for the best answer among multiple candidates (as measured by pass@k or best-of-k). VPO's core insight is that rewards are often vector-valued in practice: in code generation, each test case yields its own pass/fail outcome, and in preference tuning there may be multiple user preference models. By randomly scalarizing these vectors and training jointly under different distributions of reward weights, VPO pushes the generated candidates to specialize toward distinct regions of the reward space, boosting diversity while maintaining output quality. VPO can directly replace GRPO's advantage estimator at relatively low implementation cost. On LiveCodeBench, VPO outperforms the GRPO baseline on pass@k, and it maintains greater diversity in reward space across various task domains. Soheil Feizi, a professor at the University of Maryland, commented on Twitter that the scalar reward perspective is "inherently lossy," noting that VPO and methods like GEPA point toward redefining "rewards" as structured feedback objects.
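To make the idea of random scalarization concrete, here is a minimal Python sketch. It is not the paper's implementation: the Dirichlet weight sampling, the single weight vector per group, and every function name below are assumptions chosen for illustration. It contrasts a fixed-average scalar reward with a randomly weighted one, then applies the same GRPO-style group normalization to both.

```python
import random
import statistics


def group_advantages(scalars, eps=1e-8):
    """GRPO-style group-relative advantage: standardize rewards within one group."""
    mean = statistics.fmean(scalars)
    std = statistics.pstdev(scalars)
    return [(s - mean) / (std + eps) for s in scalars]


def sample_simplex_weights(dim, rng, alpha=1.0):
    """Draw weights from a symmetric Dirichlet(alpha) via normalized Gamma draws.

    Assumption: the summary does not say which weight distribution VPO uses.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vec, weights):
    """Collapse one reward vector to a scalar with the given weights."""
    return sum(w * r for w, r in zip(weights, reward_vec))


def random_scalarized_advantages(reward_vectors, rng):
    """Hypothetical drop-in for the advantage step: sample weights, then normalize."""
    dim = len(reward_vectors[0])
    weights = sample_simplex_weights(dim, rng)
    scalars = [scalarize(v, weights) for v in reward_vectors]
    return weights, group_advantages(scalars)


# Toy group: 4 sampled programs scored on 3 test cases (1 = pass, 0 = fail).
group = [[1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0]]

# Fixed scalarization: average pass rate, as a scalar-reward baseline would use.
fixed = group_advantages([sum(v) / len(v) for v in group])

# Random scalarization: a different weighting can favor a different candidate.
rng = random.Random(0)
weights, randomized = random_scalarized_advantages(group, rng)

print("fixed     :", [round(a, 3) for a in fixed])
print("weights   :", [round(w, 3) for w in weights])
print("randomized:", [round(a, 3) for a in randomized])
```

Under the fixed average, the candidates `[0, 0, 1]` and `[1, 0, 0]` receive the same advantage even though they pass different tests. With sampled weights they are generally ranked differently from one draw to the next, which is the mechanism the summary credits for pushing candidates toward distinct regions of the reward space.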