|
MIT proposes VPO: Vectorized rewards replace scalars to maintain diversity in LLM test-time search
|
May 23, 2026
|
|
A team from three institutions proposes VDT, a new generative modeling framework that unifies optimal control and optimal transport via linear programming, resulting in shorter paths and faster inference.
|
May 22, 2026
|
|
Former DeepMind VP Nando de Freitas: Pure imitation learning can lead to reward-maximizing behavior without handcrafted reward functions
|
May 22, 2026
|
|
Alibaba releases closed-source model Qwen3.7-Max and increases investment in reinforcement learning compute
|
May 21, 2026
|
|
Two engineers reproduced OpenAI’s Goblins issue at a training cost of just 49 cents
|
May 21, 2026
|