Nando de Freitas, AI researcher and former Vice President at Google DeepMind, posted research notes on May 22 on his personal research website, love4all.ai. The notes address a core question in reinforcement learning (RL): can an imitation learner that learns through interaction reach behavior equivalent to reward maximization using only “world-written preference evidence,” without ever receiving a scalar reward label? According to the study, the answer is yes, provided the learner treats its own actions as interventions rather than observations. In other words, it must apply causal inference to work out how its actions affect the environment. The effect does not emerge when actions are treated as ordinary observations.

In the most relevant test environment, the notes report state-of-the-art performance, with the intervention recovery curve closely matching the teacher's utility curve. De Freitas remarked, “Perhaps one day we won’t need engineered rewards anymore.” The research materials, including a PDF, Jupyter notebooks, and TeX source, are publicly available, and the code has also been released on GitHub.
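To make the intervention-versus-observation distinction concrete, here is a minimal toy sketch in Python. It is not taken from de Freitas's notes or code: the two-action setup, the hidden confounder `u`, and every probability below are assumptions chosen purely for illustration. It shows how conditioning on a logged action (observation) and forcing that action (intervention) can rank the same two actions in opposite order.

```python
# Toy confounded decision problem (hypothetical numbers, for illustration only).
# A hidden variable u influences both which action shows up in logged data
# and how likely the outcome is to be good.

P_U = {0: 0.5, 1: 0.5}  # prior over the hidden confounder u


def p_action_given_u(a, u):
    """Logged behavior: the action usually matches the confounder."""
    return 0.9 if a == u else 0.1


def p_success(a, u):
    """Chance of a good outcome: u helps a lot, action 0 helps a little."""
    return 0.2 + 0.5 * u + (0.2 if a == 0 else 0.0)


def observational(a):
    """P(success | A = a): the action is treated as an ordinary observation."""
    num = sum(P_U[u] * p_action_given_u(a, u) * p_success(a, u) for u in (0, 1))
    den = sum(P_U[u] * p_action_given_u(a, u) for u in (0, 1))
    return num / den


def interventional(a):
    """P(success | do(A = a)): the action is set by the learner, so it carries
    no information about u and we average over the prior on u."""
    return sum(P_U[u] * p_success(a, u) for u in (0, 1))


def best_action(score):
    """Pick the action with the higher score under the given estimator."""
    return max((0, 1), key=score)


if __name__ == "__main__":
    for a in (0, 1):
        print(a, round(observational(a), 3), round(interventional(a), 3))
    print("observational choice:", best_action(observational))
    print("interventional choice:", best_action(interventional))
```

In this toy, the observational estimate favors action 1 only because action 1 tends to co-occur with a favorable hidden state, while the interventional estimate favors action 0, which is the action that actually improves the outcome. A learner that reads its own actions as observations would inherit the spurious ranking.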