Former DeepMind VP Nando de Freitas: Pure imitation learning can lead to reward-maximizing behavior without handcrafted reward functions

Nando de Freitas, an AI researcher and former Vice President at Google DeepMind, posted research notes on May 22 on his personal research website, love4all.ai. The notes address a core question in reinforcement learning (RL): can an imitation learner that learns through interaction arrive at behavior equivalent to reward maximization using only ‘world-written preference evidence,’ without ever receiving a scalar reward label? According to the study, the answer is ‘yes,’ provided the learner treats its own actions as ‘interventions’ rather than mere ‘observations,’ that is, it uses causal inference to work out how its actions change the environment. The effect does not emerge when actions are treated as ordinary observations. In the most relevant test environment, the experiment achieved state-of-the-art performance, with intervention recovery curves closely matching teacher utility curves. De Freitas remarked, “Perhaps one day we won’t need engineered rewards anymore.” The research materials, including a PDF, Jupyter notebooks, and TeX source, are publicly available, and the code has also been released on GitHub.
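The distinction between treating an action as an observation and treating it as an intervention can be illustrated with a toy simulation. The sketch below is not taken from de Freitas's notes or code. It is a minimal, hypothetical example of the general causal-inference point, and every name and number in it (`outcome`, `simulate`, the probabilities, the hidden context `u`) is an assumption made for illustration. A hidden context influences both the logged action and the outcome, so conditioning on the action in logged data gives a misleading picture, while choosing the action independently of the context reveals its actual effect.

```python
import random

def outcome(u, a, rng):
    """Binary outcome; success probability depends on hidden context u and action a.
    Illustrative numbers only: action 1 slightly lowers the success probability."""
    p = 0.2 + 0.6 * u - 0.1 * a
    return 1 if rng.random() < p else 0

def simulate(n, intervene, seed=0):
    """Return the mean outcome per action over n episodes.

    intervene=False: the action follows the hidden context u (observational data).
    intervene=True:  the action is chosen independently of u, i.e. do(A=a).
    """
    rng = random.Random(seed)
    totals = {0: 0, 1: 0}
    counts = {0: 0, 1: 0}
    for _ in range(n):
        u = 1 if rng.random() < 0.5 else 0
        if intervene:
            a = rng.randrange(2)                     # action set by the learner
        else:
            a = u if rng.random() < 0.9 else 1 - u   # action correlated with u
        y = outcome(u, a, rng)
        totals[a] += y
        counts[a] += 1
    return {a: totals[a] / counts[a] for a in (0, 1)}

observed = simulate(20000, intervene=False)
intervened = simulate(20000, intervene=True)
print("observational mean outcome per action:", observed)
print("interventional mean outcome per action:", intervened)
```

Under these assumed numbers, the observational estimate makes action 1 look better because it mostly co-occurs with the favorable context, whereas the interventional estimate shows that action 0 yields the higher outcome. A learner that only conditions on its actions would therefore rank them in the wrong order in this toy setting.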

love4all.ai | GitHub