Goodfire paper: Concepts in large models form curved manifolds; SAEs approximate them by "fragment tiling", and manifold discovery is recast as an inverse Ising problem

On April 30, the Goodfire research team posted a paper to arXiv titled ‘Do Sparse Autoencoders Capture Concept Manifolds?’, a systematic examination of the geometric structure of concepts inside large language models. The study finds that concepts in these models are not organized along independent linear directions, as the prevailing ‘linear representation hypothesis’ assumes; instead, they form curved, high-dimensional manifolds. Sparse autoencoders (SAEs), currently a key tool in interpretability research, cannot capture such curved structures directly. Instead, they approximate each manifold by ‘shattering’ it into multiple linear segments that ‘tile’ it. The paper formally characterizes this mechanism and illustrates it with visualizations of concept manifolds spanning a historical timeline from 1800 to 1998.
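
As a rough illustration of what tiling a curved manifold with linear pieces means, the toy sketch below covers a unit circle with a handful of fixed directions and reconstructs each point from only its best-matching direction. Everything here is an assumption made for illustration: the circle, the hand-built dictionary, the top-1 sparsity rule, and the function names are hypothetical, and no SAE is trained. It is not the paper's method.

```python
import numpy as np

def circle_points(n):
    """Return n evenly spaced points on the unit circle, shape (n, 2)."""
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.stack([np.cos(theta), np.sin(theta)], axis=1)

def top1_code(x, atoms):
    """Encode each point with its single best-aligned atom (toy sparsity rule)."""
    scores = x @ atoms.T                       # (n_points, n_atoms) dot products
    idx = scores.argmax(axis=1)                # which atom "fires" for each point
    coef = scores[np.arange(len(x)), idx]      # activation of the firing atom
    recon = coef[:, None] * atoms[idx]         # linear reconstruction from one atom
    return idx, recon

def tiling_error(n_atoms, n_points=3600):
    """Tile the circle with n_atoms directions and measure reconstruction error."""
    x = circle_points(n_points)
    atoms = circle_points(n_atoms)             # hypothetical hand-built dictionary
    idx, recon = top1_code(x, atoms)
    err = np.linalg.norm(x - recon, axis=1)
    return idx, err

for k in (4, 8, 16):
    idx, err = tiling_error(k)
    # Each atom covers one arc of the circle; more atoms give smaller arcs and lower error.
    print(k, len(np.unique(idx)), float(err.max()))
```

In this toy setting each direction is active only on its own arc, so one smooth concept (the angle) is split across many separate "features", and the worst-case error shrinks only as more fragments are added.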

In addition, the paper reframes unsupervised manifold discovery as an ‘inverse Ising problem’, borrowing inference frameworks from statistical physics to put it on a more analytically tractable theoretical footing. Goodfire has also open-sourced an automated shape-search tool that identifies the underlying geometric structures in model activations, and its Silico platform offers manifold discovery as a managed service. Goodfire’s SAE tools are currently used to analyze internal representations in models such as Llama 3.3 70B. The research gives a more systematic, geometric account of the limitations of SAEs and points toward the next frontier in mechanistic interpretability: unsupervised recovery of feature geometry directly from activations.
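
For readers unfamiliar with the term, an inverse Ising problem asks for the pairwise couplings of a binary (±1) model given only its observed statistics. The sketch below shows the generic textbook version on a tiny example, using the standard naive mean-field estimate J ≈ -(C⁻¹) off the diagonal, where C is the connected correlation matrix. This summary does not say how the paper maps model activations to spins or which inference scheme it uses, so the 4-spin ring, the coupling value 0.2, and the function names are all hypothetical.

```python
import itertools
import numpy as np

def exact_correlations(J):
    """Exact means and connected correlations of a small zero-field Ising model."""
    n = J.shape[0]
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))  # (2^n, n)
    # For symmetric J with zero diagonal, 0.5 * s^T J s counts each pair once.
    energy = -0.5 * np.einsum("si,ij,sj->s", states, J, states)
    weights = np.exp(-energy)
    probs = weights / weights.sum()
    mean = probs @ states
    second_moment = (states * probs[:, None]).T @ states
    return mean, second_moment - np.outer(mean, mean)

def naive_mean_field_couplings(C):
    """Naive mean-field inverse Ising estimate: J_ij = -(C^-1)_ij for i != j."""
    J_est = -np.linalg.inv(C)
    np.fill_diagonal(J_est, 0.0)
    return J_est

# Hypothetical ground truth: 4 spins on a ring with weak nearest-neighbour coupling.
n = 4
J_true = np.zeros((n, n))
for i in range(n):
    j = (i + 1) % n
    J_true[i, j] = J_true[j, i] = 0.2

mean, C = exact_correlations(J_true)           # the "forward" problem
J_est = naive_mean_field_couplings(C)          # the "inverse" problem
print(np.round(J_est, 3))
```

The estimate is approximate and is only expected to be close when couplings are weak, as they are here; in practice the correlations would come from data rather than exact enumeration.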

GoodfireAI on X | arXiv 2604.28119