Falsifying Sparse Autoencoder Reasoning Features in Language Models
Published in ICML, 2026
Sparsity-biased SAEs tend to latch onto low-dimensional cue tokens that co-occur with reasoning; we find that most contrastive “reasoning features” are largely explained by such cues rather than by robust reasoning signals.
Recommended citation: George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi (2026). Falsifying Sparse Autoencoder Reasoning Features in Language Models. In Forty-Third International Conference on Machine Learning. https://openreview.net/forum?id=TCFtA9CI3U
