Do Sparse Autoencoders Identify Reasoning Features in Language Models?
Published in ICML, 2026
Sparsity-biased SAEs tend to latch onto low-dimensional cue tokens that co-occur with reasoning. We find that most contrastive “reasoning features” are largely explained by such cues rather than by robust reasoning signals.
Recommended citation: George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi (2026). Do Sparse Autoencoders Identify Reasoning Features in Language Models? In Forty-Third International Conference on Machine Learning. https://openreview.net/forum?id=TCFtA9CI3U
