Revising and Falsifying Sparse Autoencoder Feature Explanations
Published in NeurIPS, 2025
We developed new methods to refine and falsify sparse autoencoder feature explanations, improving the interpretability of large language models.
Recommended citation: George Ma, Samuel Pfrommer, Somayeh Sojoudi (2025). Revising and Falsifying Sparse Autoencoder Feature Explanations. In Thirty-ninth Conference on Neural Information Processing Systems. https://openreview.net/forum?id=OJAW2mHVND
