Do k-Sparse Autoencoders Reveal Thinking Patterns? Interpretable Features in a Small Reasoning Model

This article explores whether k-Sparse Autoencoders, a type of model used primarily in mechanistic interpretability, can extract interpretable features related to the thinking process of a small reasoning model. Experiments with DeepSeek R1 Distill Qwen 1.5B and EleutherAI's k-SAEs show that certain latent features activate strongly on tokens characteristic of the model's reasoning process.
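To make the setup concrete, below is a minimal sketch of the TopK sparse-autoencoder mechanism in PyTorch: encode a residual-stream activation, keep only the k largest latents, and decode. The class name, latent width, layer index, and probe text are illustrative assumptions for this post, not the actual experiment code or EleutherAI's API; a trained SAE is needed for meaningful feature activations.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer


class TopKSAE(nn.Module):
    """Minimal k-sparse autoencoder: encode, keep top-k latents, decode."""

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model, bias=False)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.encoder(x)
        # Keep only the k largest pre-activations per token; zero the rest.
        vals, idx = pre.topk(self.k, dim=-1)
        latents = torch.zeros_like(pre)
        latents.scatter_(-1, idx, torch.relu(vals))
        return latents

    def forward(self, x: torch.Tensor):
        latents = self.encode(x)
        return self.decoder(latents), latents


# Hypothetical usage: collect hidden states from the reasoning model and
# inspect which latents fire on "thinking" tokens.
name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

text = "<think> Wait, let me reconsider the previous step. </think>"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states[12]  # layer choice is illustrative

d_model = hidden.shape[-1]
sae = TopKSAE(d_model=d_model, d_latent=16 * d_model, k=32)
_, latents = sae(hidden)
# Latents that are consistently nonzero on reasoning tokens ("Wait",
# "let me reconsider", ...) are candidate "thinking" features.
```

The key design choice is the hard top-k constraint in `encode`: unlike an L1-penalized SAE, it fixes the number of active latents per token exactly, which makes it straightforward to ask which of those few latents co-occur with reasoning tokens.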