
Peeking Inside the AI Mind

28 May 2025
Vinay Shashidhar
Director, ML. Building Clear.

We've all been amazed, and perhaps a little nervous too, at the capabilities of Large Language Models (LLMs). They can write poetry, generate code, and answer complex questions. For a piece of code that is supposedly just predicting the next token, it excels at a remarkable range of tasks. That raises the question: do they actually just predict the next token? Are they planning ahead of time? How do they actually do it? Is the reasoning in a “reasoning model” an afterthought, or did the LLM actually follow the sequence of steps?

This opacity isn't just a matter of academic curiosity. It has real-world implications for safety, reliability, and our ability to trust these powerful tools. If we don't understand why an LLM says what it says, how can we be sure it isn't biased or generating misinformation, and, more importantly, how can we control what it says?

Anthropic has taken a significant step towards demystifying these complex systems. I would highly recommend that everyone at least read their blog post. They have also published two papers: Circuit Tracing: Revealing Computational Graphs in Language Models and On the Biology of a Large Language Model. I am personally surprised that more people are not talking about this work.

The Challenge: Understanding the Unseen

Imagine trying to understand how a human brain forms a thought: while there is groundbreaking research in this area, we still have a lot to learn. Our brain is an incredibly complex network of neurons firing in intricate patterns. LLMs, while inspired by neural networks, have their own unique architecture, often involving billions of parameters. Identifying specific, human-understandable concepts within these vast networks has been a formidable challenge.

The Inspiration: Looking to Life Itself

The researchers draw an analogy from the world of biology, specifically Gene Regulatory Networks (GRNs). To understand this better, think about how a single fertilized egg develops into an incredibly complex organism with specialized cells, tissues, and organs. Genes don't act in isolation. They influence each other, turning other genes on or off in intricate cascades. A GRN is essentially a map of these genetic interactions, showing which genes regulate which others, ultimately leading to specific cellular functions and observable traits. Biologists use these networks to understand development, disease, and the fundamental workings of life.

The Methodology: Building an "AI Microscope"

To tackle the challenge of understanding LLMs, researchers are developing tools analogous to those used in biology to map complex systems. At the heart of this new methodology is the idea of charting out the internal computational pathways.

What are Attribution Graphs?

Taking a cue from concepts like GRNs, an Attribution Graph is, at its core, a map of influence within a Transformer model.

  • Nodes in this graph represent the key components of the model (e.g., specific attention heads in particular layers, or MLP layers).
  • Edges (the connections between nodes) represent the strength and direction of influence. An edge from component A to component B means A significantly contributes to B's activation or behavior.

By constructing these graphs, researchers can start to trace the pathways of information and causation through the network. They can see how an input (like a prompt) activates certain components, which in turn influence others, leading step-by-step to the model's output. It’s like finally getting a circuit diagram for a complex electronic device, allowing us to see how different parts work together to achieve a function.
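
To make the "circuit diagram" idea concrete, here is a toy sketch of an attribution graph as a weighted directed graph. The node names, edge weights, and the path-tracing helper are invented purely for illustration; the real graphs in the papers are built from measured attributions over actual model components and features.

```python
# Toy attribution graph: nodes are model components (or interpretable
# features), and each weighted edge records how strongly one node's
# activation contributes to another's. All names and weights are made up.
attribution_graph = {
    # (source, target) -> attribution weight
    ("input: 'Golden Gate'", "feature: SF landmarks"): 0.82,
    ("feature: SF landmarks", "feature: Golden Gate Bridge"): 0.71,
    ("feature: Golden Gate Bridge", "output: 'Bridge'"): 0.64,
    ("input: 'Golden Gate'", "output: 'Bridge'"): 0.05,  # weak direct path
}

def trace_paths(graph, source, target, path=None, strength=1.0):
    """Enumerate influence paths from source to target, multiplying edge
    weights along the way to get a rough path strength."""
    path = (path or []) + [source]
    if source == target:
        yield path, strength
        return
    for (s, t), w in graph.items():
        if s == source and t not in path:
            yield from trace_paths(graph, t, target, path, strength * w)

for path, strength in trace_paths(attribution_graph,
                                  "input: 'Golden Gate'", "output: 'Bridge'"):
    print(" -> ".join(path), f"(path strength ~ {strength:.2f})")
```

Reading off the strongest path gives exactly the kind of step-by-step story described above: the input activates a landmark-related feature, which activates a bridge-specific feature, which in turn drives the output.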

How Did They Do It?

The creation and analysis of these attribution graphs involve several sophisticated techniques. I have tried to explain them with analogies below, but this is still a very high-level overview; I highly recommend reading the papers for the full details.

  • Extracting Interpretable Features: A crucial first step is to identify meaningful concepts within the model. One method involves training sparse autoencoders on the internal activations of models like Claude Sonnet. These autoencoders learn to represent the LLM's internal states using a much smaller, more interpretable set of "features." In simpler terms: imagine you have a giant paragraph (the LLM's internal activation). It's hard to quickly grasp its main points. A sparse autoencoder acts like a summarizer. It reads the giant paragraph, identifies a few keywords (the "sparse features") that capture its essence, and then tries to rewrite the original paragraph using only those keywords. If the rewrite is good, those few keywords really are the important ones. Researchers can then study these "keywords" (the sparse features) to understand what core concepts the LLM is using in its internal "thoughts." This helps demystify what's happening inside the LLM (a minimal code sketch follows after this list).

  • Cross-Layer Transcoders (CLTs) and Replacement Models: To make the complex model more tractable, researchers build an interpretable "replacement model." This often involves Cross-Layer Transcoders (CLTs), which extract features and capture how they interact across different layers of the model. Continuing our previous example: the LLM isn't just one paragraph but a whole sequence of connected paragraphs, like chapters in a book. Each "paragraph" is the activation state of a different layer, building on the previous one. CLTs are like specialized literary analysts who study the connections between paragraphs (layers). They don't just summarize one paragraph; instead, they try to figure out how the "keywords" found in Paragraph 1 (an early layer) get transformed or combined to produce the "keywords" seen in Paragraph 5 (a later layer). From these pieces, researchers build the interpretable "replacement model" (think CliffsNotes or a detailed study guide for a specific section of the book), making that section of the complex "book" (the LLM) much easier to comprehend (see the sketch after this list).

  • Feature Visualization and Labeling: Once features are extracted, they are visualized and labelled with human-interpretable concepts, ranging from concrete objects to abstract ideas. This is like finding all the sentences (input examples) where a particular "keyword" is most strongly "lit up" or active within the LLM's "thoughts." By looking at many such examples, researchers might see that the keyword consistently appears when the LLM is discussing, say, "ancient stone castles," "dogs playing in a park," or, more abstractly, "a sense of injustice." Based on the common theme across these examples, they "label" the keyword with a human-interpretable concept. So an initially obscure numerical feature might be labeled "Feature 123: 'Imagery of Medieval Architecture'" (a concrete object category) or "Feature 456: 'Concept of Betrayal'" (an abstract idea). A small code sketch of this lookup follows after this list.

  • Linking Concepts into Circuits: Beyond identifying individual features, researchers also link them into computational "circuits." This allows them to trace the pathway from input words to output words, showing how different concepts interact and build upon each other. Think of this as going beyond just identifying the themes ("keywords") in our book: researchers try to discover the specific "narrative pathways" within the LLM. For instance, they might find that when the "input words" (the prompt) introduce a character described with terms that activate the "keyword" for 'ambition', this consistently leads to the activation of the "keyword" for 'risk-taking behaviour' a few "paragraphs" (layers) later. This, in turn, might frequently interact with an activated "keyword" for 'external threat', ultimately contributing to the "keyword" for 'heroic sacrifice' being prominent when the LLM generates its "output words." By tracing these connections, they are essentially drawing a "circuit diagram" of the story, showing how one concept (like 'ambition') computationally triggers and combines with other concepts (like 'risk-taking' and 'external threat') to build a coherent narrative or logical conclusion, effectively mapping the journey from the initial premise to the final output.

  • Perturbation Experiments: Inspired by neuroscience, researchers conduct perturbation experiments. They modify the internal states of the model (e.g., activating or deactivating specific features) and observe the resulting changes in output. This helps validate the hypothesized role of different features and circuits. Continuing with our analogy →

    1. "Deactivate" a specific feature: Imagine our LLM is about to write a scene, and the "keyword" for 'Concept of Betrayal' is naturally becoming active based on the story so far. The researcher might step in and say, "For this next passage, let's artificially suppress or mute the 'Betrayal' keyword. Don't let the LLM 'feel' or express betrayal right now."
    2. "Activate" a specific feature: Conversely, they could say, "Even if the current story doesn't strongly suggest it, let's artificially boost the 'Hero's Courage' keyword here and see what happens."

    It's like asking: "If we remove this thematic element (keyword) or alter this plot device (circuit), does the story change in the way we predicted?" This experimental approach allows researchers to confirm whether their understanding of how these internal concepts influence the LLM's final "narrative" (output) is accurate. A minimal steering sketch follows after this list.
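
To ground the "summarizer" analogy for sparse autoencoders, here is a minimal sketch in PyTorch. This is not Anthropic's implementation: the layer sizes, the L1 sparsity penalty, and the toy training step are simplifying assumptions meant only to show the shape of the idea.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder: encodes an LLM activation vector into a
    wider, mostly-zero feature vector, then reconstructs the activation."""
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activation):
        features = torch.relu(self.encoder(activation))  # the sparse "keywords"
        reconstruction = self.decoder(features)           # the "rewrite"
        return features, reconstruction

def sae_loss(activation, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction term keeps the "summary" faithful to the original;
    # the L1 term pressures most features to stay at zero (sparsity).
    recon = torch.mean((activation - reconstruction) ** 2)
    sparsity = l1_coeff * features.abs().mean()
    return recon + sparsity

# Toy training step on random vectors standing in for real LLM activations.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
batch = torch.randn(32, 512)  # pretend residual-stream activations
features, recon = sae(batch)
loss = sae_loss(batch, recon, features)
opt.zero_grad()
loss.backward()
opt.step()
```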
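
The cross-layer transcoder idea can be sketched in the same spirit: sparse features read from the residual stream at one layer are decoded into predicted MLP outputs for that layer and the layers after it. The shapes, layer indexing, and missing training loop below are again assumptions for illustration, not the paper's exact construction.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Sketch of a cross-layer transcoder: features extracted at one layer
    help explain (predict) MLP outputs at that layer and all later layers."""
    def __init__(self, d_model=512, d_features=2048, n_layers=12, layer_idx=3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        # One decoder per downstream layer whose MLP output the features predict.
        self.decoders = nn.ModuleList(
            [nn.Linear(d_features, d_model) for _ in range(layer_idx, n_layers)]
        )

    def forward(self, residual_at_layer):
        features = torch.relu(self.encoder(residual_at_layer))
        predicted_mlp_outputs = [dec(features) for dec in self.decoders]
        return features, predicted_mlp_outputs

clt = CrossLayerTranscoder()
resid = torch.randn(8, 512)                      # toy layer-3 residual stream
feats, preds = clt(resid)
print(feats.shape, len(preds), preds[0].shape)   # (8, 2048), 9 layers, (8, 512)
```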
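
Feature labeling is, mechanically, a retrieval step: for a chosen feature, find the inputs on which it fires most strongly and look for the common theme. The helper below is hypothetical, with made-up texts and random activations standing in for real encoder outputs.

```python
import torch

def top_activating_examples(feature_idx, texts, activations, k=5):
    """Return the k text snippets on which the given feature fires most
    strongly. `activations` is a (num_texts, num_features) tensor, e.g.
    the encoder outputs of a trained sparse autoencoder."""
    scores = activations[:, feature_idx]
    top = torch.topk(scores, k=min(k, len(texts)))
    return [(texts[i], scores[i].item()) for i in top.indices]

# Toy data: in practice the texts are real model inputs and the activations
# come from running the model (plus the autoencoder) over them.
texts = ["a ruined medieval castle", "dogs playing in the park",
         "the Golden Gate Bridge at dawn", "a tale of betrayal at court"]
activations = torch.rand(len(texts), 4096)
for text, score in top_activating_examples(feature_idx=123, texts=texts,
                                           activations=activations, k=2):
    print(f"{score:.2f}  {text}")
```

A researcher reading the top examples for a feature would then attach a label like 'Imagery of Medieval Architecture' by hand, or with the help of another model.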
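
Finally, the perturbation experiments amount to clamping a feature's activation during a forward pass and watching how the output changes. The sketch below shows only the clamping step; in a real experiment the steered features would be decoded back into the model's activations mid-generation. The feature indices and values are hypothetical.

```python
import torch

def steer_feature(features, feature_idx, value):
    """Clamp one feature's activation: 0.0 to "mute" it (e.g. suppress a
    'Betrayal' feature), or a large positive number to artificially boost
    it (e.g. amplify a 'Hero's Courage' feature)."""
    steered = features.clone()
    steered[..., feature_idx] = value
    return steered

# Toy demonstration on a random feature vector.
features = torch.rand(1, 4096)
muted = steer_feature(features, feature_idx=456, value=0.0)     # "deactivate"
boosted = steer_feature(features, feature_idx=789, value=10.0)  # "activate"
print(features[0, 456].item(), muted[0, 456].item(), boosted[0, 789].item())
```

Comparing the model's outputs with and without such an intervention is what lets researchers check whether a feature really plays the role its label suggests.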

Key Discoveries: Peeking into the AI Mind

Let us now dive into some of the fascinating insights highlighted in the papers →

  • Identifying and Eliciting Specific Concepts: Researchers found they could pinpoint and even activate specific concepts within the model. The most cited example is the "Golden Gate Bridge" feature: as the Anthropic team states, "When we activate a feature for the Golden Gate Bridge, our model writes about the Golden Gate Bridge." They've identified thousands of such features, ranging from other concrete entities like "the Eiffel Tower" to more abstract notions such as "gender." This demonstrates that the model internally represents these concepts in a way that can be isolated and triggered.
  • Multi-Step Reasoning and Sophisticated Planning: One very interesting insight was that LLMs aren't just reacting; they're planning.
    • Forward Planning: A clear illustration of this is how Claude approaches creative writing. The Anthropic blog notes, "For example, when Claude writes poetry, it often plans out rhymes far in advance of using them." This indicates an ability to think several steps ahead. The blog provides another example: "when asked to complete the sentence 'The first U.S. president was...' it activates a feature for 'George Washington' before it writes his name."
    • Backward Planning: The research also found evidence of models "working backward from goal states," a more complex form of planning where the model seems to start with a desired outcome and deduces the steps to get there.
  • Shared Conceptual Space Across Different "Languages": Demonstrating a surprisingly abstract level of understanding, the research found that models like Claude might develop a universal internal representation for similar concepts, even if they appear in very different domains. For instance, the Anthropic blog explains, "our model seems to use a shared conceptual space for English and Python code—even though these two 'languages' look very different on the surface." This suggests a deeper, underlying "language of thought." This abstract understanding isn't limited to natural language and code: we see similar shared representations for concepts that appear in, say, German and English, or in different programming languages like Python and Java.
  • Primitive "Metacognitive" Abilities: The models show early signs of understanding their own knowledge limitations. A common example cited by Anthropic is that "Claude’s default behavior is often to decline to speculate and only answer if it thinks it has sufficient information." This self-awareness, or primitive metacognition, is crucial for reliability.
  • The Double-Edged Sword: Fabrication and Coherence Exploits:
    • Fabricating Plausible Arguments: The research uncovered that Claude also seems to make things up: "We’ve caught our model fabricating plausible-sounding arguments, even when it knows its conclusion is false," as stated in the Anthropic blog.
    • Exploitable Coherence for "Jailbreaks": A model's helpful tendency can sometimes be its vulnerability. The Anthropic blog points out that "Claude’s usually-helpful tendency to try to maintain grammatical coherence across turns of conversation can be exploited by certain 'jailbreak' attacks, which trick it into harmful responses."

Why Does This Matter?

The implications of this research are far-reaching:

  • Enhanced Interpretability and Trust: We can finally start to see how an LLM arrives at an answer, moving away from the "black box" paradigm. This transparency is fundamental to building trust. A deeper understanding of the model's internal workings can help us build more robust, reliable, and trustworthy AI systems.
  • Effective Debugging: When an LLM makes a mistake or behaves unexpectedly, these techniques could help pinpoint the internal cause.
  • Steerability and Control: This research could lead to more precise control over LLM behavior, guiding models towards more helpful, truthful, and safe outputs. Understanding the internal mechanisms also helps in assessing the suitability of language models for various applications and identifying their limitations.

The Journey Ahead

Anthropic is quick to point out that this is still early-stage research. The complexity of these models means there's a long road ahead. However, the techniques presented in "Tracing 'Thoughts' in Language Models" represent a significant leap forward. It’s a pioneering effort in the quest for AI transparency and safety.