Clear AI Blog

MCP Resources Are Underrated

Ankit Solanki — Thu, 11 Dec 2025 00:00:00 GMT

MCP is here to stay. It's become a Schelling point for the community, and I feel it will continue to gain traction. Just as I was writing this, I noticed that Anthropic has donated stewardship of the MCP protocol to the Agentic AI Foundation.

We've been thinking deeply about our MCP integration story, and I have recently started to build strong conviction that MCP Resources are criminally underrated. To understand why resources are useful, let's first understand the problem with MCP tools.

MCP Tools have a context problem

A typical MCP integration works like this:

The agent's harness exposes MCP tools to the LLM provider
In response to a completion API call, the LLM may invoke an MCP tool
This tool's response is given back to the LLM

An MCP server could expose any tool. For example, a Google Drive MCP server could expose files on a connected Google Drive, and these files could vary wildly in size.

What happens if the tool response is huge?

Naive solution: send the whole response to the LLM? This is expensive: maybe the tokens consumed here could be better used elsewhere? Maybe the response is larger than the LLM supports and your agent crashes?

Another option is truncation. Either the MCP server itself truncates large results, or the agent harness truncates any result over a certain size.

This gives you safety: MCP tool calls cannot consume unlimited tokens. But this comes at the cost of context loss: who decides what to truncate? What if the most critical information needed for the agent was truncated?
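
To make the trade-off concrete, here is a minimal sketch of harness-side truncation. The function name `truncate_tool_text` and the character budget are assumptions for illustration and are not part of MCP; a real harness would more likely budget in tokens than in characters.

```python
def truncate_tool_text(text: str, max_chars: int = 2000) -> tuple[str, bool]:
    """Return (possibly truncated text, whether anything was dropped)."""
    if len(text) <= max_chars:
        return text, False
    # Keep the head of the response and tell the model that data is missing.
    dropped = len(text) - max_chars
    notice = f"\n[truncated: {dropped} characters omitted]"
    return text[:max_chars] + notice, True


# A 5000-character response squeezed into a 100-character budget.
short, was_cut = truncate_tool_text("x" * 5000, max_chars=100)
```

Note that the function keeps the head of the response blindly. It has no way of knowing whether the critical information was in the part it dropped, which is exactly the context-loss problem described above.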

MCP Resources are a solution

The MCP protocol has support for resource templates:

Resource templates allow servers to expose parameterized resources using URI templates.

Resources are exposed as URIs. Critically, a tool response can contain resource links, and the agent harness can then fetch the entire resource.

This means:

Large responses from tools can be converted to resources
Agents and agent harnesses can inspect resources directly
Depending on your system design, you could compose these together with other LLM tools and get to interesting emergent behaviour.

Here's an example:

// Request
{
  "method": "tools/call",
  "params": {
    "name": "read_google_drive_file",
    "arguments": {
      "id": "c85b4851-bd61-4352-84a2-6e0c6b6d3dce"
    }
  }
}

// Response without Resources
{
  "content": [
    {
      "type": "text",
      "text": "...... huge text"
    }
  ]
}

// Response with Resources
{
  "content": [
    {
      "type": "text",
      "text": "...... small text snippet"
    },
    {
      "type": "resource_link",
      "uri": "gdrive://file-id-123"
    }
  ]
}

This can enable composition:

The agent's harness could read the resource and save the file to disk
The agent could then use other local tools (eg: read_file, grep, execute_code, etc) to further process this file.
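
The composition above could be sketched roughly as follows. This is an illustration under assumptions, not a reference implementation: `materialise_resource_links` is a hypothetical helper, and `read_resource` is a hypothetical callable standing in for whatever your MCP client uses to issue a `resources/read` request.

```python
from pathlib import Path
from typing import Callable
import tempfile


def materialise_resource_links(
    content: list[dict],
    read_resource: Callable[[str], bytes],  # hypothetical wrapper around resources/read
    out_dir: Path,
) -> list[dict]:
    """Replace each resource_link item with a short text pointer to a local file."""
    result = []
    for index, item in enumerate(content):
        if item.get("type") != "resource_link":
            result.append(item)
            continue
        # Fetch the full resource once and keep it out of the LLM context.
        path = out_dir / f"resource_{index}.bin"
        path.write_bytes(read_resource(item["uri"]))
        result.append({"type": "text", "text": f"Saved {item['uri']} to {path}"})
    return result


# Usage with a stub in place of a real MCP client.
tool_content = [
    {"type": "text", "text": "...... small text snippet"},
    {"type": "resource_link", "uri": "gdrive://file-id-123"},
]
workdir = Path(tempfile.mkdtemp())
processed = materialise_resource_links(tool_content, lambda uri: b"full file body", workdir)
```

The LLM only ever sees the small snippet and a file path. The full file sits on disk, where local tools can process it.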

MCP Resource Links

The MCP protocol in fact has a first-party way of doing this, called resource links:

A tool MAY return links to Resources, to provide additional context or data. In this case, the tool will return a URI that can be subscribed to or fetched by the client.

Most MCP servers don't expose resources though. This part of the protocol seems really unexplored and under-appreciated.

Addendum: Code Execution

Anthropic recently wrote about a different pattern: using code execution to interact with MCP servers. This is a really powerful pattern, though I feel it's something that can be combined with MCP Resources.

MCP servers are fast becoming a glue layer
Any given MCP server doesn't know which type of client would connect to it.
The server doesn't know if the client is using code execution or just LLM tool calls.
Thus, well-behaved servers need to be defensive about context usage in tool calls, and large responses will end up being truncated somehow.
Resources are the officially blessed way to expose data to clients.

Code execution doesn't remove the need for MCP resources. It makes MCP resources even more useful!

We're adopting this MCP resource-link pattern heavily in servers we control. Our agents will support this pattern as well. Unfortunately, the pattern will only gain traction once MCP server authors start adopting it.

We're hoping that this post can spark a discussion about this.

MCP Integration Patterns

Ankit Solanki — Fri, 26 Sep 2025 00:00:00 GMT

We're building an agent system and I have been evaluating adding support for MCP.

We really care about our tool design to an unreasonable degree. I have personally spent days thinking about the right design for some of our tools.

I couldn't help but over-think exactly how MCP should be exposed to our agents. At a high level, I felt that there could be two integration patterns — I would name them the 'meta tool pattern' and 'materialised tools pattern'.

Meta Tool Pattern

The meta tool pattern: the agent sees a few meta tools like 'list available MCP servers', 'describe MCP server', 'invoke MCP tool'.

The agent can decide to explore the capabilities of an MCP server and invoke its tools, when necessary. This design is flexible and efficient, but also requires a more capable agent:

This pattern uses less context by default. Tool definitions are lazily loaded only when required.
The 'discovery' calls for individual MCP servers need to happen only once in a given chat thread.
There's no guarantee that the agent will decide to explore the installed MCP servers though.
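
A minimal sketch of the meta tool pattern, using an in-memory registry in place of real MCP connections. The `SERVERS` registry and the three function names are hypothetical, chosen to mirror the meta tools named above.

```python
# Hypothetical in-memory registry standing in for connected MCP servers.
SERVERS = {
    "drive": {
        "read_file": {
            "description": "Read a file by id",
            "handler": lambda args: f"contents of {args['id']}",
        },
    },
}


def list_mcp_servers() -> list[str]:
    """Cheap discovery call: only server names enter the context."""
    return sorted(SERVERS)


def describe_mcp_server(server: str) -> dict[str, str]:
    """Tool definitions are loaded lazily, only when the agent asks for them."""
    return {name: tool["description"] for name, tool in SERVERS[server].items()}


def invoke_mcp_tool(server: str, tool: str, arguments: dict) -> str:
    """Single entry point used to call any tool on any server."""
    return SERVERS[server][tool]["handler"](arguments)


# The agent is only ever shown these three tools, however many servers exist.
META_TOOLS = [list_mcp_servers, describe_mcp_server, invoke_mcp_tool]
```

The context cost stays constant as servers are added, but nothing forces the agent to call `list_mcp_servers` in the first place.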

Materialised Tools Pattern

Here, whenever an MCP server is enabled, the system automatically discovers all available tools in that MCP server and eagerly copies them to the list of tools available to the agent. The agent sees all of these tools by default. Calling an MCP tool is just like calling any other tool.

This pattern makes tools really explicit to the agent.
This comes at the cost of using additional context, even when it's not necessary.
If multiple MCP servers are installed, you may quickly run out of context space.
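
A minimal sketch of the materialised tools pattern. The function name, the server names and the `server__tool` naming scheme are assumptions for illustration; real harnesses pick their own conventions.

```python
def materialise_tools(builtin_tools: list[dict], servers: dict[str, list[dict]]) -> list[dict]:
    """Eagerly copy every tool from every enabled server into the agent's tool list."""
    tools = list(builtin_tools)
    for server_name, server_tools in servers.items():
        for tool in server_tools:
            # Prefix with the server name to avoid collisions between servers.
            tools.append({**tool, "name": f"{server_name}__{tool['name']}"})
    return tools


builtin = [{"name": "read_file", "description": "Read a local file"}]
enabled_servers = {
    "drive": [{"name": "read_file", "description": "Read a Drive file"}],
    "tracker": [{"name": "create_issue", "description": "Create an issue"}],
}
agent_tools = materialise_tools(builtin, enabled_servers)
tool_names = [tool["name"] for tool in agent_tools]
```

Every definition in `agent_tools` is sent to the LLM on each request, so the context cost grows with the number of enabled servers whether or not the tools are used.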

Evaluating these patterns

If you were designing an agentic platform, which option would you choose? I did a survey of some existing open source systems and here's what I found:

codex-cli uses the materialised tools pattern
opencode uses the materialised tools pattern
Cline uses a hybrid
- All tools exposed by enabled MCP servers are copied to the system prompt
- A single use_mcp_tool tool is used to invoke them
OpenManus also uses the materialised tools pattern.

This was a surprise. I'm not sure why the existing implementations are so heavily skewed towards materialised tools!

Our Decision

At this point I'm inclined to go with the meta tool pattern: it seems to make sense for the system we're building.

What I have noticed is that:

Most MCP servers don't have a great agent interface.
They expose too many fine grained tools with overlapping responsibilities.
They are really wasteful of context tokens.

Most importantly: the meta tool pattern just intuitively feels like the right solution to me.

There are 100s of small details like this that go into building great products – and I have a feeling that these details are actually what differentiates your product when everyone is building on the same foundation models.

The Mythical Agent Month

Ankit Solanki — Fri, 27 Jun 2025 00:00:00 GMT

The Mythical Man Month famously had the observation that adding manpower to a project that's behind schedule will often delay it even further. Additionally, in 'No Silver Bullet' Fred Brooks further states that:

There is no single development, in either technology or management technique, which by itself promises even one order-of-magnitude improvement within a decade in productivity, in reliability, in simplicity.

How does this change with the advent of AI coding agents? Can coding agents give us the mythical 10x speedup?

My thoughts below. I've divided this blog post into sections which argue both for and against transformative change.

My position: coding agents are a fundamental shift in how we'll build software over the next few decades, and we're overestimating the impact in the short term while underestimating the impact in the long term.

Headwinds

Vibe Coding vs Writing Production Software

We should differentiate between 'vibe coding' as a way to experiment / prototype, and using AI to write production software. Simon Willison has an excellent post on this.

For production quality software, all basic tenets of software engineering apply. Code reviews, tests, architecture designs, software design reviews, etc.

Most of a senior engineer's time is often spent in these activities, not just writing code.

Coding agents help a great deal here, but you won't see the same speedup as pure vibe coding, where you build software without reviewing the code your agent writes.

At least for now: coding agents aren't good enough to fully autonomously build features and ship them to production without human review.

Essential Complexity vs Incidental Complexity

As Fred Brooks pointed out in the above essay, software development consists of both essential complexity and incidental complexity.

Incidental complexity could be things like figuring out how to write a Dockerfile, or learning how a specific library works, or dealing with framework specific issues. Coding agents can be a huge help here.

Essential complexity is the core problem you're trying to solve. Coding agents can definitely help here, but you still need to pay close attention — humans will remain the bottleneck here.

Amdahl's Law basically gives a ceiling for the performance gain that automation / parallelisation gives for any given task. You are only as fast as your bottleneck.
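
As a rough worked example, Amdahl's Law says that if a fraction p of the work is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The numbers below are illustrative assumptions, not measurements of any real team.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work gets `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# Illustrative numbers only: suppose half of the work gets 10x faster.
half = amdahl_speedup(0.5, 10)

# Even an infinitely fast agent cannot beat 1 / (1 - fraction).
ceiling = 1.0 / (1.0 - 0.5)
```

With half the work accelerated 10x, the overall speedup is about 1.8x, and it can never exceed 2x no matter how fast the agent gets. The human-bound half sets the ceiling.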

Decision Fatigue and Time Compression

Faster coding actually compresses timeframes and lets you focus on the hard decisions, on the essential complexity. Coding agents let you focus on the substance of your problem.

But human capacity for deep thought is limited!

So now, your day-to-day work with AI coding tools becomes a series of hard decisions that you need to think deeply about, decisions that require a high amount of mental effort.

Decision fatigue is real. If you have to make a week's worth of hard decisions in a day, your decision quality will suffer.

AI coding will exhaust you if you're not careful. Human beings need to be able to step back and think about problems. We need to go for walks, ruminate on ideas and just wander through a problem space.

Effective Communication & User Skill

AI agents need engineers to be effective communicators, and this is a problem. Most engineers aren't the best communicators. Not every great coder is automatically great at delegation.

Effective communication is a skill. Writing clearly is a skill. And using coding agents effectively is a skill.

For example, here are two recent articles that go in-depth about the craft of using AI agents to code:

Craftsmanship takes time. Skills take time to build. People who are great engineers today won't automatically be great at using coding agents. Getting better will require deliberate practice, and approaching this problem with a beginner's mindset.

Headwinds Summary

Given that:

Production software (currently) requires human oversight
Essential complexity remains
It will take time to learn how to use the coding agents effectively

Is an immediate 10x improvement in velocity possible? It seems there is truly no silver bullet.

Tailwinds

AI Scaling will continue

Agents will keep getting better. Underlying models will keep getting better. We've learned to not bet against scaling.

According to one recent viral benchmark, the length of tasks that AI can do uninterrupted is doubling every 7 months.

From my personal experience, I know that each recent big model release (eg: Sonnet 3.5, Sonnet 3.7, Sonnet 4.0) has made building agents easier. The LLMs are getting better at following instructions, at using tools, at planning, and just at showing agency.

It's hard to predict the future, but it's definitely possible that soon, a large majority of code written won't need human review and oversight.

Most work isn't 'Deep Work'

While my arguments above hold (essential complexity remains, decision fatigue is real) — let's be real and acknowledge the fact that most of us don't do deep work 100% of the time.

A lot of time goes into glue work, into getting various subsystems to behave, dealing with broken tools, etc.

AI can be a huge accelerator for these types of work. It's possible that this itself can be a 10x improvement for many organisations!

Quantity has its own Quality

If you're working in deep tech, if you're building something complex — AI coding allows you to try more approaches. You can build quick and dirty throwaway prototypes, and validate more ideas.

Quantity has a quality of its own. If you can do more iterations, you can get to better decisions. If you can actually build multiple candidate systems, you can make more informed choices — architecture / design decisions become easier with data.

I have personally seen this pay off: while building a zero to one product, I have been able to do many parallel experiments and actually test 10s of ideas before deciding upon a plan. This has enabled me to make bold system design bets with high confidence.

Ambition & Moonshots

AI agents allow you to be more ambitious. On the margin, with higher productivity it's possible to devote more time and resources towards building something better than you would have previously.

You can think of a productivity gain as either:

Work 10x faster
Build something 10x better

Either option is fine! In fact, it may be the case that building something 10x better is actually more impactful.

I suspect one of the impacts of ubiquitous coding agents will be the rising baseline quality of software!

Tailwinds Summary

If you consider the facts that:

LLMs will continue to improve
We'll all get better at using AI agents
We'll be able to automate away low impact work
We'll be able to try many more iterations
We'll be able to build more impactful, more meaningful software

How can you doubt the impact that coding agents will have?

Conclusions

I've argued both sides of this. My position is:

We're radically underestimating coding agents
Most of us are not ready to adopt agents at scale

I think impact and adoption will not be uniform. People will have different lived experiences with AI tools, with some dismissing AI coding as a fad, and some enthusiastically thinking of these agents as a panacea for all their problems.

I think coding agents will have a huge impact that is overrated in the short term, but underrated in the long term.

And I think that today, to get the most out of current generation agents, you have to really dive deep and uncover their limits yourself.

General Purpose vs Specialised AI Agents

Ankit Solanki — Tue, 10 Jun 2025 00:00:00 GMT

Should AI agents be general purpose or should they be built for a specific task?

I would say yes to both.

Special Purpose Agents

A special purpose agent is an agent built to do one thing (or a few things) very precisely. Examples:

RAG on a knowledge base and answer questions only from the knowledge base.
Convert images to very specific structured data format (eg: convert image to a contact vCard).
Convert text to SQL on a specific table, with specific guardrails (eg: always apply page size limits).

These agents are built to do a specific job.

General Purpose Agents

A general purpose agent on the other hand would be an agent that has emergent behaviour, that can use its tools to do something it wasn't explicitly programmed for.

This is best demonstrated by an example. I recently coded up a toy agent that runs on the command line that had the following tools available:

List directory
Read file
Query file (via duckdb)
Read PDF

This agent was designed to have generic tools, and it was allowed to do multiple tool calls if needed. It wasn't given a specific goal.

The lack of specificity actually made the agent more useful! In the last few days, I have used it to do:

K8S cost optimisation — given a CSV containing some kubernetes utilisation data, this agent helped me find low hanging cost optimisation options
Data entry for my own personal finance needs — given some account statements, the agent was able to help me digitise them in a format I use for tracking my expenses
Do Q&A on my meeting notes

Interestingly, the agent showed a lot more agency than I was expecting. For example, it would often execute multiple tool calls in response to a general 'hello' message!

All the recent wow moments I have had with AI are usually with general purpose agents. They can end up giving you unexpectedly rich experiences.

Use Cases & Trade-offs

If you're building an AI enabled product, both special purpose agents and general purpose agents have their place.

You sometimes want determinism (or close to determinism) and repeatability.
- For example: if you're building a support desk, you may want to always categorise tickets in a certain way.
- In such scenarios, special purpose agents are really useful. You can treat them as "intelligence that's an API call away".
Special purpose agents are limited in scope.
General purpose agents are where the power of AI agents becomes apparent.
General purpose agents are going to be more expensive to build.
- You may need smarter LLMs, you may need to spend tokens on reasoning.
- General purpose agents could use up millions of tokens.
Shipping general purpose agents requires you to have the right underlying infrastructure.
- The agent is as powerful as the tools it has access to. You need to build the right primitives to unshackle the AI model.
- You need to build the right platform for the AI agents to leverage!
  - For example: if your system processes a lot of data, you may need to invest in robust parallel execution.
  - If the agent could 'ask any question' of your data, you have to think about indexing and database performance.
The ideal end state of a general purpose agent is a coding agent — an agent that can write code for a specific task and then execute it.
- This could lead to potential security issues, and you have to look out for other types of abuse.
- You might need to invest in primitives like sandboxing here.

I see this as a continuum — the more general an agent, the more powerful it is, but you need to invest proportionally into building the right safeguards.

Most use cases may not need a general purpose agent. You can probably start off building an AI-enabled product by just focusing on special purpose agents.

General purpose agents are also the most valuable ones though. And building general purpose agents will require you to invest proportionally into your platform's core primitives.

Peeking Inside the AI Mind

Vinay Shashidhar — Wed, 28 May 2025 00:00:00 GMT

We've all been amazed, and perhaps a little nervous too, if I may say so, by the capabilities of Large Language Models (LLMs). They can write poetry, generate code and answer complex questions. For a piece of code that is supposedly just doing next-token prediction, an LLM excels at a lot of impressive tasks. That raises the question: do they actually just predict the next token? Are they planning ahead of time? How do they actually do it? Is the reasoning in the “reasoning model” an afterthought, or did the LLM actually follow the sequence of steps?

This opacity isn't just a matter of academic curiosity. It has real-world implications for safety, reliability, and our ability to trust these powerful tools. If we don't understand why an LLM says what it says, how can we be sure that it's not biased, that it's not generating misinformation or, more importantly, that we can control what it says?

Anthropic has taken a significant step towards demystifying these complex systems. I would highly recommend that everyone at least read their blog. They have also published a couple of papers - Circuit Tracing: Revealing Computational Graphs in Language Models and On the Biology of a Large Language Model. I am personally very surprised that more people are not talking about it.

The Challenge: Understanding the Unseen

Imagine trying to understand how a human brain forms a thought - while we have some groundbreaking research in this area, we still have a lot to learn. Our brain is an incredibly complex network of neurons firing in intricate patterns. LLMs, while inspired by neural networks, have their own unique architecture, often involving billions of parameters. Identifying specific, human-understandable concepts within these vast networks has been a very big challenge.

The Inspiration: Looking to Life Itself

The researchers draw an analogy from the world of biology, specifically Gene Regulatory Networks (GRNs). To understand this better let us think about how a single fertilized egg develops into an incredibly complex organism with specialized cells, tissues, and organs. Genes don't act in isolation. They influence each other, turning other genes on or off in intricate cascades. A GRN is essentially a map of these genetic interactions, showing which genes regulate which others, ultimately leading to specific cellular functions and observable traits. Biologists use these networks to understand development, disease, and the fundamental workings of life.

The Methodology: Building an "AI Microscope"

To tackle the challenge of understanding LLMs, researchers are developing tools analogous to those used in biology to map complex systems. At the heart of this new methodology is the idea of charting out the internal computational pathways.

What are Attribution Graphs?

Taking a cue from concepts like GRNs, an Attribution Graph is, at its core, a map of influence within a Transformer model.

Nodes in this graph represent the key components of the model (e.g., specific attention heads in particular layers, or MLP layers).
Edges (the connections between nodes) represent the strength and direction of influence. An edge from component A to component B means A significantly contributes to B's activation or behavior.

By constructing these graphs, researchers can start to trace the pathways of information and causation through the network. They can see how an input (like a prompt) activates certain components, which in turn influence others, leading step-by-step to the model's output. It’s like finally getting a circuit diagram for a complex electronic device, allowing us to see how different parts work together to achieve a function.
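
To make the idea of tracing influence concrete, here is a toy sketch. It is a deliberate simplification and not the method from the papers: the node names and edge weights are invented, and scoring a path by the product of its edge weights is an assumption made for illustration.

```python
# Toy attribution graph: edge weights are made-up numbers, not measurements.
EDGES = {
    "prompt": {"feature_a": 0.9, "feature_b": 0.2},
    "feature_a": {"feature_c": 0.8},
    "feature_b": {"feature_c": 0.5},
    "feature_c": {"output": 0.7},
}


def strongest_path(graph, start, goal):
    """Return (score, path) for the path with the largest product of edge weights."""
    best = (0.0, [])

    def walk(node, score, path):
        nonlocal best
        if node == goal:
            if score > best[0]:
                best = (score, path)
            return
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in path:  # guard against cycles
                walk(nxt, score * weight, path + [nxt])

    walk(start, 1.0, [start])
    return best


score, path = strongest_path(EDGES, "prompt", "output")
```

In this toy graph the dominant route from the prompt to the output runs through `feature_a` and `feature_c`, which is the kind of step-by-step pathway the graphs are meant to surface.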

How Did They Do It?

The creation and analysis of these attribution graphs involve several sophisticated techniques. I have tried to explain them with an analogy, but this is still a very high-level overview. I highly recommend that readers go through the papers for more details.

Extracting Interpretable Features: A crucial first step is to identify meaningful concepts within the model. One method involves training sparse autoencoders on the internal activations of models like Claude Sonnet. These autoencoders learn to represent the LLM's internal states using only a few active, more interpretable "features" at a time. In simpler terms: imagine you have a giant paragraph (the LLM's internal activation). It's hard to quickly grasp its main points. A sparse autoencoder acts like a summarizer. It reads the giant paragraph. It identifies a few keywords (the "sparse features") that capture the essence of the paragraph. It then tries to rewrite the original paragraph using only those few keywords. If the rewrite is good, it means those few keywords are indeed very important. Researchers can then study these "keywords" (the sparse features) to understand what core concepts the LLM is using in its internal "thoughts." This helps demystify what's happening inside the LLM.
Cross-Layer Transcoders (CLTs) and Replacement Models: To make the complex model more tractable, researchers build an interpretable "replacement model." This often involves using Cross-Layer Transcoders (CLTs), which help extract features and understand their interactions across different layers of the model. Continuing our previous example: an LLM isn't just one paragraph, but a whole sequence of connected paragraphs, like chapters in a book. Each "paragraph" is the activation state of a different layer, building on the previous one. CLTs are like specialized literary analysts who study the connections between different paragraphs (layers). They don't just summarize one paragraph. Instead, they try to figure out: "How do the 'keywords' we found in Paragraph 1 (an early layer) get transformed or combined to produce the 'keywords' we see in Paragraph 5 (a later layer)?". Researchers then build an "interpretable replacement model" (think CliffsNotes or a detailed study guide for a specific section of the book), making a section of the complex "book" (the LLM) much easier to comprehend.
Feature Visualization and Labeling: Once features are extracted, they are visualized and labelled with human-interpretable concepts – from concrete objects to abstract ideas. This is like finding all the sentences (input examples) where this particular "keyword" is most strongly "lit up" or active within the LLM's "thoughts." By looking at many such examples, they might see that this keyword consistently appears when the LLM is discussing, say, "ancient stone castles," "dogs playing in a park," or even more abstractly, "a sense of injustice." Then, based on these observations and the common theme across all these examples, they "label" the keyword with a human-interpretable concept. So, an initially obscure numerical feature might be labeled as "Feature 123: 'Imagery of Medieval Architecture'" (a concrete object category) or "Feature 456: 'Concept of Betrayal'" (an abstract idea).
Linking Concepts into Circuits: Beyond being identified, features are also linked into computational "circuits." This allows for tracing the pathway from input words to output words, showing how different concepts interact and build upon each other. Think of this as going beyond just identifying the themes ("keywords") in our book. Researchers try to discover the specific "narrative pathways" within the LLM. For instance, they might find that when the "input words" (the prompt) introduce a character described with terms that activate the "keyword" for 'ambition', this consistently leads to the activation of the "keyword" for 'risk-taking behaviour' a few "paragraphs" (layers) later. This, in turn, might frequently interact with an activated "keyword" for 'external threat', ultimately contributing to the "keyword" for 'heroic sacrifice' being prominent when the LLM generates its "output words." By tracing these connections, they are essentially drawing a "circuit diagram" of the story, showing how one concept (like 'ambition') computationally triggers and combines with other concepts (like 'risk-taking' and 'external threat') to build a coherent narrative or logical conclusion, effectively mapping the journey from the initial premise to the final output.
Perturbation Experiments: Inspired by neuroscience, researchers conduct perturbation experiments. They modify the internal states of the model (e.g., activating or deactivating specific features) and observe the resulting changes in output. This helps validate the hypothesized role of different features and circuits. Continuing with our analogy:
1. "Deactivate" a specific feature: Imagine our LLM is about to write a scene, and the "keyword" for 'Concept of Betrayal' is naturally becoming active based on the story so far. The researcher might step in and say, "For this next passage, let's artificially suppress or mute the 'Betrayal' keyword. Don't let the LLM 'feel' or express betrayal right now."
2. "Activate" a specific feature: Conversely, they could say, "Even if the current story doesn't strongly suggest it, let's artificially boost the 'Hero's Courage' keyword here and see what happens."
It's like asking: "If we remove this thematic element (keyword) or alter this plot device (circuit), does the story change in the way we predicted?" This experimental approach allows researchers to confirm whether their understanding of how these internal concepts influence the LLM's final "narrative" (output) is accurate.
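
The perturbation idea can be sketched with a toy stand-in for a model. Everything here is hypothetical: the feature names, the weights and the two "output tokens" are invented for illustration, and a real experiment intervenes on features inside an actual network rather than on a lookup table.

```python
# Toy stand-in for a model: output logits are weighted sums of feature activations.
# Feature names and weights are invented for illustration.
WEIGHTS = {
    "betrayal": {"tragic_ending": 2.0, "happy_ending": -1.0},
    "courage": {"tragic_ending": -0.5, "happy_ending": 1.5},
}


def output_logits(activations: dict[str, float]) -> dict[str, float]:
    """Combine feature activations into a score for each possible output."""
    logits = {"tragic_ending": 0.0, "happy_ending": 0.0}
    for feature, value in activations.items():
        for token, weight in WEIGHTS[feature].items():
            logits[token] += value * weight
    return logits


def intervene(activations: dict[str, float], feature: str, value: float) -> dict[str, float]:
    """Return a copy of the activations with one feature clamped to `value`."""
    patched = dict(activations)
    patched[feature] = value
    return patched


baseline = {"betrayal": 1.0, "courage": 0.2}
before = output_logits(baseline)
# "Deactivate" betrayal, then separately "activate" courage, and compare outputs.
after_ablation = output_logits(intervene(baseline, "betrayal", 0.0))
after_boost = output_logits(intervene(baseline, "courage", 2.0))
```

At baseline the tragic ending scores higher. Muting the betrayal feature or boosting the courage feature flips the preference to the happy ending, which is the kind of predicted change a perturbation experiment checks for.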

Key Discoveries: Peeking into the AI Mind

Let us now dive into some of the fascinating insights highlighted in the papers:

Identifying and Eliciting Specific Concepts: Researchers found they could pinpoint and even activate specific concepts within the model. The most cited example is the "Golden Gate Bridge" feature: as the Anthropic team states, "When we activate a feature for the Golden Gate Bridge, our model writes about the Golden Gate Bridge." They've identified thousands of such features, ranging from other concrete entities like "the Eiffel Tower" to more abstract notions such as "gender." This demonstrates that the model internally represents these concepts in a way that can be isolated and triggered.
Multi-Step Reasoning and Sophisticated Planning: One very interesting insight was that LLMs aren't just reacting; they're planning.
- Forward Planning: A clear illustration of this is how Claude approaches creative writing. The Anthropic blog notes, "For example, when Claude writes poetry, it often plans out rhymes far in advance of using them." This indicates an ability to think several steps ahead. The blog provides another example: "when asked to complete the sentence 'The first U.S. president was...' it activates a feature for 'George Washington' before it writes his name."
- Backward Planning: The research also found evidence of models "working backward from goal states," a more complex form of planning where the model seems to start with a desired outcome and deduces the steps to get there.
Shared Conceptual Space Across Different "Languages": Demonstrating a surprisingly abstract level of understanding, the research found that models like Claude might develop a universal internal representation for similar concepts, even if they appear in very different domains. For instance, the Anthropic blog explains, "our model seems to use a shared conceptual space for English and Python code—even though these two 'languages' look very different on the surface." This suggests a deeper, underlying "language of thought." This abstract understanding isn't limited to natural language and code. We see similar shared representations for concepts that appear in, say, German and English, or in different programming languages like Python and Java.
Primitive "Metacognitive" Abilities: The models show early signs of understanding their own knowledge limitations. A common example cited by Anthropic is that "Claude’s default behavior is often to decline to speculate and only answer if it thinks it has sufficient information." This self-awareness, or primitive metacognition, is crucial for reliability.
The Double-Edged Sword: Fabrication and Coherence Exploits:
- Fabricating Plausible Arguments: The research uncovered that Claude also seems to make things up: "We’ve caught our model fabricating plausible-sounding arguments, even when it knows its conclusion is false," as stated in the Anthropic blog.
- Exploitable Coherence for "Jailbreaks": A model's helpful tendency can sometimes be its vulnerability. The Anthropic blog points out that "Claude’s usually-helpful tendency to try to maintain grammatical coherence across turns of conversation can be exploited by certain 'jailbreak' attacks, which trick it into harmful responses."

Why Does This Matter?

The implications of this research are far-reaching:

Enhanced Interpretability and Trust: We can finally start to see how an LLM arrives at an answer, moving away from the "black box" paradigm. This transparency is fundamental to building trust. A deeper understanding of the model's internal workings can help us build more robust, reliable, and trustworthy AI systems.
Effective Debugging: When an LLM makes a mistake or behaves unexpectedly, these techniques could help pinpoint the internal cause.
Steerability and Control: This research could lead to more precise control over LLM behavior, guiding them towards more helpful, truthful, and safe outputs. Understanding the internal mechanisms helps in assessing the suitability of language models for various applications and identifying their limitations.

The Journey Ahead

Anthropic is quick to point out that this is still early-stage research. The complexity of these models means there's a long road ahead. However, the techniques presented in "Tracing 'Thoughts' in Language Models" represent a significant leap forward. It’s a pioneering effort in the quest for AI transparency and safety.

Be More Ambitious

Ankit Solanki — Wed, 14 May 2025 00:00:00 GMT

Building an OCR / Document AI pipeline used to be hard work.

Training OCR models
Building multi-step systems (eg: line segmentation, layout detection, table detection)
Ensuring that these steps work nicely with each other
Adding heuristics and special cases
Doing a lot of testing to ensure your baseline is good enough

Now: you could just give a PDF to Gemini Flash and get a 'good enough' output in one API call. What used to take weeks / months can now be done in hours.

With AI models becoming more capable — you can often get very far on your problem statement if you just try. But if you are stuck in an old mindset and think about some problems as difficult or time consuming, you may not even attempt harder problems!

I don't think this realisation has sunk in yet. We still follow old patterns of behaviour, we still mostly try to build the same things as in the past.

Here's another recent example: I needed to parse something reliably. The 'right way' to do this is to write a tokeniser / parser, but this is time-consuming. In the past, I would have just depended on some basic regexes to get it working, and fixed edge cases over time as they came up.

This time though: I decided to do it in the 'right way'. I worked with an AI coding agent and got a tokeniser and recursive descent parser working in around an hour. The AI wasn't perfect: I definitely needed to know the theory and know when to give the right inputs. Even though this wasn't automatic, I ended up with at least a 10x productivity boost.
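
For readers who haven't met the term, here is a minimal sketch of what a tokeniser plus recursive descent parser looks like. It is not the parser from this story (that grammar isn't shown here); the arithmetic grammar and the names `tokenise`, `Parser` and `evaluate` are assumptions chosen purely for illustration.

```python
import re

# A number, or any single non-space character (operators and parentheses).
TOKEN_RE = re.compile(r"\s*(?:(\d+\.?\d*)|(.))")


def tokenise(text):
    """Split an arithmetic expression into ('num', value) and ('op', char) tokens."""
    tokens = []
    for number, op in TOKEN_RE.findall(text):
        if number:
            tokens.append(("num", float(number)))
        elif op.strip():  # skip stray whitespace
            tokens.append(("op", op))
    return tokens


class Parser:
    """Illustrative grammar, one method per rule:

    expr   := term (('+' | '-') term)*
    term   := factor (('*' | '/') factor)*
    factor := NUMBER | '(' expr ')' | '-' factor
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("eof", None)

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        value = self.term()
        while self.peek() in (("op", "+"), ("op", "-")):
            _, op = self.advance()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in (("op", "*"), ("op", "/")):
            _, op = self.advance()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        kind, val = self.advance()
        if kind == "num":
            return val
        if (kind, val) == ("op", "-"):
            return -self.factor()
        if (kind, val) == ("op", "("):
            value = self.expr()
            if self.advance() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return value
        raise SyntaxError(f"unexpected token: {val!r}")


def evaluate(text):
    """Tokenise, parse and evaluate an expression in one pass."""
    parser = Parser(tokenise(text))
    result = parser.expr()
    if parser.peek()[0] != "eof":
        raise SyntaxError("trailing input")
    return result
```

Each grammar rule becomes one method, which is why this style is pleasant to build with a coding agent: you can review it rule by rule. For example, `evaluate("1 + 2 * 3")` returns `7.0` because `term` binds tighter than `expr`.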

The hard lesson for me personally was:

I needed to let go of my pre-conceived notions that a task can take days / weeks.
I had to decide to build something ambitious.

I need to keep reminding myself of the fact that hard things are now easy, that the impossible may be actually possible.

I think this is a problem where younger folks have an edge over more experienced people: with experience comes caution, but now is the time to let go of caution and just build.

To everyone reading:

I suggest partnering with one of the current frontier models
- (Use Cursor, or Claude Code, or Windsurf, or v0, or Lovable, or anything else you want)
Try to build something ambitious today!

Beyond the Code: What

Vinay Shashidhar — Thu, 08 May 2025 00:00:00 GMT

We all have a pretty good idea of what makes for "Software Engineering Excellence". It’s a playbook refined over decades: write clean, maintainable code, test everything early and often, ship reliably, keep an eye on things in production, and then do it all over again. It’s a solid foundation.

ML excellence is software engineering excellence, plus data excellence, plus learning-system excellence. It inherits all that great software engineering DNA, but it also has to wrestle with a whole new set of challenges:

messy, ever-changing data,
models that slowly forget what they’ve learned (we call it "drift"),
the inherent fuzziness of statistics,
and a research landscape that moves at lightning speed.

The Cornerstones of Doing ML Engineering Right

Staying Ahead: Proactive Maintenance & Quality
- Don't let "technical debt" pile up – whether it's in your data pipelines, notebooks, or the infrastructure itself. Tackle it early.
- Models and even the prompts we use for them can "drift" over time, becoming less accurate. We need to watch for this and fix it.
- Garbage in, garbage out. Ensuring data is high quality, both where it comes from and as it flows through our systems, is paramount.
Building Smart: Systematic Execution & Automation
- If it's repetitive and boring, automate it! Think data validation checks, generating new features, or trying out countless model settings (hyper-parameter sweeps).
- Write everything down. Document your code, what version of a dataset you used, where your features came from, and what happened in each experiment.
- Keep an eye on everything: how fast your models respond (latency), how accurate they are and how much they cost to run. Set up alerts so the right people know when something’s off.
Always Learning: Continuous Improvement & Innovation
- The ML world changes fast. Keep up with new algorithms, tools, and even hardware that can make your systems better.
- Don't just build it and forget it. Constantly look for ways to improve your metrics, make things cheaper to run, or get answers faster.
Defining Success: Clarity & Precision
- Before you even start, figure out what "good" looks like. This means clear business goals (KPIs) and the technical model metrics that support them.
- Be upfront about what your model can't do. Point out its limitations, where biases might creep in, and any weird edge cases – ideally, before your users stumble upon them.

Where ML Feels Familiar to Software Engineers

A lot of this will sound familiar if you're from a software background:

Quality is King: We still obsess over tests, code reviews, and smooth CI/CD pipelines.
Automate Everything: Reproducible builds and one-click deployments.
Know What's Happening: Logs, metrics, alerts, and service level objectives (SLOs) are crucial.
Small Steps, Big Progress: We work in small batches, stay agile, and use things like feature flags.
If It's Not Written Down, It Didn't Happen: READMEs, decision records and design docs are essential.

But Here's Where ML Charts Its Own Course

While ML builds on software engineering, it introduces unique twists:

Dimension	Classical Software	ML Engineering
Source of Truth	Deterministic code	Data + algorithms that learn (stochastic)
How Things Break	Bugs, system outages	Bugs, outages, plus data changing unexpectedly, concepts drifting, and bias
Testing Focus	Unit/integration tests for code logic	Data validation, statistical tests, comparing offline vs. online (A/B), shadow deployments
What You Deploy	A binary or container	Model weights, feature store setup, training code, a snapshot of the training data
Lifespan	Code often stays stable once shipped	Model performance naturally degrades; retraining is a normal part of life
The Team You Need	Software Engineers + DevOps	Software Engineers + ML Engineers + Data Scientists + Analysts + Domain Experts
Fixing Mistakes	Rollbacks to a previous version	Rollbacks, plus quick data filters, staged feature roll-outs, model "gates"
The Toolbox	Git, CI, CD	Git and tools like MLflow, Feature Stores, Experiment Trackers

Let's understand a few of those key differences:

Data is like a whole other codebase you didn't write. Every time your dataset updates, it's as if code changes. This adds a huge layer of complexity.
"It works" isn't always a yes/no answer. Because ML deals with probabilities, you need to think in terms of statistical confidence.
Models get old. Unlike a piece of software that might run unchanged for years, models age. You're signing up for a long-term relationship: retraining them, possibly re-labelling data, and constantly re-evaluating their performance.
The stakes are higher when decisions are automated by machines. Ethical considerations like fairness and avoiding bias, along with privacy and regulatory compliance, become central, not just afterthoughts.
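
To make "statistical confidence" a bit more concrete, here is a small sketch that turns an observed accuracy into a range rather than a single number. It uses the Wilson score interval, which is one common choice and not something this post prescribes; the function name and the sample counts are assumptions for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a success proportion."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


# Same observed accuracy (90%), very different certainty.
small_sample = wilson_interval(90, 100)
large_sample = wilson_interval(900, 1000)
```

Both samples "work" 90% of the time, but the interval from 1,000 examples is much narrower than the one from 100, which is the kind of difference a plain pass/fail view hides.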

To excel in ML engineering, you need to bring strong software skills, manage data carefully, and guide AI learning effectively. It’s a challenging, rapidly evolving field, but getting it right means building powerful, reliable, and responsible AI.

AI Agents could be true

Ankit Solanki — Tue, 29 Apr 2025 00:00:00 GMT

In HTTP, browsers are called ‘user agents’ — they were meant to act on behalf of users. This initial positioning has sort-of carried forward even now:

Browsers are one of the few pieces of software left that allow plugins & addons, that allow customisability.
Browsers let users override choices. You can set your own min font size, overriding any site’s CSS. You can use reader modes.
Browsers let users inspect applications via devtools, script applications by allowing users to run code on a developer console.

Browsers are one of the last vestiges of an old-school way of building software: software that is extensible, that users can control and change to better suit their needs.

Most current software we use today is the opposite.

Most software (whether SaaS, mobile apps or desktop apps) is not programmable.
Decisions are taken by product builders, not users. If a product owner hasn’t added a feature that you want, you can’t go and add it yourself!
Most software doesn’t empower users, it enforces rigid workflows.
We have taken away most customisability. Each option in a SaaS product is expensive to support, and it’s rational as product builders to reduce surface area and remove options that 99.9% of your users don’t use.

All of this happens because today’s software users aren’t programmers: they don’t know the internals of how software and operating systems work. Most software is built to be used by the mass market and as a result it encodes a set of assumptions about how it's meant to be used.

AI agents could reverse this trend.

I can see a world where users have personal AI agents running on their behalf. It wouldn’t matter if a product is scriptable or not, or if it has an API or not — users could just ask the agent to automate a task on their behalf! The agent may use the API, or it may just resort to spinning up a browser and clicking on pixels.

What happens when most users are able to finally take control of the tools that they use daily? I think it's finally time for general purpose compute to be actually accessible by the masses.

This fills me with optimism; similar to the heady days of Web 2.0: when everyone wasn’t so jaded about empowering users, exposing APIs was the norm, mashups were possible and products like Yahoo Pipes were built.

This is a perspective to keep in mind as we build AI-first products — let's enable our users to do great things!

A Beginner

Vinay Shashidhar — Tue, 15 Apr 2025 00:00:00 GMT

A large part of our daily work involves regularly interacting with AI. Whether it is someone writing code, creating documents, generating images or simply drafting emails, AI has become an integral part of our lives. And if recent trends are any indication, its influence is going to grow even more. So it makes sense to step back and reflect on whether we are unlocking its full potential.

We have all faced instances where we ask for something simple and get a confusing, generic, or just plain wrong answer. The good news is, it's often not the AI's fault; it's how we're asking. Welcome to the world of Prompt Engineering: the art and science of crafting instructions that help AI understand exactly what you want. Think of it like giving directions. Vague directions lead to getting lost. Clear, specific directions get you to your destination efficiently.

We will be referring to the concepts mentioned in the article here. For additional insights, explore the original source materials on this topic.

What is Prompt Engineering Anyway?

Think of AI like a smart prediction machine. It guesses the next word based on what you give it. A "prompt" is simply your instruction. Prompt engineering is the skill of writing good instructions to get the best possible results from the AI. At its core, it's about communicating clearly with AI systems. Well-crafted prompts can save time and produce more useful results.

What Actually Goes Into a Prompt?

A prompt can have several parts working together.

The Instruction/Task: What do you want the AI to do? (e.g., "Summarize," "Translate," "Write," "Explain," "Generate code"). This is the core action.
Input Data: The specific text, data, or topic the AI needs to work on (e.g., the article to summarize, the sentence to translate).
Output Indicator/Format: How do you want the answer presented? (e.g., "in bullet points," "in a friendly tone," "limit to 100 words").
Context Setting and Specificity: Giving the AI background helps it understand the 'why' and 'who' behind your request. For example: Instead of saying "Write an email about the project update," try giving more specifics: "Write a short, professional email to the marketing team summarizing the key results from the Q3 social media campaign project update. Mention the 15% increase in engagement."

You don't always need all parts, but combining them thoughtfully makes your prompts much more powerful.
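
If you build prompts in code, the parts above map naturally onto a small helper. This is only a sketch: the function name `build_prompt`, the section labels and the ordering are assumptions made for illustration, not a standard.

```python
def build_prompt(instruction, input_data="", output_format="", context=""):
    """Assemble a prompt from the four parts described above; empty parts are skipped."""
    sections = [
        ("Context", context),
        ("Task", instruction),
        ("Input", input_data),
        ("Output format", output_format),
    ]
    return "\n\n".join(
        f"{label}: {text.strip()}" for label, text in sections if text.strip()
    )


prompt = build_prompt(
    instruction="Write a short, professional email summarizing the key results.",
    context="The audience is the marketing team; the project is the Q3 social media campaign.",
    input_data="Engagement increased by 15%.",
    output_format="Under 100 words, friendly but professional tone.",
)
```

Keeping the parts separate makes it easy to change one (say, the output format) while leaving the rest of the prompt untouched.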

Setting the Rules: Using Constraints Effectively

Constraints help focus the AI's response in specific directions, similar to how parameters narrow down search results:

Length: "Summarize this in 50 words"
Format: "List the ideas as bullet points"
Tone/Style: "Explain this in a formal tone"
Negative Constraints: "Don't use technical jargon," "Avoid mentioning price"
Real-world example:
- Less effective: "Suggest team-building ideas."
- More effective: "Suggest 3 low-cost team-building activity ideas for a remote team of 10 software engineers. Focus on activities that encourage collaboration and take about 1 hour. Present them as a numbered list with a brief description for each. Avoid virtual escape rooms."

The Power of Iterative Refinement

Prompt engineering is often a process of trial, error, and refinement – think of it as a conversation rather than a one-time command:

Start Simple: Write your initial prompt based on your goal.
Analyze the Output: Did the AI understand? Was the result accurate? Was the format correct?
Identify Gaps: What was missing or wrong? Was the prompt too vague? Did it lack context? Did you forget a constraint?
Refine the Prompt: Adjust your prompt based on your analysis. Add specificity, context, constraints, or examples.
Repeat: Try the new prompt and continue refining until you get the desired result.

Prompting Frameworks

For those of us who want to delve deeper into formal prompting techniques, here are several powerful approaches with practical examples:

Basic Prompt (Zero-Shot) – This is the simplest approach with no examples:
"Summarize this meeting transcript and list the action items"
Adding Specificity and Constraints – Adding details about what you want:
"Summarize the key decisions made in the following meeting transcript in 3-4 bullet points. Then, list all action items mentioned using the format: "- [Action Item]: [Owner] - Due: [Deadline]""
Role Prompting – Asking the AI to adopt a specific perspective:
"Act as an efficient Project Manager reviewing a meeting transcript. Your goal is to quickly understand outcomes and track tasks. First, summarize the key decisions made in the following meeting transcript in 3-4 concise bullet points. Second, rigorously extract all action items assigned. List them using the format: "- [Action Item]: [Owner] - Due: [Deadline]". If an owner or deadline isn't explicitly mentioned, use 'TBD'."
Chain of Thought – Guiding the AI through a step-by-step reasoning process:
"Act as an efficient Project Manager reviewing a meeting transcript. Your goal is to quickly understand outcomes and track tasks. Before generating the final summary and action list, follow these steps:
1. Read through the transcript to identify the main topics discussed.
2. Pinpoint the key decisions reached for each topic.
3. Scan the transcript specifically for commitments or tasks assigned (action items). Note down the task, who is assigned (Owner), and any mentioned deadline.
4. Consolidate the key decisions into a 3-4 bullet point summary.
5. Format the extracted action items clearly using: "- [Action Item]: [Owner] - Due: [Deadline]" (Use 'TBD' if owner/deadline is unclear).
Now, provide the final output based on this process."
Reflections – Adding a layer of self-evaluation to the AI's process:
"After extracting the action items using the CoT steps, re-read the summary of key decisions. Does each action item clearly map to a decision or discussion point? If not, double-check the transcript for that action item's context. Also, generate two versions of the action item list based on slightly different interpretations of the transcript, and present the most likely version based on the discussion flow."

Wrapping Up: The Art of Clear Communication

Prompt engineering is fundamentally about clear communication with AI systems. Specific instructions, relevant context, occasional examples, and format guidelines typically lead to better results.
Many users find an iterative approach works well: starting with simpler prompts and gradually refining them based on results.

Helpful Shortcut: Consider asking the AI itself to help craft effective prompts. For example, you might type "What would be a good prompt to generate a marketing email about our new product launch?" The AI can often suggest well-structured prompts tailored to your specific needs.

Evals are your IP

Satwik Gokina — Fri, 11 Apr 2025 00:00:00 GMT

"Evals are your IP" — Alexander Bricken, Anthropic

"We will achieve every evaluation we can state." — Zhengdong Wang, Deepmind

“The boring yet crucial secret behind good system prompts is test-driven development. You don't write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.” — Amanda Askell, Anthropic

The last quote is by the person who writes Claude’s system prompts. Why are state-of-the-art researchers highlighting the age-old concept of test-driven development (TDD)? Haven’t we moved on to discussing more trendy things like Agents and AGI?

The truth is test-driven development is more important to AI systems than traditional software systems. TDD is now called Evaluations (or Evals) when applied to AI systems—this isn't merely rebranding for hype. Evals are subtly different from tests.

What are Evals?

Any assessment of the performance of an AI system is called an Eval.

Input	Expected output / behavior
1 + 1	2
How to compare the old vs new regime?	Answer should include this link: [link to the old vs new regime widget]
Hi	Any warm greeting of the user by name
What is 80C?	Should have the same information as this gold standard answer: “80C is an income tax section …”
My PAN number is AAPA0001	The system should use the `update_context` tool with argument ‘AAPA0001’

Only the first two resemble traditional unit tests. The next two evals require subjective interpretation. The final Eval addresses the system's behavior rather than its output.

How do we eval the Evals?

The ways to execute evals can vary significantly.

The simplest form involves basic operations like comparisons (==, !=, <, >), regex operations, or even inspecting and asserting function calls. These resemble traditional unit tests.
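
As a rough sketch, these deterministic checks can be plain functions that return True or False. The function names, the example URL and the `(name, args)` shape for recorded tool calls are all assumptions for illustration; your harness will have its own way of exposing tool calls.

```python
import re


def check_exact(output, expected):
    """Pass if the output matches the expected string exactly (ignoring outer whitespace)."""
    return output.strip() == expected


def check_contains(output, required):
    """Pass if a required substring, such as a link, appears in the output."""
    return required in output


def check_regex(output, pattern):
    """Pass if the regex pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def check_tool_call(calls, tool_name, expected_args):
    """Pass if any recorded call used the given tool with the given arguments.

    `calls` is a list of (name, args) tuples; this shape is hypothetical.
    """
    return any(name == tool_name and args == expected_args for name, args in calls)


# Hypothetical outputs, loosely mirroring rows of the table above.
arithmetic_ok = check_exact(" 2 ", "2")
link_ok = check_contains("See https://example.com/regime-widget", "https://example.com/regime-widget")
tool_ok = check_tool_call([("update_context", ["AAPA0001"])], "update_context", ["AAPA0001"])
```

Checks like these are cheap enough to run on every change, which is what makes them a good first layer.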

Another source of evals is customers—has the customer marked the conversation as helpful, given a thumbs-up, thumbs-down, etc.?

The new approach, called LLM as a judge, leverages LLMs for evals requiring subjective interpretation.

An LLM can judge whether the greeting is “warm and friendly.”
An LLM can compare the provided answer to a gold-standard answer and rate them on completeness, tone, verbosity, etc.
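
A minimal sketch of the judge pattern, assuming you pass in your own function for calling a model. `ask_llm`, the prompt template and the PASS/FAIL convention are illustrative choices, and `fake_llm` is only a stand-in so the example runs without any API.

```python
from typing import Callable

JUDGE_TEMPLATE = (
    "You are grading an AI assistant's reply.\n"
    "Criterion: {criterion}\n"
    "Reply: {reply}\n"
    "Answer with exactly one word: PASS or FAIL."
)


def judge_reply(reply: str, criterion: str, ask_llm: Callable[[str], str]) -> bool:
    """Ask a judge model for a verdict and return True if it starts with PASS."""
    verdict = ask_llm(JUDGE_TEMPLATE.format(criterion=criterion, reply=reply))
    return verdict.strip().upper().startswith("PASS")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: passes any reply containing 'welcome'."""
    reply_line = next(line for line in prompt.splitlines() if line.startswith("Reply: "))
    return "PASS" if "welcome" in reply_line.lower() else "FAIL"


warm = judge_reply("Hi Asha, welcome back!", "The greeting is warm and friendly.", fake_llm)
cold = judge_reply("State your query.", "The greeting is warm and friendly.", fake_llm)
```

Forcing the judge into a one-word verdict keeps the result easy to parse; a real setup might ask for a score per dimension (completeness, tone, verbosity) instead.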

Evaluating probabilistic systems

Traditional unit tests are pass or fail. We do not deploy code if even one test fails. However, AI system outputs are probabilistic, and some of the evaluation methods described above (such as LLM as a judge) are themselves non-deterministic.

Instead, we run the eval multiple times and track the success rate as a percentage. Ideally, we should run it 30 times for statistical significance, though 10 is also acceptable.

Summary: Traditional unit tests—Pass or Fail; AI systems with Evals—% Success.
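
In code, the shift from pass/fail to a success percentage is small. Here is a sketch in which `run_eval` is any zero-argument function that runs the system once and returns True or False; `flaky_eval` is a made-up stand-in that passes roughly 90% of the time.

```python
import random


def success_rate(run_eval, runs=30):
    """Run a pass/fail eval `runs` times and return the percentage that passed."""
    passes = sum(1 for _ in range(runs) if run_eval())
    return 100.0 * passes / runs


# Hypothetical flaky eval, seeded so repeated experiments are reproducible.
rng = random.Random(0)


def flaky_eval():
    return rng.random() < 0.9


rate = success_rate(flaky_eval, runs=30)
```

You would then track `rate` per eval over time, and decide on a threshold below which a change is not shipped.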

How good should these evals be?

A common reaction to seeing initial evals is that they are simple and don't capture the system's complexity.

However, evals only need to be meaningful, quick, and inexpensive. It’s encouraged to write many evals testing diverse aspects.

Diverse evals also give overlapping coverage: a break in core functionality will likely show up as failures in several narrower evals too. This fundamental principle of unit testing applies here as well.

Why?

Why emphasize evals so much? Here is what we gain:

Testing new models

Tomorrow, if a new model called Gehri Sooch™ launches, all we need to do is update the LLM API to the new model and execute the eval report.

We benefited from this last consumer season. Because we had evals in place, we confidently switched from GPT-4o to GPT-4o-mini in the season's last week, reducing costs by 4x.
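
Here is a sketch of what "swap the model and re-run the report" can look like when each model is just a function from prompt to output. The two stub models and the eval cases are invented for illustration; in practice each function would wrap a call to the relevant LLM API.

```python
def eval_report(model, cases, runs=10):
    """Return {case name: % of runs passed} for one model callable."""
    report = {}
    for name, prompt, check in cases:
        passes = sum(1 for _ in range(runs) if check(model(prompt)))
        report[name] = 100.0 * passes / runs
    return report


def compare_models(models, cases, runs=10):
    """Run the same eval cases against every model and collect the reports."""
    return {model_name: eval_report(fn, cases, runs) for model_name, fn in models.items()}


# Hypothetical stand-ins for two models.
def current_model(prompt):
    return "2" if prompt == "1 + 1" else "Hello there!"


def candidate_model(prompt):
    return "2" if prompt == "1 + 1" else ""


CASES = [
    ("arithmetic", "1 + 1", lambda out: out.strip() == "2"),
    ("greeting", "Hi", lambda out: out.lower().startswith("hello")),
]

reports = compare_models({"current": current_model, "candidate": candidate_model}, CASES)
```

Because the cases don't know which model they are testing, adding a new model is a one-line change, and the side-by-side report shows where a cheaper model regresses.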

Faster iteration

Evals give us confidence to iterate rapidly. Major reworks and integration of new capabilities become manageable as long as your evals are trustworthy. Faster iterations drive quicker product success.

Continuous learning

Whenever a failure occurs in an AI system, translate that failure state into a set of evals. This ensures constant tracking.

Evals as your IP

Imagine a project running for a year—you likely have hundreds or thousands of evals by now. All failure states have been identified and translated into evals.

In the AI era, code is cheap—a tax Q&A bot can be built by an engineer in just an hour. What is valuable is understanding when and how AI fails. Those evals inform your system design.

In the future, imagine an LLM agent iteratively rewriting itself until the eval score surpasses a defined threshold. Thus, your system can be derived solely from evals. This is why evals are your IP.