<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Clear AI Blog</title>
    <link>https://cleartax.in/ai/</link>
    <description>Explore the latest insights and articles about AI from Clear.</description>
    <language>en</language>
    <lastBuildDate>Thu, 11 Dec 2025 12:59:10 GMT</lastBuildDate>
    <atom:link href="https://cleartax.in/ai/rss.xml" rel="self" type="application/rss+xml"/>
    
    <item>
      <title>MCP Resources Are Underrated</title>
      <link>https://cleartax.in/ai/posts/mcp-resources-are-underrated/</link>
      <guid isPermaLink="true">https://cleartax.in/ai/posts/mcp-resources-are-underrated/</guid>
      <pubDate>Thu, 11 Dec 2025 00:00:00 GMT</pubDate>
      <author>Ankit Solanki</author>
      <description><![CDATA[If you]]></description>
      <content:encoded><![CDATA[
<article class="prose dark:prose-invert"><p>MCP is here to stay. It&#x27;s become a Schelling point for the community,
and I feel it will continue to gain traction. Just as I was writing
this, I noticed that Anthropic <a href="https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation" target="_blank" rel="noopener">has donated stewardship</a> of the MCP
protocol to the Agentic AI Foundation.</p>
<p>We&#x27;ve been <a href="https://cleartax.in/ai/posts/mcp-integration-patterns" target="_blank" rel="noopener" title="Read about how we&#x27;re thinking about MCP integrations">thinking deeply</a> about our MCP integration story, and I
have recently started to build strong conviction that <strong>MCP Resources
are criminally underrated</strong>. To understand why resources are useful,
let&#x27;s first understand the problem with MCP tools.</p>
<h2 id="mcp-tools-have-a-context-problem">MCP Tools have a context problem</h2>
<p>A typical MCP integration works like this:</p>
<ul>
<li>The agent&#x27;s harness exposes MCP tools to the LLM provider</li>
<li>In response to a completion API call, the LLM may invoke an MCP tool</li>
<li>This tool&#x27;s response is given back to the LLM</li>
</ul>
<p>An MCP server could expose <em>any</em> tool. For example, a Google Drive MCP
server could expose files on a connected Google Drive, and these files
could vary wildly in size.</p>
<p>What happens if the tool response is huge?</p>
<p><img alt="Large tool responses are a problem" loading="lazy" width="0" height="0" decoding="async" data-nimg="1" style="color:transparent;width:100%;height:auto" src="https://assets1.cleartax-cdn.com/content-prod/images/mcp-tools-context-issue.png"/></p>
<p>Naive solution: send the whole response to the LLM? This is expensive:
maybe the tokens consumed here could be better used elsewhere? Maybe the
response is larger than the LLM supports and your agent crashes?</p>
<p><img alt="Truncating MCP tool responses" loading="lazy" width="0" height="0" decoding="async" data-nimg="1" style="color:transparent;width:100%;height:auto" src="https://assets1.cleartax-cdn.com/content-prod/images/mcp-tools-truncation.png"/></p>
<p>Another option is truncation. Either the MCP server itself could
truncate large results, or the agent harness could truncate
results over a certain size.</p>
<p>This gives you safety: MCP tool calls cannot consume unlimited tokens.
But this comes at the cost of context loss: who decides what to
truncate? What if the most critical information needed for the agent was
truncated?</p>
<h2 id="mcp-resources-are-a-solution">MCP Resources are a solution</h2>
<p>The MCP protocol has support for <a href="https://modelcontextprotocol.io/specification/2025-11-25/server/resources#resource-templates" target="_blank" rel="noopener">resource templates</a>:</p>
<blockquote>
<p>Resource templates allow servers to expose parameterized resources
using URI templates.</p>
</blockquote>
<p>Resources are exposed as URIs. Critically: a tool response could contain
resource links, and the agent harness can then access the <em>entire</em>
resource.</p>
<p><img alt="Tools returning MCP resource URIs" loading="lazy" width="0" height="0" decoding="async" data-nimg="1" style="color:transparent;width:100%;height:auto" src="https://assets1.cleartax-cdn.com/content-prod/images/mcp-use-resources-2.png"/></p>
<p>This means:</p>
<ul>
<li>Large responses from tools can be converted to resources</li>
<li>Agents and agent harnesses can inspect resources directly</li>
<li>Depending on your system design, you could <em>compose</em> these together
with other LLM tools and get to interesting emergent behaviour.</li>
</ul>
<p>Here&#x27;s an example:</p>
<pre><code class="language-js">// Request
{
  &quot;method&quot;: &quot;tools/call&quot;,
  &quot;params&quot;: {
    &quot;name&quot;: &quot;read_google_drive_file&quot;,
    &quot;arguments&quot;: {
      &quot;id&quot;: &quot;c85b4851-bd61-4352-84a2-6e0c6b6d3dce&quot;
    }
  }
}

// Response without Resources
{
  &quot;content&quot;: [
    {
      &quot;type&quot;: &quot;text&quot;,
      &quot;text&quot;: &quot;...... huge text&quot;
    }
  ]
}

// Response with Resources
{
  &quot;content&quot;: [
    {
      &quot;type&quot;: &quot;text&quot;,
      &quot;text&quot;: &quot;...... small text snippet&quot;
    },
    {
      &quot;type&quot;: &quot;resource_link&quot;,
      &quot;uri&quot;: &quot;gdrive://file-id-123&quot;
    }
  ]
}
</code></pre>
<p>This can enable composition:</p>
<ul>
<li>The agent&#x27;s harness could read the resource and save the file to disk</li>
<li>The agent could then use other <em>local</em> tools (eg: <code>read_file</code>, <code>grep</code>,
<code>execute_code</code>, etc) to further process this file.</li>
</ul>
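<p>As a rough sketch of that composition, a harness could post-process each tool result along these lines. The helper name <code>materialiseToolResult</code>, the <code>readResource</code> callback (standing in for the client-side <code>resources/read</code> call) and the file naming scheme are all assumptions made for illustration, not part of the MCP specification or of any particular SDK.</p>
<pre><code class="language-js">// Hypothetical harness-side helper: names and shapes here are assumptions.
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Turn one MCP tool result into a compact text for the LLM.
// `readResource(uri)` stands in for the client's `resources/read` call
// and is expected to resolve to the full resource text.
async function materialiseToolResult(result, readResource, dir) {
  const parts = [];
  for (const block of result.content) {
    if (block.type === "text") {
      parts.push(block.text);
    } else if (block.type === "resource_link") {
      const text = await readResource(block.uri);
      // Derive a filesystem-safe file name from the URI.
      const file = path.join(dir, encodeURIComponent(block.uri));
      fs.writeFileSync(file, text, "utf8");
      // The LLM only sees a short pointer, not the full contents.
      parts.push("[resource " + block.uri + " saved to " + file + ", " + text.length + " chars]");
    }
  }
  return parts.join("\n");
}

// Usage with a stubbed reader, writing into a fresh temp directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "mcp-resources-"));
const toolResult = {
  content: [
    { type: "text", text: "...... small text snippet" },
    { type: "resource_link", uri: "gdrive://file-id-123" },
  ],
};
materialiseToolResult(
  toolResult,
  async function (uri) { return "full contents of " + uri; },
  dir
).then(console.log);
</code></pre>
<p>The LLM receives only the snippet plus a pointer to the saved file, so local tools such as <code>grep</code> can work on the full contents without spending context tokens on them.</p>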
<h2 id="mcp-resource-links">MCP Resource Links</h2>
<p>The MCP protocol in fact has a first-party way of doing this: <a href="https://modelcontextprotocol.io/specification/draft/server/tools#resource-links" target="_blank" rel="noopener">resource
links</a>:</p>
<blockquote>
<p>A tool MAY return links to Resources, to provide additional context or
data. In this case, the tool will return a URI that can be subscribed
to or fetched by the client.</p>
</blockquote>
<p>Most MCP servers don&#x27;t expose resources though. This part of the
protocol seems really unexplored and under-appreciated.</p>
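<p>For reference, fetching a linked resource is a separate <code>resources/read</code> call made by the client. The exchange below is a simplified sketch in the same abbreviated style as the earlier example: it reuses the illustrative <code>gdrive://file-id-123</code> URI, and the <code>mimeType</code> value is an assumption.</p>
<pre><code class="language-js">// Request
{
  "method": "resources/read",
  "params": {
    "uri": "gdrive://file-id-123"
  }
}

// Response
{
  "contents": [
    {
      "uri": "gdrive://file-id-123",
      "mimeType": "text/plain",
      "text": "...... full file contents"
    }
  ]
}
</code></pre>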
<h2 id="addendum-code-execution">Addendum: Code Execution</h2>
<p>Anthropic recently wrote about a different pattern: <a href="https://www.anthropic.com/engineering/code-execution-with-mcp" target="_blank" rel="noopener">using code
execution to interact with MCP servers</a>. This is a really powerful
pattern, though I feel it&#x27;s something that <em>can</em> be combined with MCP
Resources.</p>
<ul>
<li>MCP servers are fast becoming a glue layer</li>
<li>Any given MCP server doesn&#x27;t know which type of client would connect
to it.</li>
<li>The server doesn&#x27;t know if the client is using code execution or just
LLM tool calls.</li>
<li>Thus, well-behaved servers need to be defensive about context usage
in tool calls, and large responses will end up being truncated
somehow.</li>
<li>Resources are the <em>officially blessed</em> way to expose data to clients.</li>
</ul>
<p>Code execution doesn&#x27;t remove the need for MCP resources. It makes MCP
resources even more useful!</p>
<hr/>
<p>We&#x27;re adopting this MCP resource-link pattern heavily in servers we
control. Our agents will support this pattern as well. However,
MCP server authors more broadly need to start adopting this pattern for it to gain
traction.</p>
<p>We&#x27;re hoping that this post can spark a discussion about this.</p></article>
      ]]></content:encoded>
    </item>
    <item>
      <title>MCP Integration Patterns</title>
      <link>https://cleartax.in/ai/posts/mcp-integration-patterns/</link>
      <guid isPermaLink="true">https://cleartax.in/ai/posts/mcp-integration-patterns/</guid>
      <pubDate>Fri, 26 Sep 2025 00:00:00 GMT</pubDate>
      <author>Ankit Solanki</author>
      <description><![CDATA[How would you actually integrate MCP into your AI agent?]]></description>
      <content:encoded><![CDATA[
<article class="prose dark:prose-invert"><p>We&#x27;re building an agent system and I have been evaluating adding support for MCP.</p>
<p>We <em>really</em> care about our tool design to an unreasonable degree. I have personally spent days thinking about the right design for some of our tools.</p>
<p>I couldn&#x27;t help but over-think exactly how MCP should be exposed to our agents. At a high level, I felt that there could be two integration patterns — I would name them the <em>&#x27;meta tool pattern&#x27;</em> and <em>&#x27;materialised tools pattern&#x27;</em>.</p>
<h2 id="meta-tool-pattern">Meta Tool Pattern</h2>
<p><img alt="Meta Tool Pattern" loading="lazy" width="0" height="0" decoding="async" data-nimg="1" style="color:transparent;width:100%;height:auto" src="https://assets1.cleartax-cdn.com/content-prod/images/mcp-meta-tool-pattern.png"/></p>
<p>The meta tool pattern: the agent sees a few meta tools like &#x27;list available MCP servers&#x27;, &#x27;describe MCP server&#x27;, &#x27;invoke MCP tool&#x27;.</p>
<p>The agent can decide to explore the capabilities of an MCP server and invoke its tools, when necessary. This design is flexible and efficient, but also requires a more capable agent:</p>
<ul>
<li>This pattern uses less context by default. Tool definitions are lazily loaded only when required.</li>
<li>The &#x27;discovery&#x27; calls for individual MCP servers need to happen only once in a given chat thread.</li>
<li>There&#x27;s no guarantee that the agent will decide to explore the installed MCP servers though.</li>
</ul>
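<p>A minimal sketch of the meta tool pattern could look like the following. The in-memory <code>servers</code> registry and the three tool names are hypothetical stand-ins; a real harness would forward these calls to live MCP client sessions.</p>
<pre><code class="language-js">// Hypothetical in-memory registry; a real harness would hold MCP client sessions.
const servers = {
  drive: {
    tools: {
      read_google_drive_file: {
        description: "Read a file by id",
        run(args) { return "contents of " + args.id; },
      },
    },
  },
};

// The only three tools the LLM sees up front.
const metaTools = {
  // Cheap call: just the names of the installed servers.
  list_mcp_servers() {
    return Object.keys(servers);
  },
  // Lazily loads tool definitions for a single server.
  describe_mcp_server(args) {
    const tools = servers[args.server].tools;
    return Object.keys(tools).map(function (name) {
      return { name: name, description: tools[name].description };
    });
  },
  // Dispatches to the underlying MCP tool.
  invoke_mcp_tool(args) {
    return servers[args.server].tools[args.tool].run(args.arguments);
  },
};

console.log(metaTools.list_mcp_servers());
console.log(metaTools.describe_mcp_server({ server: "drive" }));
console.log(metaTools.invoke_mcp_tool({
  server: "drive",
  tool: "read_google_drive_file",
  arguments: { id: "abc" },
}));
</code></pre>
<p>Tool definitions only enter the context when the agent asks for them through <code>describe_mcp_server</code>, which is where the context savings come from.</p>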
<h2 id="materialised-tools-pattern">Materialised Tools Pattern</h2>
<p><img alt="Materialised Tools Pattern" loading="lazy" width="0" height="0" decoding="async" data-nimg="1" style="color:transparent;width:100%;height:auto" src="https://assets1.cleartax-cdn.com/content-prod/images/mcp-materialised-tools-pattern.png"/></p>
<p>Here, whenever an MCP server is enabled, the system automatically discovers all the tools available on that server and eagerly copies them to the list of tools available to the agent. The agent sees all of these tools by default. Calling an MCP tool is just like calling any other tool.</p>
<ul>
<li>This pattern makes tools really explicit to the agent.</li>
<li>This comes at the cost of using additional context, even when it&#x27;s not necessary.</li>
<li>If multiple MCP servers are installed, you may quickly run out of context space.</li>
</ul>
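<p>The eager copy step can be sketched as below. The <code>discovered</code> data and the <code>server__tool</code> naming convention are assumptions made for this example; real systems may namespace tools differently.</p>
<pre><code class="language-js">// Hypothetical discovery output: tool definitions keyed by MCP server name.
const discovered = {
  drive: [{ name: "read_file", description: "Read a Drive file" }],
  tickets: [
    { name: "read_file", description: "Read a ticket attachment" },
    { name: "search", description: "Search tickets" },
  ],
};

// Eagerly flatten every server's tools into the agent's tool list.
// Prefixing with the server name (an assumed convention) avoids name clashes.
function materialiseTools(discoveredTools) {
  const tools = [];
  for (const server of Object.keys(discoveredTools)) {
    for (const tool of discoveredTools[server]) {
      tools.push({
        name: server + "__" + tool.name,
        description: tool.description,
        server: server,
        originalName: tool.name,
      });
    }
  }
  return tools;
}

const agentTools = materialiseTools(discovered);
console.log(agentTools.map(function (tool) { return tool.name; }));
</code></pre>
<p>Every entry in <code>agentTools</code> is sent to the LLM on each request, which is why the context cost grows with the number of installed servers.</p>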
<h2 id="evaluating-these-patterns">Evaluating these patterns</h2>
<p>If you were designing an agentic platform, which option would you choose? I did a survey of some existing open source systems and here&#x27;s what I found:</p>
<ul>
<li><a href="https://github.com/openai/codex" target="_blank" rel="noopener">codex-cli</a> uses the materialised tools pattern</li>
<li><a href="https://github.com/sst/opencode" target="_blank" rel="noopener">opencode</a> uses the materialised tools pattern</li>
<li><a href="https://github.com/cline/cline" target="_blank" rel="noopener">Cline</a> uses a hybrid
<ul>
<li>All tools exposed by enabled MCP servers are copied to the system prompt</li>
<li>A single <code>use_mcp_tool</code> tool is used to invoke them</li>
</ul>
</li>
<li><a href="https://github.com/FoundationAgents/OpenManus" target="_blank" rel="noopener">OpenManus</a> also uses the materialised tools pattern.</li>
</ul>
<p>This was a surprise. I&#x27;m not sure why the existing implementations are so heavily skewed towards materialised tools!</p>
<h2 id="our-decision">Our Decision</h2>
<p>At this point I&#x27;m inclined to go with the <strong>meta tool pattern</strong>: it seems to make sense for the system we&#x27;re building.</p>
<p>What I have noticed is that:</p>
<ul>
<li>Most MCP servers don&#x27;t have a great agent interface.</li>
<li>They expose too many fine grained tools with overlapping responsibilities.</li>
<li>They are really wasteful of context tokens.</li>
</ul>
<p>Most importantly: the meta tool pattern just intuitively feels like the <em>right</em> solution to me.</p>
<p>There are 100s of small details like this that go into building great products – and I have a feeling that these details are actually what differentiates your product when everyone is building on the same foundation models.</p></article>
      ]]></content:encoded>
    </item>
    <item>
      <title>The Mythical Agent Month</title>
      <link>https://cleartax.in/ai/posts/the-mythical-agent-month/</link>
      <guid isPermaLink="true">https://cleartax.in/ai/posts/the-mythical-agent-month/</guid>
      <pubDate>Fri, 27 Jun 2025 00:00:00 GMT</pubDate>
      <author>Ankit Solanki</author>
      <description><![CDATA[How much of a speedup do coding agents give you?]]></description>
      <content:encoded><![CDATA[
<article class="prose dark:prose-invert"><p>The <a href="https://en.wikipedia.org/wiki/The_Mythical_Man-Month" target="_blank" rel="noopener">Mythical Man Month</a> famously observed that adding
manpower to a project that&#x27;s behind schedule will often delay it even
further. Additionally, in <a href="https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf" target="_blank" rel="noopener">&#x27;No Silver Bullet&#x27;</a>, Fred Brooks
states that:</p>
<blockquote>
<p>There is no single development, in either technology or management
technique, which by itself promises even one order-of-magnitude
improvement within a decade in productivity, in reliability, in
simplicity.</p>
</blockquote>
<p>How does this change with the advent of AI coding agents? Can coding
agents give us the mythical 10x speedup?</p>
<p>My thoughts below. I&#x27;ve divided this blog post into sections which argue
both for and against transformative change.</p>
<p>My position: coding agents are a fundamental shift in how we&#x27;ll build
software over the next few decades, and we&#x27;re overestimating the impact
in the short term while underestimating the impact in the long term.</p>
<h2 id="headwinds">Headwinds</h2>
<h3 id="vibe-coding-vs-writing-production-software">Vibe Coding vs Writing Production Software</h3>
<p>We should differentiate between &#x27;vibe coding&#x27; as a way to experiment /
prototype, and using AI to write production software. <a href="https://simonwillison.net/2025/Mar/19/vibe-coding/" target="_blank" rel="noopener">Simon
Willison</a> has an excellent post on this.</p>
<p>For production quality software, all basic tenets of software
engineering apply. Code reviews, tests, architecture designs, software
design reviews, etc.</p>
<p>Most of a senior engineer&#x27;s time is often spent in these activities, not
just writing code.</p>
<p>Coding agents help a great deal here, but you won&#x27;t see the same speedup
as pure vibe coding, where you build software <em>without reviewing the
code your agent writes</em>.</p>
<p>At least for now: coding agents aren&#x27;t good enough to fully autonomously
build features and ship them to production without human review.</p>
<h3 id="essential-complexity-vs-incidental-complexity">Essential Complexity vs Incidental Complexity</h3>
<p>As <a href="https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf" target="_blank" rel="noopener">Fred Brooks pointed out</a> in the above essay, software development
consists of both essential complexity and incidental complexity.</p>
<p>Incidental complexity could be things like figuring out how to write a
Dockerfile, or learning how a specific library works, or dealing with
framework specific issues. Coding agents can be a huge help here.</p>
<p>Essential complexity is the core problem you&#x27;re trying to solve. Coding
agents can definitely help here, but you still need to pay close
attention — humans will remain the bottleneck here.</p>
<p><a href="https://en.wikipedia.org/wiki/Amdahl%27s_law" target="_blank" rel="noopener">Amdahl&#x27;s Law</a> basically gives a ceiling for the performance gain
that automation / parallelisation gives for any given task. You are only
as fast as your bottleneck.</p>
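<p>To make the ceiling concrete, here is a small sketch of the standard Amdahl formula. The numbers are purely illustrative assumptions: suppose half of the work is sped up 10x by a coding agent and the other half is not sped up at all.</p>
<pre><code class="language-js">// Amdahl's Law: overall speedup when only part of the work gets faster.
// p: fraction of the work that is sped up (between 0 and 1)
// s: speedup factor applied to that fraction
function amdahlSpeedup(p, s) {
  return 1 / ((1 - p) + p / s);
}

// Illustrative assumption: half the work is accelerated 10x.
const tenTimesOnHalf = amdahlSpeedup(0.5, 10);

// Ceiling for the same split as s grows without bound.
const ceilingOnHalf = 1 / (1 - 0.5);

console.log(tenTimesOnHalf.toFixed(2), ceilingOnHalf);
</code></pre>
<p>Under these assumptions the overall speedup is roughly 1.8x, and no amount of further acceleration on that half can push it past 2x.</p>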
<h3 id="decision-fatigue-and-time-compression">Decision Fatigue and Time Compression</h3>
<p>Faster coding actually compresses timeframes and lets you focus on the
hard decisions, on the essential complexity. Coding agents let you focus
on the substance of your problem.</p>
<p>But human capacity for deep thought is limited!</p>
<p>So now, your day-to-day work with AI coding tools is going to be a
series of hard decisions that you need to think deeply about, decisions
that require a high amount of mental effort.</p>
<p>Decision fatigue is real. If you have to make a week&#x27;s worth of hard
decisions in a day, your decision quality will suffer.</p>
<p>AI coding will exhaust you if you&#x27;re not careful. Human beings need to
be able to step back and think about problems. We need to go for walks,
ruminate on ideas and just wander through a problem space.</p>
<h3 id="effective-communication-user-skill">Effective Communication &amp; User Skill</h3>
<p>AI agents need engineers to be effective communicators, and this is a
problem. Most engineers aren&#x27;t the best communicators. Not every great coder
is automatically great at delegation.</p>
<p>Effective communication is a skill. Writing clearly is a skill. And
using coding agents effectively is a skill.</p>
<p>For example, here are two recent articles that go in-depth about the
craft of using AI agents to code:</p>
<ul>
<li><a href="https://ampcode.com/how-i-use-amp" target="_blank" rel="noopener">How I use Amp</a></li>
<li><a href="https://blog.nilenso.com/blog/2025/05/29/ai-assisted-coding/" target="_blank" rel="noopener">AI-assisted coding for teams that can&#x27;t get away with vibes</a></li>
</ul>
<p>Craftsmanship takes time. Skills take time to build. People who are
great engineers today won&#x27;t automatically be great at using coding
agents. Getting better will require deliberate practice, and approaching
this problem with a beginner&#x27;s mindset.</p>
<h3 id="headwinds-summary">Headwinds Summary</h3>
<p>Given that:</p>
<ul>
<li>Production software (currently) requires human oversight</li>
<li>Essential complexity remains</li>
<li>It will take time to learn how to use the coding agents effectively</li>
</ul>
<p>Is an immediate 10x improvement in velocity possible? It seems there is
truly no silver bullet.</p>
<h2 id="tailwinds">Tailwinds</h2>
<h3 id="ai-scaling-will-continue">AI Scaling will continue</h3>
<p>Agents will keep getting better. Underlying models will keep getting
better. We&#x27;ve learned to not bet against <a href="https://cleartax.in/ai/posts/will-scaling-continue" target="_blank" rel="noopener">scaling</a>.</p>
<p>According to one recent viral benchmark, <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/" target="_blank" rel="noopener">the length of tasks that AI can
do uninterrupted</a> is doubling every 7 months.</p>
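<p>As a back-of-the-envelope sketch of what a 7 month doubling time implies, assuming the reported trend simply continues (an extrapolation made here for illustration, not a claim from the benchmark itself):</p>
<pre><code class="language-js">// Doubling time taken from the figure quoted above.
const DOUBLING_MONTHS = 7;

// Growth factor in task length after a given number of months,
// assuming the doubling trend holds unchanged.
function growthFactor(months) {
  return Math.pow(2, months / DOUBLING_MONTHS);
}

console.log(growthFactor(14)); // two doublings
console.log(growthFactor(28)); // four doublings
</code></pre>
<p>Two doublings give a 4x increase in task length, and four doublings give 16x.</p>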
<p>From my personal experience, I know that each recent big model release
(eg: Sonnet 3.5, Sonnet 3.7, Sonnet 4.0) has made building agents
easier. The LLMs are getting better at following instructions, at using
tools, at planning, and just at showing <em>agency</em>.</p>
<p>It&#x27;s hard to predict the future, but it&#x27;s definitely possible that soon,
a large majority of code written won&#x27;t need human review and oversight.</p>
<h3 id="most-work-isnt-deep-work">Most work isn&#x27;t &#x27;Deep Work&#x27;</h3>
<p>While my arguments above hold (essential complexity remains, decision
fatigue is real) — let&#x27;s be real and acknowledge the fact that most of
us don&#x27;t do deep work 100% of the time.</p>
<p>A lot of time goes into glue work, into getting various subsystems to
behave, dealing with broken tools, etc.</p>
<p>AI can be a huge accelerator for these types of work. It&#x27;s possible that
this itself can be a 10x improvement for many organisations!</p>
<h3 id="quantity-has-its-own-quality">Quantity has its own Quality</h3>
<p>If you&#x27;re working in deep tech, if you&#x27;re building something complex —
AI coding allows you to try more approaches. You can build quick and
dirty throwaway prototypes, and validate more ideas.</p>
<p>Quantity has a quality of its own. If you can do more iterations, you
can get to better decisions. If you can actually build multiple
candidate systems, you can make more informed choices — architecture
/ design decisions become easier with data.</p>
<p>I have personally seen this pay off: while building a zero to one
product, I have been able to do many parallel experiments and actually
test 10s of ideas before deciding upon a plan. This has enabled me to
make bold system design bets with high confidence.</p>
<h3 id="ambition-moonshots">Ambition &amp; Moonshots</h3>
<p>AI agents allow you to <a href="https://cleartax.in/ai/posts/be-more-ambitious" target="_blank" rel="noopener">be more ambitious</a>. On the margin, with
higher productivity it&#x27;s possible to devote more time and resources
towards building something <em>better</em> than you would have previously.</p>
<p>You can think of a productivity gain as either:</p>
<ul>
<li>Work 10x faster</li>
<li>Build something 10x better</li>
</ul>
<p>Either option is fine! In fact, it may be the case that building
something 10x better is actually more impactful.</p>
<p>I suspect one of the impacts of ubiquitous coding agents will be the
rising baseline quality of software!</p>
<h3 id="tailwinds-summary">Tailwinds Summary</h3>
<p>If you consider the facts that:</p>
<ul>
<li>LLMs will continue to improve</li>
<li>We&#x27;ll all get better at using AI agents</li>
<li>We&#x27;ll be able to automate away low impact work</li>
<li>We&#x27;ll be able to try many more iterations</li>
<li>We&#x27;ll be able to build more impactful, more meaningful software</li>
</ul>
<p>How can you doubt the impact that coding agents will have?</p>
<h2 id="conclusions">Conclusions</h2>
<p>I&#x27;ve argued both sides of this. My position is:</p>
<ul>
<li>We&#x27;re radically underestimating coding agents</li>
<li>Most of us are not ready to adopt agents at scale</li>
</ul>
<p>I think impact and adoption will not be uniform. People will have
different lived experiences with AI tools, with some dismissing AI
coding as a fad, and some enthusiastically thinking of these agents as a
panacea for all their problems.</p>
<p>I think coding agents will have a huge impact that is overrated in the
short term, but underrated in the long term.</p>
<p>And I think that today, to get the most out of current generation
agents, you have to really dive deep and uncover their limits yourself.</p></article>
      ]]></content:encoded>
    </item>
    <item>
      <title>General Purpose vs Specialised AI Agents</title>
      <link>https://cleartax.in/ai/posts/general-agents-vs-specialised-agents/</link>
      <guid isPermaLink="true">https://cleartax.in/ai/posts/general-agents-vs-specialised-agents/</guid>
      <pubDate>Tue, 10 Jun 2025 00:00:00 GMT</pubDate>
      <author>Ankit Solanki</author>
      <description><![CDATA[Should agents be general purpose or should they be built for a specific task?]]></description>
      <content:encoded><![CDATA[
<article class="prose dark:prose-invert"><p>Should AI agents be general purpose or should they be built for a specific task?</p>
<p>I would say <strong>yes to both</strong>.</p>
<h3 id="special-purpose-agents">Special Purpose Agents</h3>
<p>A special purpose agent is an agent built to do one thing (or a few things) very precisely. Examples:</p>
<ul>
<li>RAG on a knowledge base and answer questions <em>only from the knowledge base</em>.</li>
<li>Convert images to a very specific structured data format (eg: convert an image to a contact vCard).</li>
<li>Convert text to SQL on a specific table, with specific guardrails (eg: always apply page size limits).</li>
</ul>
<p>These agents are built to do a specific job.</p>
<h3 id="general-purpose-agents">General Purpose Agents</h3>
<p>A general purpose agent on the other hand would be an agent that has <em>emergent behaviour</em>, that can use its tools to do something it wasn&#x27;t explicitly programmed for.</p>
<p>This is best demonstrated by an example. I recently coded up a toy agent that runs on the command line that had the following tools available:</p>
<ul>
<li>List directory</li>
<li>Read file</li>
<li>Query file (via <a href="https://duckdb.org/" target="_blank" rel="noopener">duckdb</a>)</li>
<li>Read PDF</li>
</ul>
<p>This agent was designed to have generic tools, and it was allowed to do multiple tool calls if needed. It wasn&#x27;t given a specific goal.</p>
<p>The lack of specificity actually made the agent more useful! In the last few days, I have used it to do:</p>
<ul>
<li>K8S cost optimisation — given a CSV containing some kubernetes utilisation data, this agent helped me find low hanging cost optimisation options</li>
<li>Data entry for my own personal finance needs — given some account statements, the agent was able to help me digitise them in a format I use for tracking my expenses</li>
<li>Q&amp;A on my meeting notes</li>
</ul>
<p>Interestingly, the agent showed a lot more <em>agency</em> than I was expecting. For example, it would often execute multiple tool calls in response to a general &#x27;hello&#x27; message!</p>
<p><img alt="Response to a &#x27;hello&#x27; message" loading="lazy" width="0" height="0" decoding="async" data-nimg="1" style="color:transparent;width:100%;height:auto" src="https://assets1.cleartax-cdn.com/content-prod/images/agent-hello.png"/></p>
<p>Most of the recent wow moments I have had with AI have been with general purpose agents. They can end up giving you unexpectedly rich experiences.</p>
<h3 id="use-cases-trade-offs">Use Cases &amp; Trade-offs</h3>
<p>If you&#x27;re building an AI enabled product, both special purpose agents and general purpose agents have their place.</p>
<ul>
<li>
<p>You sometimes want determinism (or close to determinism) and repeatability.</p>
<ul>
<li>For example: if you&#x27;re building a support desk, you may want to always categorise tickets in a certain way.</li>
<li>In such scenarios, special purpose agents are really useful. You can treat them as &quot;intelligence that&#x27;s an API call away&quot;.</li>
</ul>
</li>
<li>
<p>Special purpose agents are limited in scope.</p>
</li>
<li>
<p>General purpose agents are where the power of AI agents becomes apparent.</p>
</li>
<li>
<p>General purpose agents are going to be more expensive to build.</p>
<ul>
<li>You may need smarter LLMs, you may need to spend tokens on reasoning.</li>
<li>General purpose agents could use up millions of tokens.</li>
</ul>
</li>
<li>
<p>Shipping general purpose agents requires you to have the right underlying infrastructure.</p>
<ul>
<li>The agent is as powerful as the tools it has access to. You need to build the right primitives to unshackle the AI model.</li>
<li>You need to build the right platform for the AI agents to leverage!
<ul>
<li>For example: if your system processes a lot of data, you may need to invest in robust parallel execution.</li>
<li>If the agent could &#x27;ask any question&#x27; of your data, you have to think about indexing and database performance.</li>
</ul>
</li>
</ul>
</li>
<li>
<p>The ideal end state of a general purpose agent is a <a href="https://arxiv.org/abs/2402.01030" target="_blank" rel="noopener">coding agent</a> — an agent that can write code for a specific task and then execute it.</p>
<ul>
<li>This could lead to potential security issues, and you have to look out for other types of abuse.</li>
<li>You might need to invest in primitives like sandboxing here.</li>
</ul>
</li>
</ul>
<p>I see this as a continuum — the more general an agent, the more powerful it is, but you need to invest proportionally into building the right safeguards.</p>
<p>Most use cases may not need a general purpose agent. You can probably start off building an AI-enabled product by just focusing on special purpose agents.</p>
<p>General purpose agents are also the most valuable ones though. And building general purpose agents will require you to invest proportionally into your platform&#x27;s core primitives.</p></article>
      ]]></content:encoded>
    </item>
    <item>
      <title>Peeking Inside the AI Mind</title>
      <link>https://cleartax.in/ai/posts/peeking-inside-the-ai-mind/</link>
      <guid isPermaLink="true">https://cleartax.in/ai/posts/peeking-inside-the-ai-mind/</guid>
      <pubDate>Wed, 28 May 2025 00:00:00 GMT</pubDate>
      <author>Vinay Shashidhar</author>
      <description><![CDATA[Anthropic]]></description>
      <content:encoded><![CDATA[
<article class="prose dark:prose-invert"><p>We&#x27;ve all been amazed, and perhaps also a little nervous, if I may say, by the capabilities of Large Language Models (LLMs). They can write poetry, generate code and answer complex questions. For a piece of code that is supposedly just doing next-token prediction, an LLM excels at a lot of impressive tasks. That raises the question: <em>do they actually just predict the next token? Are they planning ahead of time? How do they actually do it?</em> <em>Is the reasoning in the “reasoning model” an afterthought, or did the LLM actually follow the sequence of steps?</em></p>
<p>This opacity isn&#x27;t just a matter of academic curiosity. It has real-world implications for safety, reliability, and our ability to trust these powerful tools. If we don&#x27;t understand why an LLM says what it says, how can we be sure it&#x27;s not biased or generating misinformation? More importantly, how can we control what it says?</p>
<p>Anthropic has taken a significant step towards demystifying these complex systems. I would highly recommend that everyone at least read their <a href="https://www.anthropic.com/research/tracing-thoughts-language-model" target="_blank" rel="noopener">blog</a>. They have also published a couple of papers: <a href="https://transformer-circuits.pub/2025/attribution-graphs/methods.html" target="_blank" rel="noopener">Circuit Tracing: Revealing Computational Graphs in Language Models</a> and <a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html" target="_blank" rel="noopener">On the Biology of a Large Language Model.</a> I am personally very surprised that more people are not talking about it.</p>
<h3 id="object-object"><strong>The Challenge: Understanding the Unseen</strong></h3>
<p>Imagine trying to understand <a href="https://news.harvard.edu/gazette/story/2015/10/how-the-brain-builds-new-thoughts/" target="_blank" rel="noopener">how a human brain forms a thought</a> - while we have some groundbreaking research in this area, we still have a lot to learn. Our brain is an incredibly complex network of neurons firing in intricate patterns. LLMs, while inspired by neural networks, have their own unique architecture, often involving billions of parameters. Identifying specific, human-understandable concepts within these vast networks has been a very big challenge.</p>
<h3 id="object-object"><strong>The Inspiration: Looking to Life Itself</strong></h3>
<p>The researchers draw an analogy from the world of biology, specifically <a href="https://www.nature.com/articles/s41576-023-00618-5" target="_blank" rel="noopener">Gene Regulatory Networks</a> (GRNs). To understand this better let us think about how a single fertilized egg develops into an incredibly complex organism with specialized cells, tissues, and organs. Genes don&#x27;t act in isolation. They influence each other, turning other genes on or off in intricate cascades. A GRN is essentially a map of these genetic interactions, showing which genes regulate which others, ultimately leading to specific cellular functions and observable traits. Biologists use these networks to understand development, disease, and the fundamental workings of life.</p>
<h3 id="object-object"><strong>The Methodology: Building an &quot;AI Microscope&quot;</strong></h3>
<p>To tackle the challenge of understanding LLMs, researchers are developing tools analogous to those used in biology to map complex systems. At the heart of this new methodology is the idea of charting out the internal computational pathways.</p>
<h4 id="object-object"><strong>What are Attribution Graphs?</strong></h4>
<p>Taking a cue from concepts like GRNs, an <strong>Attribution Graph</strong> is, at its core, a map of influence within a Transformer model.</p>
<ul>
<li><strong>Nodes</strong> in this graph represent the key components of the model (e.g., specific attention heads in particular layers, or MLP layers).</li>
<li><strong>Edges</strong> (the connections between nodes) represent the strength and direction of influence. An edge from component A to component B means A significantly contributes to B&#x27;s activation or behavior.</li>
</ul>
<p>By constructing these graphs, researchers can start to trace the pathways of information and causation through the network. They can see how an input (like a prompt) activates certain components, which in turn influence others, leading step-by-step to the model&#x27;s output. It’s like finally getting a circuit diagram for a complex electronic device, allowing us to see how different parts work together to achieve a function.</p>
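<p>To make the idea concrete, here is a minimal sketch of an attribution graph as a weighted directed graph in Python. The component names and influence weights are invented for illustration and are not taken from the papers. The helper <code>strongest_path</code> is also hypothetical: it simply finds the chain of influence with the largest product of edge weights, assuming the graph has no cycles.</p>

```python
# Toy attribution graph: nodes are model components, edge weights are
# influence strengths. All names and numbers are made up for illustration.
graph = {
    "prompt": {"L1.attn_head_3": 0.8, "L1.mlp": 0.3},
    "L1.attn_head_3": {"L2.mlp": 0.7},
    "L1.mlp": {"L2.mlp": 0.2},
    "L2.mlp": {"output": 0.9},
    "output": {},
}


def strongest_path(graph, source, target):
    """Return (score, path) for the path with the largest product of edge weights.

    Assumes the graph is acyclic. Returns (0.0, []) when no path exists.
    """
    if source == target:
        return 1.0, [source]
    candidates = [(0.0, [])]
    for neighbour, weight in graph[source].items():
        score, path = strongest_path(graph, neighbour, target)
        if path:
            candidates.append((weight * score, [source] + path))
    return max(candidates, key=lambda item: item[0])


score, path = strongest_path(graph, "prompt", "output")
print(path, round(score, 3))
```

<p>Multiplying weights along a path is only one simple way to rank routes through the graph; the papers use considerably more careful attribution methods.</p>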
<h4 id="object-object"><strong>How Did They Do It?</strong></h4>
<p>The creation and analysis of these attribution graphs involve several sophisticated techniques. I have tried to explain each of them with an analogy, but this is still a very high-level overview. I highly recommend going through the papers for more detail.</p>
<ul>
<li>
<p><strong>Extracting Interpretable Features:</strong> A crucial first step is to identify meaningful concepts within the model. One method involves training sparse autoencoders on the internal activations of models like Claude Sonnet. These autoencoders learn to represent each of the LLM&#x27;s internal states using only a few active, more interpretable &quot;features.&quot; In simpler terms: imagine you have a giant paragraph (<strong>the LLM&#x27;s internal activation</strong>). It&#x27;s hard to quickly grasp its main points. A sparse autoencoder acts like a summarizer. It reads the giant paragraph, identifies a few keywords (<strong>the &quot;sparse features&quot;</strong>) that capture its essence, and then tries to rewrite the original paragraph using only those few keywords. If the rewrite is good, those few keywords are indeed very important. Researchers can then study these &quot;keywords&quot; (<strong>the sparse features</strong>) to understand what core concepts the LLM is using in its internal &quot;<strong>thoughts</strong>.&quot; This helps demystify what&#x27;s happening inside the LLM.</p>
</li>
<li>
<p><strong>Cross-Layer Transcoders (CLT) and Replacement Models:</strong> To make the complex model more tractable, researchers build an interpretable &quot;replacement model.&quot; This often involves using Cross-Layer Transcoders (CLTs), which help extract features and understand their interactions across different layers of the model. Continuing our previous example: an LLM isn&#x27;t just one paragraph, but a whole sequence of connected paragraphs, like chapters in a book. Each &quot;paragraph&quot; is the activation state of a different layer, building on the previous one. CLTs are like specialized literary analysts who study the connections between different paragraphs (layers). They don&#x27;t just summarize one paragraph. Instead, they try to figure out: &quot;How do the &#x27;keywords&#x27; we found in Paragraph 1 (an early layer) get transformed or combined to produce the &#x27;keywords&#x27; we see in Paragraph 5 (a later layer)?&quot; Researchers then build an &quot;interpretable replacement model&quot; (think CliffsNotes or a detailed study guide for a specific section of the book), making a section of the complex &quot;book&quot; (the LLM) much easier to comprehend.</p>
</li>
<li>
<p><strong>Feature Visualization and Labeling</strong>: Once features are extracted, they are visualized and labelled with human-interpretable concepts – from concrete objects to abstract ideas. This is like finding all the sentences (input examples) where this particular &quot;keyword&quot; is most strongly &quot;lit up&quot; or active within the LLM&#x27;s &quot;thoughts.&quot; By looking at many such examples, they might see that this keyword consistently appears when the LLM is discussing, say, &quot;ancient stone castles,&quot; &quot;dogs playing in a park,&quot; or even more abstractly, &quot;a sense of injustice.&quot; Then, based on these observations and the common theme across all these examples, they &quot;label&quot; the keyword with a human-interpretable concept. So, an initially obscure numerical feature might be labeled as &quot;Feature 123: &#x27;Imagery of Medieval Architecture&#x27;&quot; (a concrete object category) or &quot;Feature 456: &#x27;Concept of Betrayal&#x27;&quot; (an abstract idea).</p>
</li>
<li>
<p><strong>Linking Concepts into Circuits:</strong> Beyond identifying features, researchers also link them into computational &quot;circuits.&quot; This allows for tracing the pathway from input words to output words, showing how different concepts interact and build upon each other. Think of this as going beyond just identifying the themes (&quot;keywords&quot;) in our book. Researchers try to discover the specific &quot;narrative pathways&quot; within the LLM. For instance, they might find that when the &quot;input words&quot; (the prompt) introduce a character described with terms that activate the &quot;keyword&quot; for &#x27;ambition&#x27;, this consistently leads to the activation of the &quot;keyword&quot; for &#x27;risk-taking behaviour&#x27; a few &quot;paragraphs&quot; (layers) later. This, in turn, might frequently interact with an activated &quot;keyword&quot; for &#x27;external threat&#x27;, ultimately contributing to the &quot;keyword&quot; for &#x27;heroic sacrifice&#x27; being prominent when the LLM generates its &quot;output words.&quot; By tracing these connections, they are essentially drawing a &quot;circuit diagram&quot; of the story, showing how one concept (like &#x27;ambition&#x27;) computationally triggers and combines with other concepts (like &#x27;risk-taking&#x27; and &#x27;external threat&#x27;) to build a coherent narrative or logical conclusion, effectively mapping the journey from the initial premise to the final output.</p>
</li>
<li>
<p><strong>Perturbation Experiments:</strong> Inspired by neuroscience, researchers conduct perturbation experiments. They modify the internal states of the model (e.g., activating or deactivating specific features) and observe the resulting changes in output. This helps validate the hypothesized role of different features and circuits. Continuing with our analogy →</p>
<ol>
<li>&quot;Deactivate&quot; a specific feature: Imagine our LLM is about to write a scene, and the &quot;keyword&quot; for &quot;&#x27;Concept of Betrayal&#x27;&quot; is naturally becoming active based on the story so far. The researcher might step in and say, &quot;For this next passage, let&#x27;s artificially suppress or mute the &#x27;Betrayal&#x27; keyword. Don&#x27;t let the LLM &#x27;feel&#x27; or express betrayal right now.&quot;</li>
<li>&quot;Activate&quot; a specific feature: Conversely, they could say, &quot;Even if the current story doesn&#x27;t strongly suggest it, let&#x27;s artificially boost the &#x27;Hero&#x27;s Courage&#x27; keyword here and see what happens.&quot;</li>
</ol>
<p>It&#x27;s like asking: &quot;If we remove this thematic element (keyword) or alter this plot device (circuit), does the story change in the way we predicted?&quot; This experimental approach allows researchers to confirm whether their understanding of how these internal concepts influence the LLM&#x27;s final &quot;narrative&quot; (output) is accurate.</p>
</li>
</ul>
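<p>The &quot;keywords&quot; analogy and the perturbation step above can be sketched in a few lines of Python. This is a toy illustration only: the feature dictionary is hand-written rather than learned, picking the top-k strongest features stands in for a trained sparse autoencoder, and every name and number is an assumption made for this example, not something taken from the papers.</p>

```python
# Hypothetical feature dictionary: each feature is a direction in a
# 4-dimensional activation space. Real dictionaries are learned from data.
FEATURES = {
    "castle": [1.0, 0.0, 0.0, 0.0],
    "betrayal": [0.0, 1.0, 0.0, 0.0],
    "courage": [0.0, 0.0, 1.0, 0.0],
    "weather": [0.0, 0.0, 0.0, 1.0],
}


def encode(activation, k=2):
    """Keep only the k features that respond most strongly to the activation."""
    strengths = {
        name: sum(a * d for a, d in zip(activation, direction))
        for name, direction in FEATURES.items()
    }
    top = sorted(strengths, key=strengths.get, reverse=True)[:k]
    return {name: strengths[name] for name in top}


def decode(sparse_code):
    """Rebuild an approximate activation from the active features only."""
    rebuilt = [0.0] * 4
    for name, strength in sparse_code.items():
        for i, d in enumerate(FEATURES[name]):
            rebuilt[i] += strength * d
    return rebuilt


activation = [0.1, 0.9, 0.7, 0.05]
code = encode(activation)        # the sparse "keywords" for this activation
reconstruction = decode(code)    # the "rewrite" using those keywords alone

# Perturbation: mute the 'betrayal' feature and decode again.
ablated = {name: (0.0 if name == "betrayal" else s) for name, s in code.items()}
perturbed = decode(ablated)

print(code)
print(reconstruction)
print(perturbed)
```

<p>Comparing <code>reconstruction</code> with <code>perturbed</code> shows the effect of muting one feature, which is the same question a perturbation experiment asks of the real model&#x27;s output.</p>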
<h3 id="object-object"><strong>Key Discoveries: Peeking into the AI Mind</strong></h3>
<p>Let us now dive into some fascinating insights highlighted in the papers →</p>
<ul>
<li><strong>Identifying and Eliciting Specific Concepts:</strong> Researchers found they could pinpoint and even activate specific concepts within the model. The most cited example is the &quot;Golden Gate Bridge&quot; feature: as the Anthropic team states, &quot;When we activate a feature for the Golden Gate Bridge, our model writes about the Golden Gate Bridge.&quot; They&#x27;ve identified thousands of such features, ranging from other concrete entities like &quot;the Eiffel Tower&quot; to more abstract notions such as &quot;gender.&quot; This demonstrates that the model internally represents these concepts in a way that can be isolated and triggered.</li>
<li><strong>Multi-Step Reasoning and Sophisticated Planning</strong>: One very interesting insight was that LLMs aren&#x27;t just reacting; they&#x27;re planning.
<ul>
<li><strong>Forward Planning:</strong> A clear illustration of this is how Claude approaches creative writing. The Anthropic blog notes, &quot;For example, when Claude writes poetry, it often plans out rhymes far in advance of using them.&quot; This indicates an ability to think several steps ahead. The blog provides another example: &quot;when asked to complete the sentence &#x27;The first U.S. president was...&#x27; it activates a feature for &#x27;George Washington&#x27; before it writes his name.&quot;</li>
<li><strong>Backward Planning:</strong> The research also found evidence of models &quot;working backward from goal states,&quot; a more complex form of planning where the model seems to start with a desired outcome and deduces the steps to get there.</li>
</ul>
</li>
<li><strong>Shared Conceptual Space Across Different &quot;Languages&quot;</strong>: Demonstrating a surprisingly abstract level of understanding, the research found that models like Claude might develop a universal internal representation for similar concepts, even if they appear in very different domains. For instance, the Anthropic blog explains, &quot;<em>our model seems to use a shared conceptual space for English and Python code—even though these two &#x27;languages&#x27; look very different on the surface</em>.&quot; This suggests a deeper, underlying &quot;<strong>language of thought.</strong>&quot; This abstract understanding isn&#x27;t limited to natural language and code. We see similar shared representations for concepts that appear in, say, German and English, or in different programming languages like Python and Java.</li>
<li><strong>Primitive &quot;Metacognitive&quot; Abilities</strong>: The models show early signs of understanding their own knowledge limitations. A common example cited by Anthropic is that &quot;Claude’s default behavior is often to decline to speculate and only answer if it thinks it has sufficient information.&quot; This self-awareness, or primitive metacognition, is crucial for reliability.</li>
<li><strong>The Double-Edged Sword: Fabrication and Coherence Exploits:</strong>
<ul>
<li><strong>Fabricating Plausible Arguments:</strong> The research uncovered that Claude also seems to make things up. &quot;<em>We’ve caught our model fabricating plausible-sounding arguments, even when it knows its conclusion is false,</em>&quot; as stated in the Anthropic blog.</li>
<li><strong>Exploitable Coherence for &quot;Jailbreaks&quot;</strong>: A model&#x27;s helpful tendency can sometimes be its vulnerability. The Anthropic blog points out that <em>&quot;Claude’s usually-helpful tendency to try to maintain grammatical coherence across turns of conversation can be exploited by certain &#x27;jailbreak&#x27; attacks, which trick it into harmful responses.</em>&quot;</li>
</ul>
</li>
</ul>
<h3 id="object-object"><strong>Why Does This Matter?</strong></h3>
<p>The implications of this research are far-reaching:</p>
<ul>
<li><strong>Enhanced Interpretability and Trust:</strong> We can finally start to see how an LLM arrives at an answer, moving away from the &quot;black box&quot; paradigm. This transparency is fundamental to building trust. A deeper understanding of the model&#x27;s internal workings can help us build more robust, reliable, and trustworthy AI systems.</li>
<li><strong>Effective Debugging:</strong> When an LLM makes a mistake or behaves unexpectedly, these techniques could help pinpoint the internal cause.</li>
<li><strong>Steerability and Control:</strong> This research could lead to more precise control over LLM behavior, guiding them towards more helpful, truthful, and safe outputs. Understanding the internal mechanisms also helps in assessing the suitability of language models for various applications and identifying their limitations.</li>
</ul>
<h3 id="object-object"><strong>The Journey Ahead</strong></h3>
<p>Anthropic is quick to point out that this is still early-stage research. The complexity of these models means there&#x27;s a long road ahead. However, the techniques presented in &quot;Tracing &#x27;Thoughts&#x27; in Language Models&quot; represent a significant leap forward. It’s a pioneering effort in the quest for AI transparency and safety.</p></article>
      ]]></content:encoded>
    </item>
    <item>
      <title>Be More Ambitious</title>
      <link>https://cleartax.in/ai/posts/be-more-ambitious/</link>
      <guid isPermaLink="true">https://cleartax.in/ai/posts/be-more-ambitious/</guid>
      <pubDate>Wed, 14 May 2025 00:00:00 GMT</pubDate>
      <author>Ankit Solanki</author>
      <description><![CDATA[Hard things are now easy. What can you build today?]]></description>
      <content:encoded><![CDATA[
<article class="prose dark:prose-invert"><p>Building an OCR / Document AI pipeline used to be hard work.</p>
<ul>
<li>Training OCR models</li>
<li>Building multi-step systems (e.g., line segmentation, layout detection,
table detection)</li>
<li>Ensuring that these steps work nicely with each other</li>
<li>Adding heuristics and special cases</li>
<li>Doing a lot of testing to ensure your baseline is good enough</li>
</ul>
<p>Now: you could just give a PDF to Gemini Flash and get a &#x27;good enough&#x27;
output in one API call. What used to take weeks / months can now be done
in hours.</p>
<p>With AI models becoming more capable — you can often get very far on
your problem statement <em>if you just try</em>. But if you are stuck in an old
mindset and think about some problems as difficult or time consuming,
you may not even attempt harder problems!</p>
<p>I don&#x27;t think this realisation has sunk in yet. We still follow old patterns
of behaviour, we still mostly try to build the same things as in the
past.</p>
<p>Here&#x27;s another recent example: I needed to parse something reliably. The
&#x27;right way&#x27; to do this is to write a tokeniser / parser, but this is
time-consuming. In the past, I would have just depended on some basic
regexes and would have just got it working (and over time — fixed edge
cases as they came up).</p>
<p>This time though: I decided to do it in the &#x27;right way&#x27;. I worked with
an AI coding agent and got a tokeniser and recursive descent parser
working in around an hour. The AI wasn&#x27;t perfect — I definitely needed
to know the theory and know when to give the right inputs. Even though
this wasn&#x27;t automatic, I ended up with at least a 10x productivity
boost.</p>
<p>The hard lesson for me personally was:</p>
<ul>
<li>I needed to <em>let go</em> of my pre-conceived notions that a task can take
days / weeks.</li>
<li>I had to <em>decide</em> to build something ambitious.</li>
</ul>
<p>I need to keep reminding myself of the fact that <strong>hard things are now
easy</strong>, that the impossible may be actually possible.</p>
<p>I think this is a problem where younger folks have an edge over more
experienced people: with experience comes caution, but now is the time to
let go of caution and just build.</p>
<p>To everyone reading:</p>
<ul>
<li>I suggest partnering with one of the current frontier models:
<ul>
<li>Use <a href="https://www.cursor.com/" target="_blank" rel="noopener">Cursor</a>, or <a href="https://docs.anthropic.com/en/docs/claude-code/overview" target="_blank" rel="noopener">Claude Code</a>, or <a href="https://windsurf.com/" target="_blank" rel="noopener">Windsurf</a>, or <a href="https://v0.dev/" target="_blank" rel="noopener">v0</a>, or <a href="https://lovable.dev/" target="_blank" rel="noopener">Lovable</a>, or
anything else you want.</li>
</ul>
</li>
<li>Try to build something ambitious today!</li>
</ul></article>
      ]]></content:encoded>
    </item>
    <item>
      <title>Beyond the Code: What </title>
      <link>https://cleartax.in/ai/posts/ml-engineering-excellence/</link>
      <guid isPermaLink="true">https://cleartax.in/ai/posts/ml-engineering-excellence/</guid>
      <pubDate>Thu, 08 May 2025 00:00:00 GMT</pubDate>
      <author>Vinay Shashidhar</author>
      <description><![CDATA[This article defines ML engineering excellence as an extension of software engineering excellence]]></description>
      <content:encoded><![CDATA[
<article class="prose dark:prose-invert"><p>We all have a pretty good idea of what makes for &quot;Software Engineering Excellence&quot;. It’s a playbook refined over decades: write clean, maintainable code, test everything early and often, ship reliably, keep an eye on things in production, and then do it all over again. It’s a solid foundation.</p>
<p><strong>ML excellence is software engineering excellence, plus data excellence, plus learning-system excellence.</strong> It inherits all that great software engineering DNA, but it also has to wrestle with a whole new set of challenges:</p>
<ul>
<li>messy, ever-changing data,</li>
<li>models whose accuracy slowly decays as the world changes around them (we call it &quot;drift&quot;),</li>
<li>the inherent fuzziness of statistics,</li>
<li>and a research landscape that moves at lightning speed.</li>
</ul>
<h2 id="the-cornerstones-of-doing-ml-engineering-right">The Cornerstones of Doing ML Engineering Right</h2>
<ol>
<li>
<p><strong>Staying Ahead: Proactive Maintenance &amp; Quality</strong></p>
<ul>
<li>Don&#x27;t let &quot;technical debt&quot; pile up – whether it&#x27;s in your data pipelines, notebooks, or the infrastructure itself. Tackle it early.</li>
<li>Models and even the prompts we use for them can &quot;drift&quot; over time, becoming less accurate. We need to watch for this and fix it.</li>
<li>Garbage in, garbage out. Ensuring data is high quality, both where it comes from and as it flows through our systems, is paramount.</li>
</ul>
</li>
<li>
<p><strong>Building Smart: Systematic Execution &amp; Automation</strong></p>
<ul>
<li>If it&#x27;s repetitive and boring, automate it! Think data validation checks, generating new features, or trying out countless model settings (hyper-parameter sweeps).</li>
<li>Write everything down. Document your code, what version of a dataset you used, where your features came from, and what happened in each experiment.</li>
<li>Keep an eye on everything: how fast your models respond (latency), how accurate they are and how much they cost to run. Set up alerts so the right people know when something’s off.</li>
</ul>
</li>
<li>
<p><strong>Always Learning: Continuous Improvement &amp; Innovation</strong></p>
<ul>
<li>The ML world changes fast. Keep up with new algorithms, tools, and even hardware that can make your systems better.</li>
<li>Don&#x27;t just build it and forget it. Constantly look for ways to improve your metrics, make things cheaper to run, or get answers faster.</li>
</ul>
</li>
<li>
<p><strong>Defining Success: Clarity &amp; Precision</strong></p>
<ul>
<li>Before you even start, figure out what &quot;good&quot; looks like. This means clear business goals (KPIs) <em>and</em> the technical model metrics that support them.</li>
<li>Be upfront about what your model <em>can&#x27;t</em> do. Point out its limitations, where biases might creep in, and any weird edge cases – ideally, before your users stumble upon them.</li>
</ul>
</li>
</ol>
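<p>As one possible illustration of watching for drift and alerting on it, here is a minimal Python sketch that compares a model&#x27;s training-time scores with recent production scores using the Population Stability Index (PSI). The sample data, the bin edges, and the 0.2 alert threshold are all assumptions chosen for this example; PSI is just one of several drift measures you could use.</p>

```python
import bisect
import math


def psi(expected, actual, edges):
    """Population Stability Index between two samples over shared bin edges."""

    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for value in values:
            counts[bisect.bisect_right(edges, value)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(count / len(values), 1e-6) for count in counts]

    exp, act = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp, act))


# Hypothetical model scores: training time vs. recent production traffic.
training = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.35]
production = [0.6, 0.7, 0.8, 0.9, 0.95, 0.85, 0.65, 0.9, 0.8, 0.55]
edges = [0.25, 0.5, 0.75]

ALERT_THRESHOLD = 0.2  # assumed rule of thumb; tune for your own data

drift_score = psi(training, production, edges)
needs_alert = drift_score >= ALERT_THRESHOLD
print(round(drift_score, 3), needs_alert)
```

<p>In a real pipeline a check like this would run on a schedule, and <code>needs_alert</code> would feed whatever alerting system the team already uses.</p>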
<h2 id="where-ml-feels-familiar-to-software-engineers">Where ML Feels Familiar to Software Engineers</h2>
<p>A lot of this will sound familiar if you&#x27;re from a software background:</p>
<ul>
<li><strong>Quality is King:</strong> We still obsess over tests, code reviews, and smooth CI/CD pipelines.</li>
<li><strong>Automate Everything:</strong> Reproducible builds and one-click deployments.</li>
<li><strong>Know What&#x27;s Happening:</strong> Logs, metrics, alerts, and service level objectives (SLOs) are crucial.</li>
<li><strong>Small Steps, Big Progress:</strong> We work in small batches, stay agile, and use things like feature flags.</li>
<li><strong>If It&#x27;s Not Written Down, It Didn&#x27;t Happen:</strong> READMEs, decision records and design docs are essential.</li>
</ul>
<h2 id="object-object"><strong>But Here&#x27;s Where ML Charts Its Own Course</strong></h2>
<p>While ML builds on software engineering, it introduces unique twists:</p>
<table><thead><tr><th style="text-align:left">Dimension</th><th style="text-align:left">Classical Software</th><th style="text-align:left">ML Engineering</th></tr></thead><tbody><tr><td style="text-align:left"><strong>Source of Truth</strong></td><td style="text-align:left">Deterministic code</td><td style="text-align:left">Data + algorithms that learn (stochastic)</td></tr><tr><td style="text-align:left"><strong>How Things Break</strong></td><td style="text-align:left">Bugs, system outages</td><td style="text-align:left">Bugs, outages, <strong>plus</strong> data changing unexpectedly, concepts drifting, and bias</td></tr><tr><td style="text-align:left"><strong>Testing Focus</strong></td><td style="text-align:left">Unit/integration tests for code logic</td><td style="text-align:left">Data validation, statistical tests, comparing offline vs. online (A/B), shadow deployments</td></tr><tr><td style="text-align:left"><strong>What You Deploy</strong></td><td style="text-align:left">A binary or container</td><td style="text-align:left">Model weights, feature store setup, training code, a snapshot of the training data</td></tr><tr><td style="text-align:left"><strong>Lifespan</strong></td><td style="text-align:left">Code often stays stable once shipped</td><td style="text-align:left">Model performance naturally degrades; retraining is a normal part of life</td></tr><tr><td style="text-align:left"><strong>The Team You Need</strong></td><td style="text-align:left">Software Engineers + DevOps</td><td style="text-align:left">Software Engineers + ML Engineers + Data Scientists + Analysts + Domain Experts</td></tr><tr><td style="text-align:left"><strong>Fixing Mistakes</strong></td><td style="text-align:left">Rollbacks to a previous version</td><td style="text-align:left">Rollbacks, <strong>plus</strong> quick data filters, staged feature roll-outs, model &quot;gates&quot;</td></tr><tr><td style="text-align:left"><strong>The Toolbox</strong></td><td style="text-align:left">Git, 
CI, CD</td><td style="text-align:left">Git <strong>and</strong> tools like MLflow, Feature Stores, Experiment Trackers</td></tr></tbody></table>
<p>Let&#x27;s understand a few of those key differences:</p>
<ul>
<li><strong>Data is like a whole other codebase you didn&#x27;t write.</strong> Every time your dataset updates, it&#x27;s as if code changes. This adds a huge layer of complexity.</li>
<li><strong>&quot;It works&quot; isn&#x27;t always a yes/no answer.</strong> Because ML deals with probabilities, you need to think in terms of statistical confidence.</li>
<li><strong>Models get old.</strong> Unlike a piece of software that might run unchanged for years, models age. You&#x27;re signing up for a long-term relationship: retraining them, possibly re-labelling data, and constantly re-evaluating their performance.</li>
<li><strong>The stakes are higher when decisions are automated by machines.</strong> Ethical considerations like fairness and avoiding bias, along with privacy and regulatory compliance, become central, not just afterthoughts.</li>
</ul>
<p>To excel in ML engineering, you need strong software skills, careful data management, and the ability to guide learning systems effectively.
It’s a challenging, rapidly evolving field, but getting it right means building powerful, reliable, and responsible AI.</p></article>
      ]]></content:encoded>
    </item>
    <item>
      <title>AI Agents could be true </title>
      <link>https://cleartax.in/ai/posts/ai-agents-user-agents/</link>
      <guid isPermaLink="true">https://cleartax.in/ai/posts/ai-agents-user-agents/</guid>
      <pubDate>Tue, 29 Apr 2025 00:00:00 GMT</pubDate>
      <author>Ankit Solanki</author>
      <description><![CDATA[Could you use AI to build true user agents, truly empower users?]]></description>
      <content:encoded><![CDATA[
<article class="prose dark:prose-invert"><p>In HTTP, browsers are called ‘<a href="https://www.w3.org/WAI/UA/work/wiki/Definition_of_User_Agent" target="_blank" rel="noopener">user agents</a>’ — they were meant to act on
behalf of users. This initial positioning has sort-of carried forward
even now:</p>
<ul>
<li>
<p>Browsers are one of the few pieces of software left that allow plugins
&amp; addons, that allow customisability.</p>
</li>
<li>
<p>Browsers let users <a href="https://support.mozilla.org/en-US/kb/change-fonts-and-colors-websites-use" target="_blank" rel="noopener">override choices</a>. You can set your own min
font size, overriding any site’s CSS. You can use reader modes.</p>
</li>
<li>
<p>Browsers let users inspect applications via devtools, and script
applications by running code in a developer console.</p>
</li>
</ul>
<p>Browsers are one of the last vestiges of an old-school way of building
software: software that is extensible, that users can control and change
to better suit their needs.</p>
<p>Most software we use today is the opposite.</p>
<ul>
<li>
<p>Most software (whether SaaS, mobile apps or desktop apps) is not
programmable.</p>
</li>
<li>
<p>Decisions are taken by product builders, not users. If a product owner
hasn’t added a feature that you want, you can’t go and add it
yourself!</p>
</li>
<li>
<p>Most software doesn’t empower users, it enforces rigid workflows.</p>
</li>
<li>
<p>We have taken away most customisability. Each option in a SaaS product
is expensive to support, and it’s rational as product builders to
reduce surface area and remove options that 99.9% of your users don’t
use.</p>
</li>
</ul>
<p>All of this happens because today’s software users aren’t programmers:
they don’t know the internals of how software and operating systems work.
Most software is built to be used by the mass market and as a result it
encodes a set of assumptions about how it&#x27;s meant to be used.</p>
<p><strong>AI agents could reverse this trend.</strong></p>
<p>I can see a world where users have personal AI agents running on their
behalf. It wouldn’t matter if a product is scriptable or not, or if it
has an API or not — users could just ask the agent to automate a task on
their behalf! The agent may use the API, or it may just resort to
spinning up a browser and clicking on pixels.</p>
<p>What happens when most users are able to finally take control of the
tools that they use daily? I think it&#x27;s finally time for general purpose
compute to be actually accessible by the masses.</p>
<p>This fills me with optimism; similar to the heady days of Web 2.0: when
everyone wasn’t so jaded about empowering users, exposing APIs was the
norm, mashups were possible and products like <a href="https://en.wikipedia.org/wiki/Yahoo_Pipes" target="_blank" rel="noopener">Yahoo Pipes</a> were
built.</p>
<p>This is a perspective to keep in mind as we build AI-first products —
let&#x27;s enable our users to do great things!</p></article>
      ]]></content:encoded>
    </item>
    <item>
      <title>A Beginner</title>
      <link>https://cleartax.in/ai/posts/a-beginners-guide-to-prompt-engineering/</link>
      <guid isPermaLink="true">https://cleartax.in/ai/posts/a-beginners-guide-to-prompt-engineering/</guid>
      <pubDate>Tue, 15 Apr 2025 00:00:00 GMT</pubDate>
      <author>Vinay Shashidhar</author>
      <description><![CDATA[Explains prompt engineering as the skill of crafting clear, specific instructions for AI, using context, constraints, and iterative refinement to achieve better results.]]></description>
      <content:encoded><![CDATA[
<article class="prose dark:prose-invert"><p>A large part of our daily work involves regularly interacting with AI. Whether it is writing code, creating documents, generating images or simply drafting emails, AI has become an integral part of our lives. And if recent trends are any indication, its influence is going to grow even more. So it makes sense to step back and reflect on whether we are unlocking its full potential.</p>
<p>We have all faced instances where we ask for something simple and get a confusing, generic, or just plain <em>wrong</em> answer. The good news is, it&#x27;s often not the AI&#x27;s fault – it&#x27;s how we&#x27;re asking. Welcome to the world of <strong>Prompt Engineering</strong> – the art and science of crafting instructions that help AI understand <em>exactly</em> what you want. Think of it like giving directions. Vague directions lead to getting lost. Clear, specific directions get you to your destination efficiently.</p>
<p>We will be referring to the concepts mentioned in the article <a href="https://www.kaggle.com/whitepaper-prompt-engineering" target="_blank" rel="noopener">here</a>. For additional insights, explore the original source materials on this topic.</p>
<p><strong>What is Prompt Engineering Anyway?</strong></p>
<p>Think of AI like a smart prediction machine. It guesses the next word based on what you give it. A &quot;<em><strong>prompt</strong></em>&quot; is simply your instruction. Prompt engineering is the skill of writing good instructions to get the best possible results from the AI. At its core, it&#x27;s about communicating clearly with AI systems. Well-crafted prompts can save time and produce more useful results.</p>
<p><strong>What Actually Goes Into a Prompt?</strong></p>
<p>A prompt can have several parts working together.</p>
<p><img alt="" loading="lazy" width="0" height="0" decoding="async" data-nimg="1" style="color:transparent;width:100%;height:auto" src="https://assets1.cleartax-cdn.com/content-prod/images/prompt-engineering.png"/></p>
<ul>
<li><strong>The Instruction/Task:</strong> What do you want the AI to <em>do</em>? (e.g., &quot;Summarize,&quot; &quot;Translate,&quot; &quot;Write,&quot; &quot;Explain,&quot; &quot;Generate code&quot;). This is the core action.</li>
<li><strong>Input Data:</strong> The specific text, data, or topic the AI needs to work on (e.g., the article to summarize, the sentence to translate).</li>
<li><strong>Output Indicator/Format:</strong> How do you want the answer presented? (e.g., &quot;in bullet points,&quot; &quot;in a friendly tone,&quot; &quot;limit to 100 words&quot;).</li>
<li><strong>Context Setting and Specificity:</strong> Giving the AI background helps it understand the &#x27;why&#x27; and &#x27;who&#x27; behind your request. <strong>For example:</strong> Instead of saying &quot;<em>Write an email about the project update</em>,&quot; try giving more specifics: &quot;<em>Write a short, professional email to the marketing team summarizing the key results from the Q3 social media campaign project update. Mention the 15% increase in engagement.</em>&quot;</li>
</ul>
<p>You don&#x27;t always need all parts, but combining them thoughtfully makes your prompts much more powerful.</p>
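<p>The parts listed above can be assembled programmatically. Here is a minimal Python sketch; the function name <code>build_prompt</code> and the section labels are hypothetical choices for this example, not a standard API. It reuses the email example from the list and simply skips any part that is left empty.</p>

```python
def build_prompt(instruction, input_data="", output_format="", context=""):
    """Join the non-empty prompt parts into one prompt, one part per paragraph."""
    parts = [
        ("Context", context),
        ("Task", instruction),
        ("Input", input_data),
        ("Output format", output_format),
    ]
    return "\n\n".join(f"{label}: {text}" for label, text in parts if text)


prompt = build_prompt(
    instruction=(
        "Write an email summarizing the key results from the "
        "Q3 social media campaign project update."
    ),
    context="The audience is the marketing team. Engagement increased by 15%.",
    output_format="Short and professional.",
)
print(prompt)
```

<p>Keeping the parts separate like this makes it easier to change one of them, such as the output format, without rewriting the whole prompt.</p>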
<h3 id="object-object"><strong>Setting the Rules: Using Constraints Effectively</strong></h3>
<p>Constraints help focus the AI&#x27;s response in specific directions, much as filters narrow down search results:</p>
<ul>
<li><strong>Length:</strong> &quot;Summarize this in 50 words&quot;</li>
<li><strong>Format:</strong> &quot;List the ideas as bullet points&quot;</li>
<li><strong>Tone/Style:</strong> &quot;Explain this in a formal tone&quot;</li>
<li><strong>Negative Constraints:</strong> &quot;Don&#x27;t use technical jargon,&quot; &quot;Avoid mentioning price&quot;</li>
<li><strong>Real-world example:</strong>
<ul>
<li><em>Less effective:</em> &quot;Suggest team-building ideas.&quot;</li>
<li><em>More effective:</em> &quot;Suggest 3 low-cost team-building activity ideas for a remote team of 10 software engineers. Focus on activities that encourage collaboration and take about 1 hour. Present them as a numbered list with a brief description for each. Avoid virtual escape rooms.&quot;</li>
</ul>
</li>
</ul>
<h3 id="object-object"><strong>The Power of Iterative Refinement</strong></h3>
<p>Prompt engineering is often a process of trial, error, and refinement – think of it as a conversation rather than a one-time command:</p>
<ol>
<li><strong>Start Simple:</strong> Write your initial prompt based on your goal.</li>
<li><strong>Analyze the Output:</strong> Did the AI understand? Was the result accurate? Was the format correct?</li>
<li><strong>Identify Gaps:</strong> What was missing or wrong? Was the prompt too vague? Did it lack context? Did you forget a constraint?</li>
<li><strong>Refine the Prompt:</strong> Adjust your prompt based on your analysis. Add specificity, context, constraints, or examples.</li>
<li><strong>Repeat:</strong> Try the new prompt and continue refining until you get the desired result.</li>
</ol>
<h3 id="object-object"><strong>Prompting Frameworks</strong></h3>
<p>For those of us who want to delve deeper into formal prompting techniques, here are several powerful approaches with practical examples:</p>
<ol>
<li>
<p><strong>Basic Prompt (Zero-Shot)</strong> – This is the simplest approach with no examples:<br/>
<em>&quot;Summarize this meeting transcript and list the action items&quot;</em></p>
</li>
<li>
<p><strong>Adding Specificity and Constraints</strong> – Adding details about what you want:<br/>
&quot;<em>Summarize the key decisions made in the following meeting transcript in 3-4 bullet points. Then, list all action items mentioned using the format: &quot;- [Action Item]: [Owner] - Due: [Deadline]&quot;</em>&quot;</p>
</li>
<li>
<p><strong>Role Prompting</strong> – Asking the AI to adopt a specific perspective:<br/>
&quot;<em>Act as an efficient Project Manager reviewing a meeting transcript. Your goal is to quickly understand outcomes and track tasks. First, summarize the key decisions made in the following meeting transcript in 3-4 concise bullet points. Second, rigorously extract all action items assigned. List them using the format: &quot;- [Action Item]: [Owner] - Due: [Deadline]&quot;. If an owner or deadline isn&#x27;t explicitly mentioned, use &#x27;TBD&#x27;.</em>&quot;</p>
</li>
<li>
<p><strong>Chain of Thought</strong> – Guiding the AI through a step-by-step reasoning process:<br/>
<em>&quot;Act as an efficient Project Manager reviewing a meeting transcript. Your goal is to quickly understand outcomes and track tasks. Before generating the final summary and action list, follow these steps:</em></p>
<ol>
<li><em>Read through the transcript to identify the main topics discussed.</em></li>
<li><em>Pinpoint the key decisions reached for each topic.</em></li>
<li><em>Scan the transcript specifically for commitments or tasks assigned (action items). Note down the task, who is assigned (Owner), and any mentioned deadline.</em></li>
<li><em>Consolidate the key decisions into a 3-4 bullet point summary.</em></li>
<li><em>Format the extracted action items clearly using: &quot;- [Action Item]: [Owner] - Due: [Deadline]&quot; (Use &#x27;TBD&#x27; if owner/deadline is unclear).</em></li>
</ol>
<p><em>Now, provide the final output based on this process.</em>&quot;</p>
</li>
<li>
<p><strong>Reflections</strong> – Adding a layer of self-evaluation to the AI&#x27;s process:<br/>
&quot;<em>After extracting the action items using the Chain of Thought (CoT) steps, re-read the summary of key decisions. Does each action item clearly map to a decision or discussion point? If not, double-check the transcript for that action item&#x27;s context. Also, generate two versions of the action item list based on slightly different interpretations of the transcript, and present the most likely version based on the discussion flow.</em>&quot;</p>
</li>
</ol>
<h3 id="object-object"><strong>Wrapping Up: The Art of Clear Communication</strong></h3>
<p>Prompt engineering is fundamentally about clear communication with AI systems. Specific instructions, relevant context, occasional examples, and format guidelines typically lead to better results.<br/>
Many users find that an iterative approach works well: start with simpler prompts and gradually refine them based on results.</p>
<p><strong>Helpful Shortcut:</strong> Consider asking the AI itself to help craft effective prompts. For example, you might type &quot;What would be a good prompt to generate a marketing email about our new product launch?&quot; The AI can often suggest well-structured prompts tailored to your specific needs.</p></article>
      ]]></content:encoded>
    </item>
    <item>
      <title>Evals are your IP</title>
      <link>https://cleartax.in/ai/posts/evals-are-your-ip/</link>
      <guid isPermaLink="true">https://cleartax.in/ai/posts/evals-are-your-ip/</guid>
      <pubDate>Fri, 11 Apr 2025 00:00:00 GMT</pubDate>
      <author>Satwik Gokina</author>
      <description><![CDATA[Why Writing Evals is the Highest-Leverage Action for AI Developers]]></description>
      <content:encoded><![CDATA[
<article class="prose dark:prose-invert"><p><em>&quot;Evals are your IP&quot;</em> — <a href="https://www.youtube.com/live/L89GzWEILkM?t=22058s" target="_blank" rel="noopener" title="Anthropic presentation @ Agents at work 2025">Alexander Bricken, Anthropic</a></p>
<p><em>&quot;We will achieve every evaluation we can state.&quot;</em> — <a href="https://zhengdongwang.com/2024/12/29/2024-letter.html" target="_blank" rel="noopener" title="2024 letter - Zhengdong Wang personal blog">Zhengdong Wang, DeepMind</a></p>
<p><em>“The boring yet crucial secret behind good system prompts is test-driven development. You don&#x27;t write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.”</em> — <a href="https://x.com/AmandaAskell/status/1866207266761760812" target="_blank" rel="noopener" title="Amanda Askell&#x27;s Tweet">Amanda Askell, Anthropic</a></p>
<p>The last quote is by the person who writes Claude’s system prompts. Why are state-of-the-art researchers highlighting the age-old concept of test-driven development (TDD)? Haven’t we moved on to trendier topics like Agents and AGI?</p>
<p>The truth is that test-driven development matters even more for AI systems than for traditional software. When applied to AI systems, TDD goes by the name Evaluations (or Evals), and this isn&#x27;t merely rebranding for hype: evals are subtly different from tests.</p>
<h2 id="what-are-evals">What are Evals?</h2>
<p>Any assessment of the performance of an AI system is called an Eval.</p>
<table><thead><tr><th>Input</th><th>Expected output/ behavior</th></tr></thead><tbody><tr><td>1 + 1</td><td>2</td></tr><tr><td>How to compare the old vs new regime?</td><td>Answer should include this link: [link to the old vs new regime widget]</td></tr><tr><td>Hi</td><td>Any warm greeting of the user by name</td></tr><tr><td>What is 80C?</td><td>Should have the same information as this gold standard answer: “80C is an income tax section …”</td></tr><tr><td>My PAN number is AAPA0001</td><td>The system should use the <code>update_context</code> tool with argument ‘AAPA0001’</td></tr></tbody></table>
<p>Only the first two resemble traditional unit tests. The next two evals require subjective interpretation. The final Eval addresses the system&#x27;s behavior rather than its output.</p>
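<p>One way to keep such a table in code is as a list of records. This is a sketch under assumptions: the <code>EvalCase</code> class and the <code>check</code> labels (<code>exact_match</code>, <code>contains_link</code>, <code>llm_judge</code>, <code>tool_call</code>) are hypothetical names chosen for illustration, not part of any particular framework.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    prompt: str    # what the user sends
    expected: str  # expected output or behavior, in words
    check: str     # how the expectation is verified (hypothetical labels)


EVAL_CASES = [
    EvalCase("1 + 1", "2", "exact_match"),
    EvalCase("How to compare the old vs new regime?",
             "Includes the link to the old vs new regime widget",
             "contains_link"),
    EvalCase("Hi", "Any warm greeting of the user by name", "llm_judge"),
    EvalCase("What is 80C?",
             "Same information as the gold standard answer", "llm_judge"),
    EvalCase("My PAN number is AAPA0001",
             "Calls update_context with argument 'AAPA0001'", "tool_call"),
]

# Group prompts by how they are checked, to see which need a judge model.
by_check = {}
for case in EVAL_CASES:
    by_check.setdefault(case.check, []).append(case.prompt)
print(by_check)
```

<p>Grouping by check type makes the split visible: two cases can be verified mechanically, two need subjective judgment, and one inspects behavior.</p>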
<h2 id="how-do-we-eval-the-evals">How do we eval the Evals?</h2>
<p>The ways to execute evals can vary significantly.</p>
<p>The simplest form involves basic operations like comparisons (<code>==</code>, <code>!=</code>, <code>&lt;</code>, <code>&gt;</code>), regex matching, or even inspecting and asserting on function calls. These resemble traditional unit tests.</p>
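<p>A minimal sketch of these deterministic checks follows. The helper names, the example URL, and the shape of the recorded tool calls (a list of name and argument pairs) are assumptions for illustration only.</p>

```python
import re


def check_exact(output: str, expected: str) -> bool:
    """Pass if the output equals the expected string, ignoring edge whitespace."""
    return output.strip() == expected


def check_regex(output: str, pattern: str) -> bool:
    """Pass if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def check_tool_call(calls: list, name: str, argument: str) -> bool:
    """Pass if the recorded calls include (name, argument).

    `calls` is assumed to be a list of (tool_name, argument) pairs
    captured while the system handled the input.
    """
    return (name, argument) in calls


# Hypothetical outputs captured from a system under test.
print(check_exact("2\n", "2"))
print(check_regex("Compare here: https://example.com/regime-widget",
                  r"https://\S+/regime-widget"))
print(check_tool_call([("update_context", "AAPA0001")],
                      "update_context", "AAPA0001"))
```

<p>Each helper returns a plain boolean, so the results can be counted and aggregated the same way regardless of which check produced them.</p>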
<p>Another source of evals is customer feedback: has the customer marked the conversation as helpful, or given it a thumbs-up or thumbs-down?</p>
<p>A newer approach, called <strong>LLM as a judge</strong>, leverages LLMs for evals requiring subjective interpretation.</p>
<ul>
<li>
<p>An LLM can judge whether the greeting is “warm and friendly.”</p>
</li>
<li>
<p>An LLM can compare the provided answer to a gold-standard answer and rate them on completeness, tone, verbosity, etc.</p>
</li>
</ul>
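<p>The judging step can be sketched as a function that formats a grading prompt and parses a one-word verdict. Everything here is an assumption made for illustration: the template wording, the <code>judge</code> signature, and especially <code>fake_llm</code>, which is a keyword-matching stand-in so the sketch runs offline. A real setup would pass a function that calls an actual model.</p>

```python
from typing import Callable

JUDGE_TEMPLATE = (
    "You are grading a chatbot reply.\n"
    "Criterion: {criterion}\n"
    "Reply: {reply}\n"
    "Answer with exactly one word: PASS or FAIL."
)


def judge(reply: str, criterion: str, call_llm: Callable[[str], str]) -> bool:
    """Ask a judge model whether `reply` meets `criterion`.

    `call_llm` is injected so any provider (or a test double) can be used.
    """
    verdict = call_llm(JUDGE_TEMPLATE.format(criterion=criterion, reply=reply))
    return verdict.strip().upper().startswith("PASS")


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call: passes replies containing "welcome".
    reply_line = next(
        line for line in prompt.splitlines() if line.startswith("Reply: ")
    )
    return "PASS" if "welcome" in reply_line.lower() else "FAIL"


criterion = "The greeting is warm and friendly"
print(judge("Hi Asha, welcome back!", criterion, fake_llm))
print(judge("State your query.", criterion, fake_llm))
```

<p>Forcing the judge to answer with a single word keeps parsing trivial. Rating several dimensions such as completeness, tone, and verbosity would need a richer output format.</p>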
<h2 id="evaluating-probabilistic-systems">Evaluating probabilistic systems</h2>
<p>Traditional unit tests are pass or fail, and we do not deploy code if even one test fails. AI system outputs, however, are probabilistic, and some of the evaluation methods described earlier, such as LLM as a judge, are not deterministic either.</p>
<p>So instead of a single pass or fail, we run each eval multiple times and track the success rate as a percentage. As a rule of thumb, 30 runs give a reasonably stable estimate, though 10 is also acceptable.</p>
<p><strong>Summary:</strong> Traditional unit tests report pass or fail; evals for AI systems report a success rate (%).</p>
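<p>Tracking a success rate can be sketched in a few lines. The function name <code>success_rate</code> and the simulated <code>flaky_eval</code> (which passes roughly 90% of the time) are assumptions for illustration; in practice <code>run_once</code> would call the AI system and apply a check to its output.</p>

```python
import random


def success_rate(run_once, runs: int = 30) -> float:
    """Run a pass/fail eval `runs` times and return the fraction that passed."""
    passes = sum(1 for _ in range(runs) if run_once())
    return passes / runs


# Simulated non-deterministic eval: passes roughly 90% of the time.
rng = random.Random(0)


def flaky_eval() -> bool:
    return rng.random() > 0.1


rate = success_rate(flaky_eval, runs=30)
print(f"success rate: {rate:.0%}")
```

<p>The number of runs sets the resolution of the result: with 10 runs the rate moves in steps of 10 percentage points, while 30 runs give steps of about 3.3 points.</p>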
<h2 id="how-good-should-these-evals-be">How good should these evals be?</h2>
<p>A common reaction to seeing initial evals is that they are simple and don&#x27;t capture the system&#x27;s complexity.</p>
<p>However, evals only need to be meaningful, quick, and inexpensive. We encourage writing many evals that test diverse aspects of the system.</p>
<p>With a diverse set of evals, a break in core functionality will likely surface as failures in the evals for lower-level functionality too. This fundamental principle of unit testing applies here as well.</p>
<h2 id="why">Why?</h2>
<p>Why emphasize evals so much? Here is what we gain:</p>
<h3 id="testing-new-models">Testing new models</h3>
<p>Tomorrow, if a new model called Gehri Sooch™ launches, all we need to do is point our LLM API calls at the new model and rerun the evals to produce a report.</p>
<p>We benefited from this last consumer season. Because we had evals in place, we confidently switched from GPT-4o to GPT-4o-mini in the season&#x27;s last week, reducing costs by 4x.</p>
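<p>A model swap like this amounts to running the same eval cases against each candidate and comparing scores. The sketch below assumes a model is any function from prompt to text. The names <code>eval_report</code>, <code>current_model</code>, and <code>candidate_model</code> are hypothetical, and the two stand-in models return canned strings instead of calling a real API.</p>

```python
def eval_report(models: dict, cases: list, runs: int = 10) -> dict:
    """Return {model_name: success rate} across all cases and runs.

    `cases` is a list of (prompt, check) pairs, where `check` maps the
    model output to True or False.
    """
    report = {}
    for name, model in models.items():
        passed = 0
        for prompt, check in cases:
            passed += sum(1 for _ in range(runs) if check(model(prompt)))
        report[name] = passed / (len(cases) * runs)
    return report


# Hypothetical stand-ins for two models behind the same interface.
def current_model(prompt: str) -> str:
    return "2" if prompt == "1 + 1" else "Hello!"


def candidate_model(prompt: str) -> str:
    return "2" if prompt == "1 + 1" else "..."


cases = [
    ("1 + 1", lambda out: out == "2"),
    ("Hi", lambda out: "hello" in out.lower()),
]
report = eval_report(
    {"current": current_model, "candidate": candidate_model}, cases
)
print(report)
```

<p>Here the candidate fails the greeting case on every run, so the report would argue against switching. A cheaper model that matches the current score is the case where a switch can be made with confidence.</p>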
<h3 id="faster-iteration">Faster iteration</h3>
<p>Evals give us confidence to iterate rapidly. Major reworks and the integration of new capabilities become manageable as long as our evals are trustworthy. Faster iteration drives quicker product success.</p>
<h3 id="continuous-learning">Continuous learning</h3>
<p>Whenever a failure occurs in an AI system, translate that failure state into a set of evals. From then on, the failure is tracked on every run, and any regression shows up immediately.</p>
<h3 id="evals-as-your-ip">Evals as your IP</h3>
<p>Imagine a project running for a year—you likely have hundreds or thousands of evals by now. All failure states have been identified and translated into evals.</p>
<p>In the AI era, code is cheap—a tax Q&amp;A bot can be built by an engineer in just an hour. What is valuable is understanding when and how AI fails. Those evals inform your system design.</p>
<p>In the future, imagine an LLM agent iteratively rewriting itself until the eval score surpasses a defined threshold. Thus, your system can be derived solely from evals. This is why evals are your IP.</p></article>
      ]]></content:encoded>
    </item>
  </channel>
</rss>