Evaluator-Optimizer in Claude Code: From Pattern to Skill

How I turned an Agentic AI pattern into a reusable Claude Code skill with evidence-based evaluation.

In December 2024, Anthropic published Building Effective Agents — a guide defining five workflow patterns for LLM-based systems: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer.

If you use Claude Code, you're already inside several of these patterns. Every time Claude delegates a search to an Explore sub-agent or distributes tasks across parallel workers, you're running an orchestrator-workers pattern. Claude Code provides six extension points — skills, hooks, sub-agents, MCP, plugins, agent teams — each designed for different scenarios. The question is which one fits each pattern.

I've been studying and applying these patterns as part of my professional practice in Agentic AI. Most of them map naturally to Claude Code's architecture. But one didn't fit: the evaluator-optimizer.

The pattern that always catches something

Anthropic's definition is straightforward:

"One LLM call generates a response while another provides evaluation and feedback in a loop."

One LLM generates, another evaluates and provides feedback. In a loop.

The problem is that conversational tools like Claude Code don't have that automatic loop. There's no second LLM evaluating the first one's output. There's a conversation. And in that conversation, the only loop is you.
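The loop Anthropic describes can be sketched in a few lines. This is a minimal illustration, not the skill itself: `generate` and `evaluate` are hypothetical stand-ins for two separate LLM calls.

```python
def generate(task, feedback=None):
    # Hypothetical stand-in for the generator LLM call.
    # A real implementation would call a model API here,
    # folding any feedback into the prompt.
    draft = f"answer to {task!r}"
    if feedback:
        draft += " (revised)"
    return draft

def evaluate(draft):
    # Hypothetical stand-in for the evaluator LLM call.
    # Returns (accepted, feedback).
    if "(revised)" in draft:
        return True, None
    return False, "cite sources for every claim"

def evaluator_optimizer(task, max_rounds=3):
    # The pattern: generate, evaluate, feed criticism back, repeat.
    feedback = None
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        accepted, feedback = evaluate(draft)
        if accepted:
            return draft
    return draft  # best effort after max_rounds
```

In Claude Code there is no `evaluator_optimizer` driver running this loop for you; the sections below are about recreating it deliberately.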

So I started using it manually. After an extensive planning session, or after exploring a codebase and producing a report, or after implementing a complex feature, I'd ask Claude to evaluate its own output. Every claim, every decision, checked against reality: source code, official documentation, configuration files, verifiable facts.

The result was consistent: it always found something. Not critical errors — nuances. A partially correct assumption. An API reference based on the model's memory instead of the actual code. A decision that held up but whose justification wasn't precise.

For medium to large tasks, the pattern proved its value empirically. So I decided to formalize it.

The implementation question: skill or sub-agent

Claude Code offers two main primitives for packaging reusable behavior: skills and sub-agents. The choice between them isn't aesthetic — it depends on a specific technical factor.

If you're not familiar with skills, the Claude Code skills guide covers the full anatomy. For this decision, what matters is where each one runs.

An inline skill (without context: fork) runs inside the main conversation. It sees everything: the plan, the report, the code changes. The official documentation confirms:

"This content runs inline so Claude can use it alongside your conversation context."

A sub-agent runs in its own isolated context:

"Each subagent runs in its own context window with a custom system prompt, specific tool access, and independent permissions."

A skill with context: fork also loses access:

"It won't have access to your conversation history."

For the evaluator-optimizer pattern, this settles it. The evaluator needs to see what was just produced. Without conversation context, it would have to rediscover everything from scratch — losing exactly what makes the pattern valuable.

```mermaid
flowchart TD
    A["Does the evaluator need to see\nthe previous output?"] -->|Yes| B["Inline skill"]
    A -->|No| C["Needs isolation?"]
    C -->|Yes| D["Sub-agent"]
    C -->|No| E["Skill with context: fork"]
```

The answer was clear: an inline skill. Not a sub-agent. Not a forked skill. The official Claude Code documentation supports this directly:

"Consider Skills instead when you want reusable prompts or workflows that run in the main conversation context rather than isolated subagent context."

Building the skill

The complete SKILL.md frontmatter:

```yaml
---
name: evaluate
description: Evaluator-Optimizer pattern. Critically evaluates every claim,
  decision, and assertion from the previous output against verifiable evidence
  from code, documentation, configuration, and other sources.
disable-model-invocation: true
argument-hint: [passes]
allowed-tools: Read, Grep, Glob, Bash, WebFetch, WebSearch
---
```

Design decisions that matter:

disable-model-invocation: true

The user controls when to evaluate, not Claude. You don't want the evaluator firing automatically after every response — only when the output warrants it. With this flag, the skill only responds to /evaluate.

allowed-tools

The evaluator needs to search and read to verify claims. Read, Grep, Glob for local code. Bash for verification commands. WebFetch and WebSearch for external documentation. For a deeper look at how this field works, see allowed-tools in skills.

$ARGUMENTS for multiple passes

Skills support arguments via $ARGUMENTS. /evaluate 2 runs two passes; /evaluate with no arguments runs one. There are no native default values — the default is handled in the skill instructions.
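Since defaults aren't native, the skill body has to spell them out. A sketch of how that instruction might read — the wording here is illustrative, not quoted from the actual skill:

```markdown
Run $ARGUMENTS evaluation passes. If $ARGUMENTS is empty or not a
positive number, run exactly one pass. In each pass after the first,
evaluate the previous pass's findings rather than the original output.
```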

The verdict system

Every claim from the previous output is classified:

| Verdict | Meaning |
| --- | --- |
| VERIFIED | Direct evidence supports it |
| PARTIALLY CORRECT | Core idea holds but details are wrong |
| UNVERIFIED | No evidence found to confirm or deny |
| INCORRECT | Evidence directly contradicts it |
| OUTDATED | Was true at some point but current state differs |

The key rule: every verdict must cite a specific source — file path with line number, URL, command output. "I believe" is not evidence. If there's no source, the verdict is UNVERIFIED.
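The rule is mechanical enough to express in code. A minimal sketch of the idea — the names `Finding` and `validate` are mine, not part of the skill:

```python
from dataclasses import dataclass
from typing import Optional

VERDICTS = {"VERIFIED", "PARTIALLY CORRECT", "UNVERIFIED",
            "INCORRECT", "OUTDATED"}

@dataclass
class Finding:
    claim: str
    verdict: str
    source: Optional[str] = None  # file:line, URL, or command output

def validate(finding: Finding) -> Finding:
    # Enforce the key rule: a verdict without a citable source
    # is downgraded to UNVERIFIED. "I believe" is not evidence.
    if finding.verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {finding.verdict}")
    if finding.verdict != "UNVERIFIED" and not finding.source:
        return Finding(finding.claim, "UNVERIFIED")
    return finding
```

The downgrade-by-default design is the point: the burden of proof sits on the evaluator, never on the reader.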

Multiple passes

Pass 1 evaluates the original output. Pass 2 evaluates the evaluation itself: was the evidence read correctly? Were claims missed? Was verification superficial? It's the pattern applied to itself.

This isn't an automatic loop — it's the same principle applied with intention. For deterministic automation (pre-commit checks, linting), hooks are the right tool. For judgment-based evaluation, a skill.

The meta moment: evaluating the evaluator

I ran /evaluate on the very analysis that produced the skill — the skill vs sub-agent comparison, the claims about arguments, the final recommendation.

The result: 21 claims identified. 18 verified. 3 partially correct.

The interesting findings:

| Claim | Verdict | Issue |
| --- | --- | --- |
| "Sub-agent receives a summary of the conversation" | PARTIALLY CORRECT | It's not a "summary" — it's a delegation message Claude actively constructs. The distinction matters: Claude chooses what to include |
| WebSearch in allowed-tools is valid | PARTIALLY CORRECT | The tool exists and works, but doesn't appear in any official documentation examples — it works without documented backing |
| Forked skills don't support natural multiple passes | PARTIALLY CORRECT | Each invocation is isolated, but results return to the main conversation — the next invocation could receive previous findings via delegation message |

No claim was INCORRECT. But three assumptions that felt solid had nuances I wouldn't have caught without the pattern. That's the point: it doesn't look for critical errors — it finds the cracks that open under pressure.

What's next

The evaluator-optimizer was the hardest pattern to place because it doesn't map to a Claude Code feature — it maps to a practice. It's not a hook that runs automatically or a sub-agent Claude invokes on its own. It's a deliberate decision to stop and verify.

This is the first of several agentic patterns I'm turning into Claude Code workflows. The others — prompt chaining, orchestrator-workers — map more directly to the native architecture. But the evaluator-optimizer required lateral thinking: understanding that the best place for an evaluator isn't an isolated process, but the center of the conversation.

If you're building serious workflows with Claude Code — the Claude Code professional guide covers the full ecosystem — these patterns are what take you from ad-hoc prompts to repeatable, verifiable processes.

Reference: Building Effective Agents — Anthropic | Claude Code Skills — Official Documentation
