The Evaluator-Optimizer is one of five workflow patterns Anthropic defines in Building Effective Agents: one LLM generates, another evaluates and provides feedback in a loop. Claude Code doesn't have that automatic loop — but you can build it as an inline skill invoked on demand.
The pattern's core: every claim from the previous output is checked against real evidence — source code, official documentation, configuration files. Not opinions. Verifiable facts.
Result:
```shell
/evaluate     # 1 evaluation pass
/evaluate 2   # 2 passes (the second evaluates the evaluation itself)
```
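The multi-pass behavior can be sketched as a simple fold: each pass evaluates the output of the previous one. The `evaluate` function below is a stub standing in for the real LLM evaluation pass; only the looping logic mirrors the skill.

```python
def evaluate(text: str) -> str:
    """Stub evaluator. In the real skill, Claude verifies every claim in
    `text` against code, docs, and config. Here we just wrap the input so
    the pass structure is visible."""
    return f"evaluation of ({text})"


def run_passes(output: str, passes: int = 1) -> str:
    """Mirror of `/evaluate [passes]`: each pass evaluates the previous
    result, so with passes=2 the second pass evaluates the evaluation
    itself."""
    result = output
    for _ in range(passes):
        result = evaluate(result)
    return result
```

With `passes=2`, `run_passes("draft", 2)` nests the stub twice, which is exactly why the second pass of `/evaluate 2` is described as evaluating the evaluation.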
## Why a skill, not a sub-agent
The evaluator needs to see what was just produced. A sub-agent, or a skill with `context: fork`, loses the conversation context — it would have to rediscover everything from scratch.
```mermaid
flowchart TD
    A["Does the evaluator need to see\nthe previous output?"] -->|Yes| B["Inline skill"]
    A -->|No| C["Needs isolation?"]
    C -->|Yes| D["Sub-agent"]
    C -->|No| E["Skill with context: fork"]
```
The official documentation backs this: "Consider Skills instead when you want reusable prompts or workflows that run in the main conversation context rather than isolated subagent context."
## The skill
```yaml
---
name: evaluate
description: Evaluator-Optimizer pattern. Critically evaluates every claim,
  decision, and assertion from the previous output against verifiable evidence
  from code, documentation, configuration, and other sources.
disable-model-invocation: true
argument-hint: [passes]
allowed-tools: Read, Grep, Glob, Bash, WebFetch, WebSearch
---
```
- `disable-model-invocation: true` — you decide when to evaluate, not Claude
- `allowed-tools` — the evaluator needs to read and search to verify claims
- `$ARGUMENTS` — `/evaluate 2` runs two passes; no argument runs one
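The source doesn't show the skill's body, but a body wired to these frontmatter fields could look roughly like this — an illustrative sketch, not the published skill:

```markdown
Run $ARGUMENTS evaluation passes (default: 1). In each pass:

1. List every claim, decision, and assertion in the previous output.
2. Verify each one against real evidence — Read/Grep the code, check
   configuration, WebFetch the official docs.
3. Assign a verdict and cite a specific source for it.
4. If another pass remains, evaluate the evaluation you just produced.
```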
## The verdict system
| Verdict | Meaning |
|---|---|
| VERIFIED | Direct evidence supports it |
| PARTIALLY CORRECT | Core idea holds but details are wrong |
| UNVERIFIED | No evidence found to confirm or deny |
| INCORRECT | Evidence directly contradicts it |
| OUTDATED | Was true at some point but current state differs |
Rule: every verdict cites a specific source — file path with line number, URL, command output. "I believe" is not evidence. No source = UNVERIFIED.
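The "no source = UNVERIFIED" rule is mechanical enough to encode. A minimal sketch, assuming a hypothetical `Claim` record (the names are illustrative, not part of the skill):

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    VERIFIED = "VERIFIED"
    PARTIALLY_CORRECT = "PARTIALLY CORRECT"
    UNVERIFIED = "UNVERIFIED"
    INCORRECT = "INCORRECT"
    OUTDATED = "OUTDATED"


@dataclass
class Claim:
    text: str
    verdict: Verdict
    source: str = ""  # file path with line number, URL, or command output

    def __post_init__(self):
        # The rule: without a citable source, any verdict collapses
        # to UNVERIFIED. "I believe" is not evidence.
        if not self.source and self.verdict is not Verdict.UNVERIFIED:
            self.verdict = Verdict.UNVERIFIED
```

A `Claim("X uses Y", Verdict.VERIFIED)` with no source is downgraded to `UNVERIFIED` on construction; add `source="src/app.py:42"` and the verdict stands.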
## In practice
I ran /evaluate on the very analysis that produced the skill. Result: 21 claims, 18 verified, 3 partially correct. None incorrect — but three solid assumptions had nuances I wouldn't have caught without the pattern. The full article covers the process end to end: Evaluator-Optimizer in Claude Code: From Pattern to Skill.
Official docs: Skills · Building Effective Agents — Anthropic