CodingAgent Architecture & Design

A deep dive into the internal architecture of CodingAgent — a terminal-based agentic coding assistant, covering the core loop, tool system, parallel execution, caching, sub-agents, context management, and more.

Design document for the CodingAgent codebase. Last updated: 2026-02-25.


1. Overview

CodingAgent is a terminal-based agentic coding assistant. It runs a recursive loop: call the Claude API → parse tool-use blocks → execute tools → feed results back → repeat until the model stops emitting tool calls.

Key properties:

  • Single-file ESM TypeScript project (no build step required via tsx)
  • 9 built-in tools (Read, Write, Edit, Glob, Grep, Bash, Task, WebFetch, WebSearch)
  • Parallel execution of read-only tools, barrier semantics for write tools
  • Multi-level sub-agent spawning with context isolation
  • Two-tier context compaction (micro + auto)
  • Smart exploration cache with file-level invalidation
  • Session persistence and resume
  • Multi-judge eval gate for verifying task completion (opt-in via --eval)

2. Project Structure

codingagent/
├── package.json                        # ESM TypeScript, Anthropic SDK + glob + diff deps
├── tsconfig.json                       # ES2022, NodeNext module, strict mode
├── src/
│   ├── index.ts                        # Entry point: CLI arg parsing, REPL, command handlers
│   ├── ansi-diff.d.ts                  # Type declarations for ansi-diff module
│   │
│   ├── core/                           # Core agent engine
│   │   ├── types.ts                    # Core type definitions (Message, Tool, ToolContext, etc.)
│   │   ├── client.ts                   # Anthropic SDK client singleton
│   │   ├── context.ts                  # ToolContext creation + sub-agent context cloning
│   │   ├── loop.ts                     # Core agentic loop (streaming API + tool execution)
│   │   ├── agent.ts                    # Sub-agent spawning (Explore, Plan, Bash, general-purpose)
│   │   ├── compaction.ts               # Token estimation, context compaction, micro-compaction
│   │   ├── streaming-executor.ts       # Parallel tool execution engine with concurrency control
│   │   └── mcp-client.ts              # MCP (Model Context Protocol) server connections
│   │
│   ├── config/                         # Configuration & skill loading
│   │   ├── config.ts                   # Config loading from env + ~/.claude/settings.json
│   │   └── skills.ts                   # Skill/instruction file loading (Claude, Copilot, etc.)
│   │
│   ├── tools/                          # Built-in tool implementations
│   │   ├── index.ts                    # Tool registry (10 tools)
│   │   ├── read.ts                     # File read with line numbers, offset/limit, image support
│   │   ├── write.ts                    # Atomic file writes with directory creation
│   │   ├── edit.ts                     # Exact string replacement with diff preview
│   │   ├── glob.ts                     # File pattern matching (sorted by mtime)
│   │   ├── grep.ts                     # ripgrep-based search (regex, file types, context)
│   │   ├── bash.ts                     # Shell command execution with timeout + env sanitization
│   │   ├── task.ts                     # Sub-agent spawning tool
│   │   ├── web.ts                      # WebFetch + WebSearch
│   │   ├── open.ts                     # Open files/URLs in native apps
│   │   ├── validate.ts                 # Input validation helpers
│   │   └── fs-utils.ts                 # Atomic file replacement utility
│   │
│   ├── ports/                          # Transport I/O abstractions
│   │   ├── io-port.ts                  # IOPort interface (InputPort + OutputPort)
│   │   ├── io-port.test.ts             # IOPort tests
│   │   ├── terminal-port.ts            # Terminal stdin/stdout IOPort adapter
│   │   ├── telegram-port.ts            # Telegram Bot API IOPort adapter
│   │   └── teams-port.ts              # Microsoft Teams IOPort adapter
│   │
│   ├── session/                        # Session persistence & runner
│   │   ├── session.ts                  # Save/load sessions to ~/.codingagent/sessions/
│   │   └── session-runner.ts           # Transport-agnostic session loop driver
│   │
│   ├── eval/                           # Work quality evaluation
│   │   └── eval.ts                     # Multi-judge eval gate (correctness, completeness, goal)
│   │
│   ├── ui/                             # Terminal UI & REPL
│   │   ├── ui.ts                       # Colors, spinners, formatting, OutputManager
│   │   ├── commands.ts                 # REPL slash-command registry & metadata
│   │   └── frecency.ts                 # Command usage frequency × recency tracker
│   │
│   ├── utils/                          # Shared utilities
│   │   ├── retry.ts                    # Retry/backoff, abort-aware sleep, signal combiners
│   │   └── explore-cache.ts            # LRU exploration cache with mtime/TTL invalidation
│   │
│   ├── gateway/                        # Multi-transport gateway
│   │   ├── gateway.ts                  # Gateway host process (manages transports + worker)
│   │   ├── ipc-protocol.ts             # IPC message types (host ↔ worker)
│   │   └── agent-worker.ts             # Forked child process running the agent
│   │
│   └── scripts/                        # Standalone entry points
│       ├── telegram.ts                 # Telegram bot entry point
│       └── teams.ts                    # Teams bot entry point
│
└── dist/                               # Compiled output (mirrors src/ structure)

3. Core Loop

The agentic loop lives in loop.ts and is an async generator (agenticLoop()):

User prompt
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│  while (true) {                                         │
│    1. Micro-compaction  — trim old tool results (>10KB)  │
│    2. Auto-compaction   — summarize if near token limit  │
│    3. API call          — stream response from Claude    │
│    4. Parse tool_use    — queue to StreamingToolExecutor │
│       (tools start executing DURING streaming)          │
│    5. Collect results   — async generator yields them    │
│    6. If stop_reason == "tool_use" → loop               │
│    7. If stop_reason == "end_turn":                     │
│       a. If eval disabled → break                       │
│       b. Run multi-judge eval gate (§14):               │
│          ├─ Judges evaluate in parallel                  │
│          ├─ Majority pass → break                       │
│          └─ Majority fail → inject refinement → loop    │
│  }                                                      │
└─────────────────────────────────────────────────────────┘
    │
    ▼
Assistant response displayed to user

Retry logic: 3 retries with exponential backoff (1s base, 30s cap, ±25% jitter). Non-retryable: 400, 401, 403, 404. Retryable: 429, 529, 5xx.
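
The retry policy above can be sketched as a small delay calculator. This is an illustration only, not the code in retry.ts: the names `isRetryable` and `backoffDelayMs` are hypothetical, and it assumes the delay doubles on each attempt and that the cap is applied before the jitter.

```typescript
// Hypothetical helpers illustrating the retry policy; names are assumptions.
const BASE_MS = 1_000; // 1s base
const CAP_MS = 30_000; // 30s cap
const JITTER = 0.25; // plus or minus 25%

/** 429, 529 and other 5xx responses are retryable; 400/401/403/404 are not. */
function isRetryable(status: number): boolean {
  if (status === 429) return true;
  return status >= 500 && status <= 599;
}

/**
 * Delay before retry number `attempt` (0-based).
 * `random` is injectable so the jitter can be tested deterministically.
 */
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  const exponential = Math.min(CAP_MS, BASE_MS * 2 ** attempt);
  const jitterFactor = 1 + (random() * 2 - 1) * JITTER; // in [0.75, 1.25)
  return Math.round(exponential * jitterFactor);
}
```

With three retries and no jitter, the nominal delays under these assumptions would be 1s, 2s, and 4s.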


4. Tool System

4.1. Tool Interface

interface Tool {
  name: string;
  description: string;
  inputSchema: Anthropic.Tool["input_schema"];
  isConcurrencySafe: boolean;          // true → can run in parallel
  execute(input, context): Promise<ToolResult>;
}

4.2. Tool Categories

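| Category | Tools | Concurrency-Safe | Cacheable |
| --- | --- | --- | --- |
| File Read | Read | ✅ | ✅ |
| File Search | Glob, Grep | ✅ | ✅ |
| File Write | Write, Edit | ❌ | ❌ (invalidates cache) |
| Execution | Bash | ❌ | ❌ (invalidates cache) |
| Agent | Task | ✅ (background) | ❌ |
| Web | WebFetch, WebSearch | ✅ | ❌ |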

4.3. Read-Before-Write Guard

Both Write and Edit enforce a read-before-write invariant via readFileState (LRU, 500 entries):

  1. Has the file been read? readFileState.get(path) must return an entry.
  2. Is it stale? The file's current mtime is compared against the saved timestamp (1s tolerance).
  3. If either fails → error, model must re-read

This prevents blind overwrites and catches external modifications.
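
The guard can be sketched roughly as follows. This is a simplified illustration: `checkReadBeforeWrite`, `ReadRecord`, and `GuardResult` are hypothetical names, a plain Map stands in for the LRU, and the current mtime is passed in rather than read from disk.

```typescript
// Hypothetical sketch of the read-before-write check; names are assumptions.
interface ReadRecord {
  mtimeMs: number; // file mtime observed when the file was last read
}

const MTIME_TOLERANCE_MS = 1_000; // 1s tolerance from the text

type GuardResult = { ok: true } | { ok: false; reason: string };

function checkReadBeforeWrite(
  readFileState: Map<string, ReadRecord>,
  path: string,
  currentMtimeMs: number,
): GuardResult {
  const record = readFileState.get(path);
  if (!record) {
    return { ok: false, reason: "file has not been read yet" };
  }
  // Stale if the file on disk is newer than what was read, beyond the tolerance.
  if (currentMtimeMs - record.mtimeMs > MTIME_TOLERANCE_MS) {
    return { ok: false, reason: "file changed since last read" };
  }
  return { ok: true };
}
```

In both failure cases the tool would return an error so that the model re-reads the file before retrying.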

4.4. Atomic File Operations

Both Write and Edit use atomic writes:

  • Write to a temp file (openSync("wx"), exclusive create)
  • renameSync(tmp, target) — atomic on same filesystem
  • Fallback: copy + unlink for cross-filesystem (EXDEV)
  • Symlink-aware: resolves to real path before replacing
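
A minimal sketch of the write-then-rename pattern described above, using Node's synchronous fs API. The function name `atomicWrite` and the temp-file naming are assumptions, not the contents of fs-utils.ts. Because this sketch places the temp file next to the target, the EXDEV branch would only trigger if the temp file lived on a different filesystem.

```typescript
import {
  closeSync, copyFileSync, existsSync, mkdtempSync, openSync,
  readFileSync, realpathSync, renameSync, unlinkSync, writeSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical sketch; the real utility may differ.
function atomicWrite(target: string, content: string): void {
  // Resolve symlinks so the link itself is not replaced by a regular file.
  const realTarget = existsSync(target) ? realpathSync(target) : target;
  const tmp = join(dirname(realTarget), `.tmp-${process.pid}-${Date.now()}`);

  // "wx" fails if the temp file already exists (exclusive create).
  const fd = openSync(tmp, "wx");
  try {
    writeSync(fd, content);
  } finally {
    closeSync(fd);
  }

  try {
    renameSync(tmp, realTarget); // atomic on the same filesystem
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code !== "EXDEV") {
      unlinkSync(tmp);
      throw err;
    }
    // Cross-filesystem fallback: copy, then remove the temp file.
    copyFileSync(tmp, realTarget);
    unlinkSync(tmp);
  }
}

// Usage: write twice to a scratch directory and read the result back.
const scratchDir = mkdtempSync(join(tmpdir(), "atomic-write-"));
const demoFile = join(scratchDir, "demo.txt");
atomicWrite(demoFile, "first");
atomicWrite(demoFile, "second");
const demoContent = readFileSync(demoFile, "utf8");
```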

5. Parallel Tool Execution

StreamingToolExecutor (in streaming-executor.ts) manages concurrency:

┌──────────────────────────────────────────────────────────┐
│  CONCURRENCY RULES                                        │
│                                                          │
│  Safe tools (Read, Glob, Grep, WebFetch, WebSearch):     │
│    → Run in parallel (max 8 concurrent)                  │
│    → Blocked if any unsafe tool is running               │
│                                                          │
│  Unsafe tools (Write, Edit, Bash):                       │
│    → Run alone (barrier semantics)                       │
│    → Block until all prior tools finish                  │
│    → Nothing starts until barrier completes              │
│                                                          │
│  Example: [Read₁, Read₂, Write₃, Read₄]                 │
│                                                          │
│    Read₁  ████████                                       │
│    Read₂  ████████        (parallel)                     │
│                    Write₃ ██████████  (barrier, alone)   │
│                                      Read₄ ████  (after) │
└──────────────────────────────────────────────────────────┘

Key optimization: Tools start executing DURING API streaming. As each tool_use block is fully parsed, it’s immediately queued — file reads can complete before the model finishes generating subsequent tool calls.
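
The barrier rules can be illustrated with a static planner that groups an ordered list of tool calls into batches. This is a simplification under stated assumptions: the real StreamingToolExecutor schedules dynamically while the response streams, and `planBatches` and `ToolCall` are hypothetical names.

```typescript
// Hypothetical sketch: groups an ordered list of tool calls into execution batches.
interface ToolCall {
  id: string;
  isConcurrencySafe: boolean;
}

const MAX_PARALLEL = 8; // "max 8 concurrent" from the rules above

function planBatches(calls: ToolCall[]): ToolCall[][] {
  const batches: ToolCall[][] = [];
  let current: ToolCall[] = [];

  const flush = (): void => {
    if (current.length > 0) {
      batches.push(current);
      current = [];
    }
  };

  for (const call of calls) {
    if (!call.isConcurrencySafe) {
      flush(); // barrier: all prior tools must finish first
      batches.push([call]); // the unsafe tool runs alone
    } else {
      if (current.length === MAX_PARALLEL) flush();
      current.push(call);
    }
  }
  flush();
  return batches;
}

const plan = planBatches([
  { id: "Read1", isConcurrencySafe: true },
  { id: "Read2", isConcurrencySafe: true },
  { id: "Write3", isConcurrencySafe: false },
  { id: "Read4", isConcurrencySafe: true },
]).map((batch) => batch.map((call) => call.id));
// plan is [["Read1", "Read2"], ["Write3"], ["Read4"]]
```

Each inner array would run concurrently, and the batches would run one after another, matching the timeline in the diagram.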


6. Exploration Cache

6.1. Purpose

When the agent explores a codebase, it often re-reads the same files and re-runs the same searches across turns. The Exploration Cache stores results from read-only tools (Read, Glob, Grep) and returns cached results when the same tool call is repeated — avoiding redundant disk I/O and subprocess spawning.

6.2. Architecture

┌─────────────────────────────────────────────────────────┐
│  StreamingToolExecutor.executeTool()                     │
│                                                         │
│  ┌─── Read/Glob/Grep ────────┐  ┌── Write/Edit/Bash ──┐│
│  │                            │  │                     ││
│  │  1. cache.get(tool, input) │  │  After execution:   ││
│  │     ├─ HIT  → return cached│  │  → invalidateFile() ││
│  │     └─ MISS → execute tool │  │  → invalidateDir()  ││
│  │  2. Validate freshness:    │  │    (for Bash)       ││
│  │     ├─ Read: check mtime   │  │                     ││
│  │     └─ Glob/Grep: check TTL│  │                     ││
│  │  3. cache.set(tool, result)│  │                     ││
│  └────────────────────────────┘  └─────────────────────┘│
│                    ↕                       ↕             │
│            ┌──────────────┐                              │
│            │ ExploreCache  │                              │
│            │  LRU Map      │                              │
│            │  max: 200     │                              │
│            │  TTL: 30s     │                              │
│            │  (Glob/Grep)  │                              │
│            └──────────────┘                              │
└─────────────────────────────────────────────────────────┘

6.3. Cache Key

Deterministic SHA-256 hash of:

SHA-256({ tool: "Read", input: { file_path: "/src/foo.ts", ... }, cwd: "/project" })

Input properties are sorted by key name before hashing to ensure identical tool calls with different property ordering produce the same key.
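
One way to get an order-independent key is to serialize with sorted property names before hashing. The sketch below assumes that approach; `stableStringify` and `cacheKey` are hypothetical names, and the real key derivation may differ in detail.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: JSON-like serialization with object keys sorted at every level.
function stableStringify(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

function cacheKey(tool: string, input: Record<string, unknown>, cwd: string): string {
  return createHash("sha256")
    .update(stableStringify({ tool, input, cwd }))
    .digest("hex");
}
```

Two calls such as `{ file_path, limit }` and `{ limit, file_path }` then hash to the same key, while a different cwd produces a different one.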

6.4. Freshness Validation

| Tool | Validation Strategy | Rationale |
| --- | --- | --- |
| Read | statSync(path).mtimeMs compared to cached mtime | Single syscall (~0.1ms) vs full file read (~1-10ms). Catches external edits. |
| Glob | TTL (default 30s) | Can't efficiently mtime-check all files matching a pattern. |
| Grep | TTL (default 30s) | Same reason: searches span many files. |
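
The two freshness strategies can be sketched as a single predicate. The entry shape and the name `isFresh` are assumptions for illustration; the current mtime is passed in here, whereas the real check would obtain it from statSync.

```typescript
// Hypothetical entry shape and helper; names are assumptions.
interface CacheEntry {
  tool: "Read" | "Glob" | "Grep";
  storedAtMs: number; // when the result was cached
  mtimeMs?: number; // file mtime at cache time (Read entries only)
}

const DIRECTORY_TTL_MS = 30_000; // default TTL for Glob/Grep entries

function isFresh(entry: CacheEntry, nowMs: number, currentMtimeMs?: number): boolean {
  if (entry.tool === "Read") {
    // Read entries are valid only while the file's mtime is unchanged.
    return currentMtimeMs !== undefined && currentMtimeMs === entry.mtimeMs;
  }
  // Glob/Grep entries expire after the TTL.
  return nowMs - entry.storedAtMs < DIRECTORY_TTL_MS;
}
```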

6.5. Invalidation Strategy

The cache is invalidated at multiple levels to ensure correctness:

┌─────────────────────────────────────────────────────────────┐
│  INVALIDATION TRIGGERS                                       │
│                                                             │
│  1. File-level (Write, Edit):                               │
│     → Remove all Read entries for the modified path         │
│     → Remove Glob/Grep entries whose searchScope            │
│       includes the modified path                            │
│                                                             │
│  2. Directory-level (Bash):                                 │
│     → We can't know which files Bash touched                │
│     → Conservatively invalidate ALL entries under cwd       │
│     → Both Read entries under the dir AND                   │
│       Glob/Grep entries whose scope overlaps the dir        │
│                                                             │
│  3. Manual (REPL commands, session lifecycle):              │
│     → /clear     → cache.clear()                            │
│     → /undo      → cache.clear() (git stash changes files)  │
│     → /resume    → cache.clear() (files may have changed)   │
│     → Session reset → cache.clear()                         │
│                                                             │
│  4. Automatic (per-entry):                                  │
│     → Read: mtime mismatch on cache hit → evict + re-exec  │
│     → Glob/Grep: TTL expired → evict + re-exec             │
│     → LRU eviction when cache exceeds 200 entries           │
│                                                             │
│  5. Error results are NEVER cached                          │
│     → Transient errors (ENOENT, EACCES) should be retried  │
│     → Caching an error would mask the fix                   │
└─────────────────────────────────────────────────────────────┘
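
File-level and directory-level invalidation (triggers 1 and 2) might look like the sketch below. It assumes each entry records a scope (the file path for Read, the search root for Glob/Grep) and uses POSIX-style path prefixes; `ScopedEntry`, `isUnder`, `invalidateFile`, and `invalidateDir` are hypothetical names.

```typescript
// Hypothetical sketch of file- and directory-level invalidation.
interface ScopedEntry {
  tool: "Read" | "Glob" | "Grep";
  scope: string; // file path for Read, search root directory for Glob/Grep
}

/** True if `path` equals `dir` or lies underneath it (POSIX-style paths). */
function isUnder(path: string, dir: string): boolean {
  const base = dir.endsWith("/") ? dir : dir + "/";
  return path === dir || path.startsWith(base);
}

function invalidateFile(cache: Map<string, ScopedEntry>, filePath: string): number {
  let removed = 0;
  for (const [key, entry] of cache) {
    // Read entries match by exact path; searches match if their scope contains the file.
    const hit = entry.tool === "Read" ? entry.scope === filePath : isUnder(filePath, entry.scope);
    if (hit) {
      cache.delete(key);
      removed++;
    }
  }
  return removed;
}

function invalidateDir(cache: Map<string, ScopedEntry>, dir: string): number {
  let removed = 0;
  for (const [key, entry] of cache) {
    // Read entries under the dir, plus searches whose scope overlaps it.
    const hit =
      entry.tool === "Read"
        ? isUnder(entry.scope, dir)
        : isUnder(entry.scope, dir) || isUnder(dir, entry.scope);
    if (hit) {
      cache.delete(key);
      removed++;
    }
  }
  return removed;
}
```

The directory variant is deliberately broad, since a Bash command may have touched any file under cwd.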

6.6. Sub-Agent Integration

When a sub-agent is spawned (cloneContext):

  • The parent’s explore cache is shallow-cloned to the child
  • The child inherits all cached exploration results
  • The child’s invalidations do NOT propagate back to the parent
  • This matches the isolation model used by readFileState

6.7. Configuration

| Setting | Default | Description |
| --- | --- | --- |
| maxEntries | 200 | LRU eviction cap |
| directoryTtlMs | 30,000 | TTL for Glob/Grep entries |
| enabled | true | Kill switch |

6.8. Monitoring

Use the /cache REPL command:

/cache
Explore Cache
  Entries:      12
  Hits:         47
  Misses:       15
  Invalidations: 3
  Evictions:    0
  Hit rate:     75.8%
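
The hit rate in the sample output is consistent with hits / (hits + misses): 47 / (47 + 15) = 47 / 62 ≈ 0.758. A small helper showing that arithmetic, with `hitRate` as a hypothetical name:

```typescript
// Hypothetical helper: formats hits / (hits + misses) as a percentage.
function hitRate(hits: number, misses: number): string {
  const total = hits + misses;
  if (total === 0) return "n/a"; // no lookups yet
  return `${((hits / total) * 100).toFixed(1)}%`;
}

console.log(hitRate(47, 15)); // prints "75.8%"
```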

7. Sub-Agent System

7.1. Agent Types

| Type | Model | Tools | Purpose |
| --- | --- | --- | --- |
| Explore | Small (Haiku) | Read, Glob, Grep | Fast read-only codebase exploration |
| Plan | Main | Read, Glob, Grep | Architecture & implementation planning |
| Bash | Main | Bash | Focused command execution |
| general-purpose | Main | All | Multi-step research, broad analysis |

7.2. Context Isolation

Each sub-agent receives a cloned context:

cloneContext(parent)  {
  readFileState: parent.readFileState.clone(),   // Independent copy
  exploreCache:  parent.exploreCache.clone(),    // Inherits cached data
  abortController: childAbort,                   // Linked to parent
  agentId: randomUUID(),                         // Unique identity
  depth: parent.depth + 1,                       // Nesting depth
}

  • AbortController linkage: Aborting the parent aborts all children. Not vice versa.
  • Max depth: 5 levels of nesting (MAX_AGENT_DEPTH).
  • Sub-agents can run in foreground (blocking) or background (fire-and-forget).
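
The one-way abort linkage and the depth limit can be sketched as below. This is a reduced illustration with assumed names: `AgentContext` omits most fields of the real ToolContext, a counter stands in for randomUUID(), and it assumes a child at depth 5 is still allowed while depth 6 is rejected.

```typescript
// Hypothetical, simplified context; the real ToolContext has more fields.
interface AgentContext {
  readFileState: Map<string, number>;
  abortController: AbortController;
  agentId: string;
  depth: number;
}

const MAX_AGENT_DEPTH = 5;
let nextAgentId = 0; // stand-in for randomUUID()

function cloneContext(parent: AgentContext): AgentContext {
  if (parent.depth + 1 > MAX_AGENT_DEPTH) {
    throw new Error(`Max agent depth (${MAX_AGENT_DEPTH}) exceeded`);
  }
  const childAbort = new AbortController();
  // One-way linkage: a parent abort propagates down, a child abort does not propagate up.
  if (parent.abortController.signal.aborted) {
    childAbort.abort();
  } else {
    parent.abortController.signal.addEventListener("abort", () => childAbort.abort(), {
      once: true,
    });
  }
  return {
    readFileState: new Map(parent.readFileState), // independent copy
    abortController: childAbort,
    agentId: `agent-${++nextAgentId}`,
    depth: parent.depth + 1,
  };
}
```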

8. Context Management

8.1. Token Estimation

Simple heuristic: ~4 chars/token + per-message overhead. Includes system prompt.
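
A sketch of such a heuristic is shown below. The per-message overhead value (4 tokens) is an assumption, since the text does not state it, and `estimateTokens` and `SimpleMessage` are hypothetical names.

```typescript
// Hypothetical message shape; real messages carry structured content blocks.
interface SimpleMessage {
  role: "user" | "assistant";
  content: string;
}

const CHARS_PER_TOKEN = 4; // "~4 chars/token" from the text
const PER_MESSAGE_OVERHEAD = 4; // assumed value, not taken from the source

function estimateTokens(systemPrompt: string, messages: SimpleMessage[]): number {
  let tokens = Math.ceil(systemPrompt.length / CHARS_PER_TOKEN);
  for (const message of messages) {
    tokens += Math.ceil(message.content.length / CHARS_PER_TOKEN) + PER_MESSAGE_OVERHEAD;
  }
  return tokens;
}
```

For an 8-character system prompt and one 4-character message, this gives 2 + (1 + 4) = 7 tokens.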

8.2. Micro-Compaction (per-turn)

After each turn, tool results older than the 3 most recent and larger than 10KB are replaced with truncation notices: "[content truncated — X chars]".
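
The rule above can be sketched as a pure function over an ordered list of tool results. `microCompact` and `ToolResultEntry` are hypothetical names, and "10KB" is treated here as 10,240 characters, which is an assumption.

```typescript
// Hypothetical shape for a stored tool result.
interface ToolResultEntry {
  toolUseId: string;
  content: string;
}

const KEEP_RECENT = 3; // the 3 most recent results are never trimmed
const MAX_CHARS = 10 * 1024; // "10KB" treated as 10,240 characters (assumption)

function microCompact(results: ToolResultEntry[]): ToolResultEntry[] {
  const cutoff = results.length - KEEP_RECENT;
  return results.map((result, index) => {
    if (index >= cutoff || result.content.length <= MAX_CHARS) return result;
    // \u2014 is the dash used in the notice format quoted in the text.
    return { ...result, content: `[content truncated \u2014 ${result.content.length} chars]` };
  });
}
```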

8.3. Auto-Compaction (threshold-based)

When estimated tokens exceed the compaction threshold (default 160K):

  1. The small model (Haiku) summarizes the conversation
  2. Old messages are replaced with the summary
  3. The 6 most recent messages are preserved
  4. readFileState and exploreCache are cleared

8.4. Orphaned Tool-Use Repair

If a tool_use block has no matching tool_result (from aborts, compaction, or force-quit during a turn), synthetic tool_result entries are injected so the API doesn’t reject the conversation with a 400 error.
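
A possible shape for this repair is sketched below. The block types are simplified versions of the Messages API content blocks, the function name `repairOrphanedToolUses` and the placeholder result text are assumptions, and the sketch places synthetic results at the start of the following user message.

```typescript
// Simplified block and message shapes modeled on the Messages API.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface Msg {
  role: "user" | "assistant";
  content: Block[];
}

function idsOf(blocks: Block[], kind: "tool_use" | "tool_result"): string[] {
  const ids: string[] = [];
  for (const b of blocks) {
    if (b.type === "tool_use" && kind === "tool_use") ids.push(b.id);
    if (b.type === "tool_result" && kind === "tool_result") ids.push(b.tool_use_id);
  }
  return ids;
}

function repairOrphanedToolUses(messages: Msg[]): Msg[] {
  const repaired: Msg[] = [];
  for (let i = 0; i < messages.length; i++) {
    const msg = messages[i];
    repaired.push(msg);
    if (msg.role !== "assistant") continue;

    const toolUseIds = idsOf(msg.content, "tool_use");
    if (toolUseIds.length === 0) continue;

    const next = messages[i + 1];
    const nextIsUser = next !== undefined && next.role === "user";
    const answered = new Set(nextIsUser ? idsOf(next.content, "tool_result") : []);
    const missing = toolUseIds.filter((id) => !answered.has(id));
    if (missing.length === 0) continue;

    const synthetic: Block[] = missing.map((id) => ({
      type: "tool_result",
      tool_use_id: id,
      content: "[tool execution was interrupted]", // placeholder text (assumption)
    }));
    if (nextIsUser) {
      // Merge into the existing user message, synthetic results first.
      repaired.push({ ...next, content: [...synthetic, ...next.content] });
      i++; // the next message has been consumed
    } else {
      repaired.push({ role: "user", content: synthetic });
    }
  }
  return repaired;
}
```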


9. Session Persistence

Sessions are saved to ~/.codingagent/sessions/<id>.json:

{
  id: string;
  metadata: { model, cwd, startedAt, lastActiveAt, summary };
  messages: Message[];
}

  • Auto-save: After every turn + on process exit (SIGINT, SIGTERM, beforeExit)
  • Fast metadata: extractMetadataFast() reads only the first 4KB for listing
  • Resume: /resume [id] or --resume [id] restores messages + clears stale caches

10. Configuration

Sources (in precedence order):

  1. The env object in ~/.claude/settings.json
  2. process.env
  3. Hardcoded defaults

| Config | Env Var | Default |
| --- | --- | --- |
| Model | ANTHROPIC_MODEL | claude-sonnet-4-20250514 |
| Small model | ANTHROPIC_SMALL_FAST_MODEL | claude-haiku-3-5-20241022 |
| Base URL | ANTHROPIC_BASE_URL | https://api.anthropic.com |
| API key | ANTHROPIC_API_KEY | (required) |
| Max output tokens | ANTHROPIC_MAX_OUTPUT_TOKENS | 16384 |
| Compaction threshold | ANTHROPIC_COMPACTION_THRESHOLD | 160000 |
| Debug mode | CODINGAGENT_DEBUG | false |
| Skill directories | skillDirs (settings.json only) | [] (built-in defaults always loaded) |
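
The precedence order can be illustrated with a small lookup. `resolveSetting` is a hypothetical helper, not the loader in config.ts, and both environments are passed in as plain objects to keep the example self-contained.

```typescript
// Hypothetical resolver showing the precedence order only.
function resolveSetting(
  envVar: string,
  settingsEnv: Record<string, string | undefined>, // env object from settings.json
  processEnv: Record<string, string | undefined>, // process.env
  fallback: string, // hardcoded default
): string {
  return settingsEnv[envVar] ?? processEnv[envVar] ?? fallback;
}

const model = resolveSetting(
  "ANTHROPIC_MODEL",
  {},
  { ANTHROPIC_MODEL: "model-from-process-env" },
  "claude-sonnet-4-20250514",
);
// model is "model-from-process-env" because settings.json does not set it
```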

11. Terminal Output

11.1. The Problem

Node’s readline module manages the REPL prompt on stdout. Internally, it tracks how many display rows the prompt occupies (prevRows). Every prompt redraw (_refreshLine) moves the cursor UP by prevRows, then calls clearScreenDown to erase everything below. This means any text written to stdout asynchronously (e.g., from MCP server loading, background agent completion, config hot-reload notifications) gets wiped by readline’s next redraw — appearing to flash and then disappear.

11.2. Architecture

All terminal output routes through a centralized OutputManager singleton:

┌──────────────────────────────────────────────────────────┐
│  Modules (loop.ts, config.ts, mcp-client.ts, session.ts, │
│           tools/index.ts, streaming-executor.ts, etc.)   │
│                                                          │
│  printWarning("…")  printInfo("…")  printSuccess("…")   │
│        │                  │                │              │
│        └──────────────────┼────────────────┘              │
│                           ▼                              │
│                    ┌─────────────┐                        │
│                    │OutputManager│ (singleton in ui.ts)   │
│                    │ .log()      │                        │
│                    │ .info()     │                        │
│                    │ .warn()     │                        │
│                    │ .success()  │                        │
│                    │ .write()    │                        │
│                    └──────┬──────┘                        │
│                           │                              │
│              ┌────────────┼────────────┐                 │
│              ▼                         ▼                 │
│     ┌─────────────────┐      ┌─────────────────┐        │
│     │ Non-interactive  │      │ Interactive      │        │
│     │ (-p, piped stdin)│      │ (REPL mode)      │        │
│     │                  │      │                  │        │
│     │ console.log(text)│      │ 1. clearHints()  │        │
│     │ (direct)         │      │ 2. \r\x1b[K      │        │
│     │                  │      │ 3. console.log() │        │
│     │                  │      │ 4. prevRows = 0  │        │
│     │                  │      │ 5. rl.prompt()   │        │
│     └─────────────────┘      └─────────────────┘        │
└──────────────────────────────────────────────────────────┘

11.3. Lifecycle

| Event | Call | Effect |
| --- | --- | --- |
| REPL starts (after rl + hintManager created) | output.setReadline(rl, hintManager) | Enables readline-safe path |
| /reload (before rl.close()) | output.detachReadline() | Falls back to direct console.log |
| Non-interactive mode (-p, piped stdin) | (never called) | Uses console.log directly |

11.4. Output Channels

| Channel | Target | Used by |
| --- | --- | --- |
| output.log() / .info() / .warn() / .success() | stdout (readline-safe) | All modules (via printInfo/printWarning/printSuccess) |
| output.write() | stdout (raw, no re-prompt) | Streaming assistant text chunks |
| Spinner | stderr | Thinking indicator, compaction progress |
| console.error | stderr | Fatal errors, startup failures |

11.5. Why Spinner Uses stderr

The Spinner class writes exclusively to process.stderr. This avoids interfering with readline’s stdout management and ensures spinner output doesn’t contaminate piped stdout in non-interactive mode (-p).


12. Caching & Invalidation Summary

The system has multiple caching layers, each with its own invalidation strategy:

| Cache | Location | Strategy | Invalidation |
| --- | --- | --- | --- |
| Config | config.ts | Singleton, lazy init | cachedConfig = null on reset |
| API client | client.ts | Singleton | onConfigReset callback |
| ReadFileState | context.ts | LRU Map (500 entries) | Per-file on write; bulk clear() on /undo, /clear, /resume |
| ExploreCache | explore-cache.ts | LRU Map (200 entries) | File-level (Write/Edit), directory-level (Bash), mtime (Read), TTL (Glob/Grep), manual (REPL commands) |
| System prompt | index.ts | Session-stable timestamp | New session resets sessionStartDate |
| Tool registry | tools/index.ts | Static Map | Never invalidated (tools are compile-time constants) |
| Session metadata | session.ts | First-4KB fast path | Re-read on listSessions() |

Data flow between caches

                    ┌──────────────────────────────────┐
                    │  User prompt / tool execution     │
                    └──────────────┬───────────────────┘
                                   │
               ┌───────────────────┼───────────────────┐
               │                   │                   │
               ▼                   ▼                   ▼
     ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
     │  ExploreCache    │ │ ReadFileState    │ │  Session Store  │
     │  (tool results)  │ │ (file mtimes)    │ │ (messages)      │
     │                  │ │                  │ │                  │
     │  Read hit? ──────┤ │ Read-before-     │ │ Auto-save after │
     │    → return       │ │ write guard      │ │ each turn       │
     │  Read miss? ─────┤ │                  │ │                  │
     │    → execute      │ │ Edit/Write       │ │ /resume restores│
     │    → store result │ │ updates mtime    │ │ + clears caches │
     │                  │ │                  │ │                  │
     │  Write/Edit?     │ │                  │ │                  │
     │    → invalidate  ◄─┤                  │ │                  │
     │  Bash?           │ │                  │ │                  │
     │    → invalidate  │ │                  │ │                  │
     │      directory   │ │                  │ │                  │
     └─────────────────┘ └─────────────────┘ └─────────────────┘

13. REPL Commands

| Command | Description |
| --- | --- |
| /help | Show all commands |
| /clear | Start new conversation (clears caches) |
| /compact [--force] | Compress conversation context |
| /status | Show token usage, cost, turn count |
| /cache | Show explore cache statistics |
| /model <name> | Switch model mid-session |
| /smallmodel <name> | Switch small model |
| /tokens | Show current token count |
| /history | Show conversation history |
| /retry | Re-send last prompt |
| /agents [id] | Show background agent status |
| /save | Save current session |
| /sessions | List saved sessions |
| /resume [id] | Resume a saved session |
| /delete-session <id> | Delete a saved session |
| /undo | Stash uncommitted changes via git |
| /reload | Reload config from settings.json |
| /quit | Exit (also Ctrl+C ×2) |

14. Multi-Judge Eval Gate

14.1. Purpose

The eval gate is a verification step that runs after the agent declares its work complete (stop_reason: end_turn). Instead of trusting the agent’s judgment that the task is done, multiple AI “judges” independently evaluate the work from different perspectives. The loop only terminates when a majority of judges agree the work is complete. Otherwise, the judges’ feedback is injected as a refinement prompt and the agent continues working.

This addresses a core failure mode of agentic loops: the agent produces a superficially complete response that misses requirements, contains bugs, or doesn’t actually solve the user’s problem. Without eval, the user must manually verify and re-prompt. With eval, the loop self-corrects.

14.2. Architecture

Agent declares "done" (end_turn, no tool calls)
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│  EVAL GATE (up to MAX_EVAL_ROUNDS = 3)                      │
│                                                             │
│  1. Build eval context:                                     │
│     - Original user request (first user message)            │
│     - Agent work summary (last 20 messages: text + tools)   │
│                                                             │
│  2. Run judges IN PARALLEL (Promise.allSettled):             │
│     ┌─────────────┐ ┌──────────────┐ ┌────────────────┐    │
│     │ Correctness  │ │ Completeness │ │ Goal Alignment │    │
│     │              │ │              │ │                │    │
│     │ Bugs?        │ │ All parts    │ │ Does it solve  │    │
│     │ Logic errors?│ │ addressed?   │ │ the real need? │    │
│     │ Syntax valid?│ │ TODOs left?  │ │ User would be  │    │
│     │              │ │              │ │ satisfied?     │    │
│     └──────┬───────┘ └──────┬───────┘ └──────┬─────────┘    │
│            │                │                │              │
│            ▼                ▼                ▼              │
│     { isComplete, reasoning, refinementSuggestions }        │
│                                                             │
│  3. MAJORITY RULE: completeCount > judges.length / 2        │
│     ├─ YES → yield eval_complete(passed: true) → EXIT LOOP  │
│     └─ NO  → synthesize refinement prompt                   │
│              → inject as user message                       │
│              → yield eval_complete(passed: false)            │
│              → CONTINUE LOOP                                │
└─────────────────────────────────────────────────────────────┘

14.3. Judge Perspectives

Three built-in judges provide complementary viewpoints:

| Judge | Focus | Pass Criteria |
| --- | --- | --- |
| Correctness | Bugs, logic errors, syntax validity, edge cases | Work correctly fulfills the request without errors |
| Completeness | All aspects addressed, no TODOs, consistent changes | Every part of the request is covered |
| Goal Alignment | User’s underlying intent, appropriate approach | The result serves the user’s actual need |

Each judge returns a structured JSON verdict:

{
  "isComplete": true,
  "reasoning": "The implementation correctly handles all edge cases...",
  "refinementSuggestions": []
}

14.4. Majority Rule

The work is accepted when more than half of the judges agree it’s complete. With 3 judges, this means 2 must pass. This design:

  • Prevents single-judge blocking: One overly-strict judge can’t hold up completion if the other two are satisfied
  • Catches real issues: When 2+ judges flag problems, there’s likely a genuine deficiency
  • Avoids ties: Odd number of judges gives a clear majority
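
The majority rule reduces to a strict comparison, sketched here with hypothetical names (`Verdict`, `majorityPassed`). Note that a strict majority means an even split counts as a failure.

```typescript
// Hypothetical verdict shape, reduced to the fields the rule needs.
interface Verdict {
  judgeName: string;
  isComplete: boolean;
}

function majorityPassed(verdicts: Verdict[]): boolean {
  const completeCount = verdicts.filter((v) => v.isComplete).length;
  return completeCount > verdicts.length / 2; // strict majority
}

const twoOfThree = majorityPassed([
  { judgeName: "Correctness", isComplete: true },
  { judgeName: "Completeness", isComplete: false },
  { judgeName: "Goal Alignment", isComplete: true },
]);
// twoOfThree is true: 2 > 1.5
```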

14.5. Refinement Loop

When eval fails:

  1. Failing judges’ reasoning and suggestions are synthesized into a refinement prompt
  2. The prompt is injected as a user message into the conversation
  3. The agentic loop continues — the model processes the feedback and produces a revised response
  4. When the model stops again, eval runs again (up to MAX_EVAL_ROUNDS = 3)

Example refinement prompt:

The following evaluation judges found issues with the work (round 1/3):

**Correctness**: The function doesn't handle null input — it would throw a
TypeError at runtime.
Suggestions:
  - Add a null guard at the start of processData()

**Completeness**: The unit tests were not updated to cover the new function.
Suggestions:
  - Add test cases for processData() in tests/utils.test.ts

Please address the issues above and continue working to fully complete the
original request.
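
A prompt of this shape could be assembled from the failing verdicts roughly as follows. `buildRefinementPrompt` and `FailedVerdict` are hypothetical names, and the exact wording and spacing used by eval.ts may differ.

```typescript
// Hypothetical verdict shape for judges that did not pass the work.
interface FailedVerdict {
  judgeName: string;
  reasoning: string;
  refinementSuggestions: string[];
}

function buildRefinementPrompt(failed: FailedVerdict[], round: number, maxRounds: number): string {
  const sections = failed.map((v) => {
    const lines = [`**${v.judgeName}**: ${v.reasoning}`];
    if (v.refinementSuggestions.length > 0) {
      lines.push("Suggestions:", ...v.refinementSuggestions.map((s) => `  - ${s}`));
    }
    return lines.join("\n");
  });
  return [
    `The following evaluation judges found issues with the work (round ${round}/${maxRounds}):`,
    ...sections,
    "Please address the issues above and continue working to fully complete the original request.",
  ].join("\n\n");
}
```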

14.6. Cost & Performance

| Aspect | Detail |
| --- | --- |
| Model | Uses smallModel (e.g., Haiku) for all judges; fast and cheap |
| Parallelism | All judges run simultaneously (Promise.allSettled) |
| Max cost | 3 judges × 3 rounds × ~100 output tokens = ~900 output tokens if all rounds run |
| Retry | 2 retries per judge API call (vs 3 for main loop) |
| Graceful degradation | If eval API calls fail, the result is accepted without verification |

14.7. When Eval is Skipped

Eval is NOT run in these cases:

| Condition | Reason |
| --- | --- |
| --eval not passed | Eval is opt-in to avoid overhead for simple tasks |
| Sub-agents (depth > 0) | Only the root agent needs verification |
| stop_reason ≠ end_turn | max_tokens or content_filter indicates an abnormal stop, so there is nothing to evaluate |
| Empty response | No work to evaluate |
| evalRound ≥ MAX_EVAL_ROUNDS | Prevents infinite refinement; the result is accepted after 3 rounds |
| Eval API error | Graceful degradation: the unverified result is accepted |

14.8. LoopYield Events

Three new event types are emitted for UI consumption:

| Event | When | Payload |
| --- | --- | --- |
| eval_start | Beginning of each eval round | round, judgeCount |
| eval_judge_verdict | After each judge responds | verdict: { judgeName, isComplete, reasoning }, round |
| eval_complete | After all judges are aggregated | passed, round, refinementPrompt? |

14.9. Usage

# Enable eval for a one-shot prompt
codingagent --eval -p "Refactor the auth module to use JWT tokens"

# Enable eval in interactive REPL
codingagent --eval

15. Skills & Project Memory

15.1. Overview

The skills & memory system loads instruction files from disk and injects them into the system prompt and/or user messages. It supports multiple ecosystems (Claude Code, GitHub Copilot, OpenAI Codex, Google Gemini) and provides a slash-command skill invocation mechanism.

15.2. Project Memory (CLAUDE.md)

Memory files are loaded hierarchically. Lower-priority entries appear earlier in the system prompt; higher-priority entries appear later (giving them implicit precedence).

┌─────────────────────────────────────────────────────────────┐
│  MEMORY LOADING ORDER (priority ascending)                   │
│                                                             │
│  Priority 10 — User-level (global, personal):               │
│    ~/.claude/CLAUDE.md                                      │
│    ~/.codex/AGENTS.md           (OpenAI Codex)              │
│                                                             │
│  Priority 20 — Project-level (team-shared):                 │
│    ./CLAUDE.md  or  .claude/CLAUDE.md                       │
│    .github/copilot-instructions.md                          │
│    ./AGENTS.md                  (OpenAI Codex)              │
│    ./GEMINI.md                  (Google Gemini)             │
│                                                             │
│  Priority 25 — Modular rules (with optional path scoping):  │
│    .claude/rules/*.md                                       │
│    .github/instructions/*.instructions.md                   │
│                                                             │
│  Priority 30 — Local overrides (personal, not committed):   │
│    ./CLAUDE.local.md                                        │
└─────────────────────────────────────────────────────────────┘

Features:

  • Import support: @path/to/file lines in CLAUDE.md are inlined (up to 5 levels deep)
  • Path-scoped rules: YAML frontmatter paths: (Claude) or applyTo: (Copilot) restricts rules to specific file patterns
  • Truncation: Individual entries are capped at 10,000 chars to prevent context window exhaustion
  • Caching: Results are cached per project directory; reset on /clear, /reload, /resume
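The import and truncation features can be sketched as follows, under stated assumptions: the `ReadFile` callback, the function names, and the decision to leave an unresolvable `@path` line untouched are illustrative, not confirmed behavior. Resolving the imported path relative to the importing file is omitted.

```typescript
const MAX_IMPORT_DEPTH = 5;     // "up to 5 levels deep"
const MAX_ENTRY_CHARS = 10_000; // per-entry cap

// Hypothetical reader: returns file contents, or undefined if unreadable.
type ReadFile = (path: string) => string | undefined;

// Replace each line of the form "@path/to/file" with that file's contents,
// recursing until the depth limit is reached.
function inlineImports(text: string, read: ReadFile, depth = 0): string {
  return text
    .split("\n")
    .map((line) => {
      const target = /^@(\S+)$/.exec(line.trim())?.[1];
      if (target === undefined || depth >= MAX_IMPORT_DEPTH) return line;
      const imported = read(target);
      // Assumption: an unreadable import is left as a literal line.
      return imported === undefined
        ? line
        : inlineImports(imported, read, depth + 1);
    })
    .join("\n");
}

// Cap a single memory entry so one large file cannot exhaust the context.
function truncateEntry(text: string): string {
  return text.length > MAX_ENTRY_CHARS ? text.slice(0, MAX_ENTRY_CHARS) : text;
}
```

The depth limit also bounds accidental import cycles: a file that imports itself stops expanding after five levels instead of recursing forever.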

15.3. Skills (SKILL.md)

Skills are reusable specialist prompts that can be invoked as slash commands.

15.3.1. Skill Directory Loading

Skills are loaded from built-in default directories (always) plus any additional directories configured via skillDirs in settings.json.

┌─────────────────────────────────────────────────────────────┐
│  SKILL LOADING ORDER (later wins on name collision)          │
│                                                             │
│  1. ~/.claude/skills/          — Built-in: user-level       │
│  2. .claude/skills/            — Built-in: project-level    │
│  3. config.skillDirs[0]        — Extra: from settings.json  │
│  4. config.skillDirs[1]        — Extra: from settings.json  │
│     ...                                                     │
│                                                             │
│  Directories 1 & 2 are ALWAYS loaded.                       │
│  Directories 3+ are loaded only if configured.              │
│  Later directories override earlier ones on name collision. │
└─────────────────────────────────────────────────────────────┘
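The "later wins" rule amounts to inserting skills into a map keyed by name, in loading order. The sketch below assumes a simplified `Skill` shape and a hypothetical `mergeSkills` helper; it only illustrates the collision rule.

```typescript
// Simplified, hypothetical skill shape for this example.
interface Skill {
  name: string;
  description: string;
  sourceDir: string;
}

// `perDirectory` lists the skills found in each directory, in loading order:
// ~/.claude/skills, then .claude/skills, then each configured skillDirs entry.
function mergeSkills(perDirectory: Skill[][]): Map<string, Skill> {
  const merged = new Map<string, Skill>();
  for (const skills of perDirectory) {
    for (const skill of skills) {
      // Map.set overwrites an existing key, so a later directory wins.
      merged.set(skill.name, skill);
    }
  }
  return merged;
}
```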

Configure extra directories in ~/.claude/settings.json:

{
  "skillDirs": [
    "/shared/team-skills",
    "C:\\company\\prompts",
    "~/my-custom-skills"
  ]
}

Path resolution:

  • ~ prefix → expanded to user’s home directory
  • Absolute paths → used as-is
  • Relative paths → resolved against the current working directory
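These three rules map directly onto Node's `os` and `path` modules. The helper name `resolveSkillDir` is an assumption for this sketch; the actual implementation may handle more cases.

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Resolve one skillDirs entry following the three rules above.
function resolveSkillDir(dir: string, cwd: string = process.cwd()): string {
  // "~" prefix: expand to the user's home directory.
  if (dir === "~" || dir.startsWith("~/") || dir.startsWith("~\\")) {
    return path.join(os.homedir(), dir.slice(1));
  }
  // Absolute paths are used as-is; relative paths resolve against cwd.
  return path.isAbsolute(dir) ? dir : path.resolve(cwd, dir);
}
```

Only a leading `~` on its own or followed by a separator is expanded here, so a directory literally named `~backup` is treated as a relative path.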

15.3.2. SKILL.md Format

Each skill is a markdown file with optional YAML frontmatter:

---
name: react-specialist
description: Senior React developer for modern React 19 patterns
disable-model-invocation: false
user-invocable: true
allowed-tools: [Read, Grep, Edit, Write]
context: inline
---

You are a senior React developer. When writing React code...

| Frontmatter Field | Type | Default | Description |
|---|---|---|---|
| name | string | Parent directory name | Skill identifier (used as /name command) |
| description | string | "Skill: {name}" | One-line description for /skills and system prompt |
| disable-model-invocation | boolean | false | If true, skill is hidden from the system prompt |
| user-invocable | boolean | true | If false, skill can't be triggered via /name |
| allowed-tools | string[] | all | Restrict which tools the skill can use |
| context | "inline" \| "fork" | "inline" | "fork" runs in an isolated sub-agent |
| agent | string | "general-purpose" | Sub-agent type when context: fork |
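
The defaults in the table can be applied once the frontmatter has been parsed into a plain object. The sketch below skips YAML parsing; the `SkillConfig` type and `applySkillDefaults` function are hypothetical names, and representing an unrestricted tool list as the string `"all"` is an assumption.

```typescript
interface SkillConfig {
  name: string;
  description: string;
  disableModelInvocation: boolean;
  userInvocable: boolean;
  allowedTools: string[] | "all";
  context: "inline" | "fork";
  agent: string;
}

const asString = (value: unknown, fallback: string): string =>
  typeof value === "string" ? value : fallback;
const asBoolean = (value: unknown, fallback: boolean): boolean =>
  typeof value === "boolean" ? value : fallback;

// `raw` is the already-parsed frontmatter; `parentDirName` is the directory
// that contains SKILL.md, used as the fallback skill name.
function applySkillDefaults(
  raw: Record<string, unknown>,
  parentDirName: string,
): SkillConfig {
  const name = asString(raw["name"], parentDirName);
  const tools = raw["allowed-tools"];
  return {
    name,
    description: asString(raw["description"], `Skill: ${name}`),
    disableModelInvocation: asBoolean(raw["disable-model-invocation"], false),
    userInvocable: asBoolean(raw["user-invocable"], true),
    allowedTools:
      Array.isArray(tools) && tools.every((t) => typeof t === "string")
        ? (tools as string[])
        : "all",
    context: raw["context"] === "fork" ? "fork" : "inline",
    agent: asString(raw["agent"], "general-purpose"),
  };
}
```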

15.3.3. How Skills Reach the LLM

Skills are surfaced to the model through two mechanisms:

┌─────────────────────────────────────────────────────────────┐
│  A. SYSTEM PROMPT (every API call)                           │
│                                                             │
│  getSystemPrompt()                                          │
│    └── getSkillDescriptions(cwd)                            │
│          └── Filters: disableModelInvocation === false       │
│          └── Appends to system prompt:                      │
│                                                             │
│              Available skills (can be invoked via slash      │
│              commands):                                     │
│              - /react-specialist: Senior React developer... │
│              - /python-pro: Senior Python developer...      │
│                                                             │
│  B. SLASH-COMMAND INVOCATION (on user trigger)              │
│                                                             │
│  User types: /react-specialist fix the useEffect hook       │
│    └── getSkill(cwd, "react-specialist")                    │
│    └── substituteArguments(instructions, args)              │
│          └── $ARGUMENTS → "fix the useEffect hook"          │
│          └── $0 → "fix", $1 → "the", etc.                  │
│    └── Expanded instructions become the user message        │
│    └── Falls through to the agentic loop                    │
└─────────────────────────────────────────────────────────────┘
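
The argument substitution step in mechanism B might look like the sketch below. It reuses the `substituteArguments` name from the diagram, but the body is an assumption: in particular, splitting positional arguments on whitespace and replacing an out-of-range `$N` with an empty string are guesses, not confirmed behavior.

```typescript
// Expand $ARGUMENTS and positional $0, $1, ... placeholders in skill text.
function substituteArguments(instructions: string, args: string): string {
  // Assumption: positional arguments are whitespace-separated words.
  const positional = args.trim().split(/\s+/).filter(Boolean);
  // A single pass handles both placeholder kinds, so a "$1" that appears
  // inside the user's own arguments is never expanded a second time.
  return instructions.replace(
    /\$(ARGUMENTS|\d+)/g,
    (_match: string, token: string) =>
      token === "ARGUMENTS" ? args : (positional[Number(token)] ?? ""),
  );
}

const expanded = substituteArguments(
  "Task: $ARGUMENTS (verb: $0)",
  "fix the useEffect hook",
);
// expanded === "Task: fix the useEffect hook (verb: fix)"
```

Using a replacer function rather than a replacement string also avoids the special meaning of sequences such as `$&` in `String.prototype.replace` when they occur in user input.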

15.3.4. Tab-Completion & Hints Registration

At startup (and after /clear, /reload):

syncSkillCommands(cwd)
  └── getInvocableSkills(cwd)     — filters userInvocable === true
  └── registerSkillCommands()      — registers for tab-completion,
                                     inline hints, /help, and
                                     fuzzy command suggestions

15.4. Cache Lifecycle

| Event | Memory Cache | Skills Cache |
|---|---|---|
| Startup | Loaded from disk | Loaded from disk |
| /clear | resetMemoryCache() (re-read on next prompt) | Reset (re-read + re-sync commands) |
| /reload | Reset | Reset + re-sync commands |
| /resume | Reset (implicit) | Reset (implicit) |
| CWD change | Re-read (different cachedProjectDir) | Re-read (different cachedProjectDir) |
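
The lifecycle in the table is consistent with a cache keyed on the project directory: a hit requires the same directory as the previous load, and a reset simply drops the stored entry. The `ProjectCache` class below is a hypothetical sketch of that idea; only the `cachedProjectDir` name comes from the table.

```typescript
// Generic single-entry cache keyed by project directory (illustrative only).
class ProjectCache<T> {
  private entry: { cachedProjectDir: string; value: T } | undefined;
  private readonly load: (projectDir: string) => T;

  constructor(load: (projectDir: string) => T) {
    this.load = load;
  }

  // Re-read when nothing is cached or the working directory has changed.
  get(projectDir: string): T {
    if (this.entry === undefined || this.entry.cachedProjectDir !== projectDir) {
      this.entry = { cachedProjectDir: projectDir, value: this.load(projectDir) };
    }
    return this.entry.value;
  }

  // Called on /clear, /reload, and /resume; the next get() reloads from disk.
  reset(): void {
    this.entry = undefined;
  }
}
```

Resetting lazily, rather than reloading immediately, matches the "re-read on next prompt" behavior described for /clear.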

This document is maintained alongside the source code. When modifying core systems (loop, tools, caching, agents, eval, skills), update the relevant section.

This post is licensed under CC BY 4.0 by the author.