How Coding Agents Work: From Raw LLMs to Autonomous Agents

A comprehensive technical report tracing the six-layer stack from next-token prediction to multi-agent orchestration that powers modern coding agents.

A Comprehensive Technical Report

February 2026 · Research compiled from academic papers, engineering blogs, and product documentation.


Executive Summary

Coding agents represent one of the most impactful applications of large language models. They have evolved from simple code-completion engines into autonomous systems that can navigate codebases, write and edit files, execute commands, debug failures, and ship working software — sometimes with minimal human intervention.

This report traces the full evolution from raw LLMs doing next-token prediction, through the key enablers — prompt engineering, chain-of-thought reasoning, tool use, and agentic loops — up to the multi-agent orchestration systems used in production today. Each layer builds on the previous one, forming a six-layer stack:

┌─────────┬────────────────────────────┬──────────────────────────────────────────────┐
│ Layer 6 │ MULTI-AGENT ORCHESTRATION  │ Coordinator + specialized workers, parallel  │
├─────────┼────────────────────────────┼──────────────────────────────────────────────┤
│ Layer 5 │ AGENTIC LOOP (ReAct)      │ while(not_done): think → act → observe       │
├─────────┼────────────────────────────┼──────────────────────────────────────────────┤
│ Layer 4 │ TOOL USE / FUNCTION CALL  │ File I/O, terminal, search, browser, APIs    │
├─────────┼────────────────────────────┼──────────────────────────────────────────────┤
│ Layer 3 │ SCAFFOLDING               │ Memory, context management, planning, safety │
├─────────┼────────────────────────────┼──────────────────────────────────────────────┤
│ Layer 2 │ PROMPT ENGINEERING        │ System prompts, few-shot, CoT instructions   │
├─────────┼────────────────────────────┼──────────────────────────────────────────────┤
│ Layer 1 │ RAW LLM                   │ Next-token prediction on code corpora        │
└─────────┴────────────────────────────┴──────────────────────────────────────────────┘

Table of Contents

  1. Layer 1: The Raw LLM — Next-Token Prediction on Code
  2. Layer 2: Prompt Engineering for Code
  3. Layer 3: Chain-of-Thought Reasoning
  4. Layer 4: Tool Use and Function Calling
  5. Layer 5: The Agentic Loop — ReAct and Beyond
  6. Layer 6: Scaffolding, Memory, and Context Management
  7. Real-World Coding Agent Architectures
  8. Multi-Agent Orchestration
  9. The Edit-Test-Debug Feedback Loop
  10. Key Takeaways and the Road Ahead
  11. References

Layer 1: The Raw LLM — Next-Token Prediction on Code

The Fundamental Mechanism

At their core, all code-generating LLMs work by next-token prediction. Given a sequence of tokens, the model predicts a probability distribution over the vocabulary for the next token. This seemingly simple mechanism, when applied at scale with billions of parameters trained on massive code corpora, produces remarkably coherent and functional code.

The generation pipeline works as follows:

┌─────────────┐   ┌───────────┐   ┌────────────┐   ┌──────────────────────────────┐
│ Source Code  │──>│ Tokenizer │──>│ Embeddings │──>│ Transformer (N layers of     │
└─────────────┘   └───────────┘   └────────────┘   │ self-attention)              │
                                                    └──────────────┬───────────────┘
                                                                   │
                                                                   ▼
┌──────────┐   ┌────────────────┐   ┌──────────────────────────────────────────────┐
│  repeat  │<──│ append to seq  │<──│ Prediction Head (linear + softmax)           │
└────┬─────┘   └────────────────┘   │ → Next Token (greedy / top-p / temperature)  │
     │                              └──────────────────────────────────────────────┘
     └──────────────────────────────────────────────────────────────────────────>↑
  1. Tokenization: Source code is broken into tokens — keywords, identifiers, operators, whitespace, and special characters. Subword tokenizers (BPE) handle rare identifiers by splitting them into known fragments.
  2. Embedding: Each token is mapped to a high-dimensional vector (e.g., 4096 dimensions).
  3. Transformer Processing: Multiple layers of self-attention build contextual representations. Each token “attends” to all previous tokens, capturing syntactic relationships (matching braces, variable scoping) and semantic patterns (algorithmic idioms).
  4. Prediction Head: A final linear layer + softmax produces a probability distribution over the entire vocabulary (~32K–128K tokens).
  5. Decoding: The next token is selected via greedy search, temperature-based sampling, or nucleus (top-p) sampling.
  6. Autoregressive Loop: The selected token is appended to the sequence and the process repeats, generating code one token at a time.
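
Steps 4 to 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how a production inference stack is written: `VOCAB`, `TOY_LOGITS`, and `toy_model` are hypothetical stand-ins for a trained Transformer, and only the decoding logic (temperature scaling, nucleus filtering, and the append-and-repeat loop) mirrors the description above.

```python
import math
import random

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_p(probs: list[float], top_p: float, rng: random.Random) -> int:
    """Nucleus sampling: draw from the smallest set of tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]

# Hypothetical toy "model": logits depend only on the previous token.
VOCAB = ["def", " foo", "(", "x", ")", ":", "<eos>"]

def peaked(index: int) -> list[float]:
    """Logits that strongly favor one vocabulary entry."""
    return [5.0 if j == index else 0.0 for j in range(len(VOCAB))]

TOY_LOGITS = {VOCAB[i]: peaked(i + 1) for i in range(len(VOCAB) - 1)}

def toy_model(tokens: list[str]) -> list[float]:
    return TOY_LOGITS[tokens[-1]]

def generate(prompt: list[str], max_new_tokens: int = 10, temperature: float = 1.0,
             top_p: float = 1.0, greedy: bool = False, seed: int = 0) -> list[str]:
    """Autoregressive loop: predict, pick a token, append it, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_model(tokens), temperature)
        if greedy:
            next_id = max(range(len(probs)), key=probs.__getitem__)
        else:
            next_id = sample_top_p(probs, top_p, rng)
        if VOCAB[next_id] == "<eos>":
            break
        tokens.append(VOCAB[next_id])
    return tokens

print("".join(generate(["def"], greedy=True)))
```

Greedy decoding always takes the most likely token, so it is deterministic. Sampling with a higher temperature or a larger `top_p` trades that determinism for variety.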

Why Next-Token Prediction Works for Code

Code is highly structured with lower entropy than natural language in many contexts:

  • Strict syntax: Programming languages have formal grammars. After def foo(, the model strongly expects parameter names and type annotations.
  • Repetitive patterns: Common idioms (iterating over lists, error handling, CRUD operations) appear millions of times in training data.
  • Local dependencies: Variable names, function calls, and import statements create predictable patterns within a file.
  • NL↔PL bridge: Docstrings, comments, and descriptive variable names create a natural mapping between intent and implementation.

Landmark Code LLMs

| Model | Year | Parameters | Training Data | HumanEval Score | Key Innovation |
|---|---|---|---|---|---|
| Codex (OpenAI) | 2021 | 12B | 159GB Python from 54M GitHub repos | 28.8% → 37.7% | First NL→code generation; created HumanEval benchmark; powered GitHub Copilot |
| CodeT5 (Salesforce) | 2021 | 220M | CodeSearchNet (6 languages) |  | Encoder-decoder with identifier-aware denoising |
| Code Llama (Meta) | 2023 | 7B–34B | 500B+ code tokens | 53% | Long context (16K), infilling, self-instruct fine-tuning |
| DeepSeek Coder | 2024 | 1.3B–33B | 2T tokens (87% code) | 79% (33B) | Fill-in-the-blank + repo-level training |
| Claude 3.5 Sonnet | 2024 | Undisclosed | Undisclosed | 92% | State-of-the-art reasoning + tool use integration |

Sources: Chen et al., 2021 · Rozière et al., 2023 · Towards Data Science

Limitations of Raw LLMs

A raw LLM generating code token-by-token has fundamental limitations:

  • No execution feedback: It cannot run its own code, see errors, or iterate.
  • No file access: It cannot read existing codebases or write files.
  • Hallucination: It invents plausible-looking but incorrect APIs, variable names, or logic.
  • Context limits: It can only “see” what fits in its context window (historically 2K–8K tokens, now 128K–200K).
  • One-shot generation: It produces output in a single pass with no ability to revise.

Each subsequent layer in the stack addresses these limitations.


Layer 2: Prompt Engineering for Code

Prompt engineering is the art of structuring inputs to extract better outputs from an LLM without changing the model itself. For code generation, several techniques have proven effective:

System Prompts

System prompts define the model’s persona, constraints, and output requirements. They are among the most direct levers for steering code quality:

You are a senior Python developer. Follow these rules:
- Target Python 3.12+ and use modern syntax (match statements, type unions with |)
- Add type hints to all function signatures
- Write Google-style docstrings
- Handle errors explicitly — never use bare except
- Return only the code, no explanations

Zero-Shot vs. Few-Shot

| Technique | Description | Best For |
|---|---|---|
| Zero-shot | Just describe the task; no examples | Simple, well-defined tasks |
| Few-shot | Provide 2–5 input→output examples | Pattern-following, format adherence, edge cases |

Few-shot prompting significantly improves format adherence and helps the model understand expected patterns. The quality and diversity of examples matter more than quantity.

Structured Output

Techniques to constrain the model’s output into predictable formats:

  • JSON mode: Force the model to output valid JSON (OpenAI’s response_format: { type: "json_object" }).
  • Schema-based: Define the exact structure of the expected output using JSON Schema.
  • Delimiters: Use fenced code blocks (```python) to delineate code from explanation.
  • Prompt templates: Parameterized prompts with placeholders, separating input variables from task logic from output format.
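
Two of these techniques, prompt templates and delimiters, can be combined in a short sketch. The template wording and the names `PROMPT`, `build_prompt`, and `extract_code` are illustrative assumptions rather than part of any library; the fence string is built programmatically only so that this example stays readable.

```python
import re
from string import Template
from typing import Optional

FENCE = "`" * 3  # the three-backtick delimiter

# Parameterized prompt: input variables are kept separate from task logic and output format.
PROMPT = Template(
    "You are a senior $language developer.\n"
    "Task: $task\n"
    "Return exactly one fenced code block and nothing else."
)

# Matches an opening fence with an optional language tag, then captures up to the closing fence.
CODE_BLOCK = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)

def build_prompt(language: str, task: str) -> str:
    """Fill the template's placeholders."""
    return PROMPT.substitute(language=language, task=task)

def extract_code(response: str) -> Optional[str]:
    """Return the body of the first fenced block, or None if the model ignored the format."""
    match = CODE_BLOCK.search(response)
    return match.group(1).strip() if match else None

reply = "Here you go:\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nDone."
print(build_prompt("Python", "print a greeting"))
print(extract_code(reply))
```

Returning `None` instead of raising lets the caller decide whether to retry the request when the model does not follow the requested format.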

Best Practices

  1. Be specific about language version, framework, and coding conventions.
  2. Provide context — existing code, file structure, imports, and related functions.
  3. Constrain the output — specify what format, what to include, what to omit.
  4. Use role prompts — “You are a senior backend engineer at a fintech company…”
  5. Include negative examples — show what NOT to do alongside what to do.

Sources: DataCamp · Real Python


Layer 3: Chain-of-Thought Reasoning

What is Chain-of-Thought?

Chain-of-Thought (CoT) prompting asks the model to generate intermediate reasoning steps before producing the final answer. Instead of jumping directly from problem statement to code, the model “thinks aloud” — decomposing the problem, identifying the algorithm, considering edge cases, and then implementing.

Task: Write a function to find the longest palindromic substring.

Let me think step by step:
1. A palindrome reads the same forwards and backwards.
2. Brute force: check all substrings — O(n³). Too slow.
3. Better approach: expand around centers — O(n²).
4. Each character and each pair of adjacent characters is a potential center.
5. For each center, expand outward while characters match.
6. Track the longest palindrome found.

def longest_palindrome(s: str) -> str:
    ...
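
For reference, here is one way the elided function body could be written by following the six reasoning steps above (expand around every center, track the longest match). It is an illustrative completion, not output from any particular model.

```python
def longest_palindrome(s: str) -> str:
    """Return the longest palindromic substring of s via center expansion, O(n^2) time."""
    if not s:
        return ""
    best_start, best_len = 0, 1

    def expand(left: int, right: int) -> tuple[int, int]:
        # Widen while the ends match; return (start, length) of the palindrome found.
        while left >= 0 and right < len(s) and s[left] == s[right]:
            left -= 1
            right += 1
        return left + 1, right - left - 1

    for center in range(len(s)):
        # Odd-length palindromes center on a character, even-length ones on a gap.
        for start, length in (expand(center, center), expand(center, center + 1)):
            if length > best_len:
                best_start, best_len = start, length
    return s[best_start:best_start + best_len]

print(longest_palindrome("babad"))
print(longest_palindrome("cbbd"))
```

When several palindromes share the maximum length, this version returns the first one it encounters, because it only replaces the current best on a strictly longer match.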

Structured Chain-of-Thought (SCoT) for Code

Li et al. (2023) introduced Structured CoT (SCoT), which exploits the fact that all code can be decomposed into three fundamental structures: sequence, branch, and loop. Instead of free-form reasoning, SCoT asks the model to structure its intermediate steps using these programming constructs.

Results: SCoT outperformed standard CoT by up to 13.79% in Pass@1 across HumanEval, MBPP, and MBCPP benchmarks. Human evaluators also preferred SCoT-generated programs.

Long Chain-of-Thought (Extended Thinking)

Modern reasoning models (OpenAI o1/o3, DeepSeek-R1, Claude with extended thinking) use long CoT — spending thousands of tokens reasoning before generating code. This enables:

  • Exploring and discarding multiple approaches
  • Catching logical errors before they become code
  • Verifying correctness through mental simulation
  • Handling multi-step problems that require planning

When CoT Helps (and When It Doesn’t)

| Task Complexity | CoT Benefit |
|---|---|
| Simple (string formatting, basic CRUD) | Minimal; adds unnecessary overhead |
| Medium (data structures, algorithms) | Moderate improvement |
| Complex (system design, multi-step logic) | Significant improvement |
| Adversarial (tricky edge cases) | Critical for correctness |

Sources: Li et al., 2023 · OpenReview: Revisiting CoT


Layer 4: Tool Use and Function Calling

The Breakthrough: From Text to Actions

Tool use (function calling) is the capability that transforms an LLM from a text generator into an agent. Instead of only producing text, the model can generate structured requests to execute external functions — reading files, running commands, searching the web, calling APIs.

How Function Calling Works

The technical flow has five steps:

┌──────────┐     ┌───────────┐     ┌──────────┐     ┌───────────┐     ┌──────────┐
│  Define  │────>│  Model    │────>│ Execute  │────>│  Return   │────>│  Final   │
│  Tools   │     │  Decides  │     │ Function │     │  Results  │     │ Response │
│ (schema) │     │ (call or  │     │ (app     │     │ (to model)│     │ (text)   │
│          │     │  respond) │     │  code)   │     │           │     │          │
└──────────┘     └───────────┘     └──────────┘     └───────────┘     └──────────┘
  1. Define tools: Provide JSON Schema descriptions of available functions — name, description, parameters with types.
  2. Model decides: Given the prompt and tool definitions, the model either generates a direct text response OR returns one or more tool calls with structured JSON arguments.
  3. Execute: Application code parses the tool call, invokes the actual function, and captures the result.
  4. Return results: The function output is sent back to the model as a message with role "tool".
  5. Final response: The model incorporates the tool result and generates the next response (which may include more tool calls).
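
Step 3 (the application-side execution) is where most of the custom code lives. The sketch below shows one provider-agnostic way to do it, assuming the model's call arrives as a tool name plus a JSON string of arguments. `TOOL_REGISTRY`, `tool`, `count_lines`, and `execute_tool_call` are hypothetical names introduced for illustration.

```python
import json
from typing import Any, Callable

# Maps tool names (as advertised to the model) to the Python functions that implement them.
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {}

def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Decorator that registers a function under its own name."""
    TOOL_REGISTRY[fn.__name__] = fn
    return fn

@tool
def count_lines(text: str) -> int:
    """Example tool: count the lines in a string."""
    return len(text.splitlines())

def execute_tool_call(name: str, arguments_json: str) -> str:
    """Run one model-requested call and return a string to send back as the tool message."""
    if name not in TOOL_REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments_json)
        result = TOOL_REGISTRY[name](**kwargs)
    except (json.JSONDecodeError, TypeError, ValueError) as exc:
        # Malformed JSON or wrong argument names are reported, not raised.
        return json.dumps({"error": f"{type(exc).__name__}: {exc}"})
    return json.dumps({"result": result})

print(execute_tool_call("count_lines", '{"text": "a\\nb"}'))
print(execute_tool_call("missing_tool", "{}"))
```

Errors are returned as ordinary tool results instead of exceptions, so the model can read the message and correct its next call.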

Example: OpenAI Function Calling

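import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def read_file(path: str, offset: int | None = None, limit: int | None = None) -> str:
    """Local function backing the tool: return the requested lines of a file."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    start = (offset or 1) - 1  # offset is a 1-indexed line number
    end = None if limit is None else start + limit
    return "".join(lines[start:end])


# Step 1: Define the tool
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read the contents of a file at the given path.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Path to the file"
                },
                "offset": {
                    "type": ["integer", "null"],
                    "description": "Line number to start reading from"
                },
                "limit": {
                    "type": ["integer", "null"],
                    "description": "Number of lines to read"
                }
            },
            # Strict mode requires every property to be listed in "required"
            # (optional ones are typed as nullable) and additionalProperties: false.
            "required": ["path", "offset", "limit"],
            "additionalProperties": False
        },
        "strict": True
    }
}]

# Step 2: Model decides to call the tool
messages = [{"role": "user", "content": "What does the main() function do in app.py?"}]
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=messages,
    tools=tools
)
assistant_message = response.choices[0].message
# assistant_message.tool_calls = [
#   { "id": "...", "function": { "name": "read_file",
#       "arguments": '{"path": "app.py", "offset": null, "limit": null}' } }
# ]

# Step 3: Application executes the function
messages.append(assistant_message)  # the assistant turn with tool_calls must precede its result
tool_call = assistant_message.tool_calls[0]  # assumes the model chose to call the tool
file_content = read_file(**json.loads(tool_call.function.arguments))

# Step 4: Return result to model
messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": file_content})

# Step 5: Model generates final answer
final = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)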

Key Technical Details

  • Tool definitions count as input tokens — they are injected into the system prompt in a special format the model was trained to understand.
  • Models are fine-tuned specifically to understand when to call functions and how to generate valid JSON arguments.
  • Parallel tool calls: Modern models can request multiple function calls in a single turn (e.g., reading three files simultaneously).
  • Strict mode: Ensures the model’s JSON arguments conform exactly to the schema (no extra fields, correct types).
  • Function descriptions are critical — they are the model’s only understanding of what a tool does. Poorly described tools lead to poor usage.

The Coding Agent Tool Kit

For coding agents specifically, the tool set typically includes:

| Category | Tools | Purpose |
|---|---|---|
| File I/O | read_file, write_file, edit_file, list_directory | Navigate and modify the codebase |
| Search | grep, glob, find_references, go_to_definition | Discover relevant code |
| Execution | run_command, run_tests, build | Execute and validate code |
| Web | web_search, fetch_url | Access documentation and examples |
| Planning | create_plan, update_task, todo_list | Track multi-step work |
| Sub-agents | spawn_agent, delegate_task | Parallelize complex work |
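
To make one of these tools concrete, here is a minimal sketch of an `edit_file` implementation built on exact string replacement. The signature and error messages are assumptions for illustration and real agents differ in the details, but the idea of refusing ambiguous matches carries over.

```python
from pathlib import Path

def edit_file(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new`; refuse missing or ambiguous matches."""
    if not old:
        return "Error: the text to replace must not be empty."
    file = Path(path)
    text = file.read_text(encoding="utf-8")
    count = text.count(old)
    if count == 0:
        return f"Error: text to replace was not found in {path}."
    if count > 1:
        return f"Error: text to replace occurs {count} times in {path}; include more context."
    file.write_text(text.replace(old, new, 1), encoding="utf-8")
    return f"Edited {path}: replaced 1 occurrence."
```

Requiring a unique match forces the model to quote enough surrounding context to identify the edit site, which guards against silently changing the wrong line. The returned strings double as the observation the model sees on its next turn.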

Sources: OpenAI Function Calling Guide · Analytics Vidhya


Layer 5: The Agentic Loop — ReAct and Beyond

The Core Insight

A raw LLM with tool access can call one function and respond. But to solve real coding tasks, it needs to call tools iteratively — reading code, making edits, running tests, reading errors, fixing bugs, and repeating. This requires a loop.

The ReAct Pattern

ReAct (Reasoning + Acting), introduced by Yao et al. (2022), is the foundational pattern for coding agents. The model interleaves reasoning traces with actions in a loop:

┌─────────────────────────────────────────────────────────────────┐
│                        ReAct Loop                               │
│                                                                 │
│   ┌───────────┐      ┌───────────┐      ┌─────────────┐        │
│   │  THOUGHT  │─────>│  ACTION   │─────>│ OBSERVATION  │        │
│   │           │      │           │      │              │        │
│   │ "I need   │      │ read_file │      │ [file        │        │
│   │  to read  │      │ edit_file │      │  contents,   │        │
│   │  the code │      │ run_cmd   │      │  cmd output, │        │
│   │  first…"  │      │ search    │      │  errors…]    │        │
│   └───────────┘      └───────────┘      └──────┬──────┘        │
│        ▲                                        │               │
│        │            ┌──────────┐                │               │
│        └────────────│ NOT DONE │◄───────────────┘               │
│                     └──────────┘                                │
│                          │                                      │
│                     ┌────▼─────┐                                │
│                     │   DONE   │──────> Final Response           │
│                     └──────────┘                                │
└─────────────────────────────────────────────────────────────────┘
Thought 1: I need to understand the current code before making changes.
            Let me read the main file.
Action 1:  read_file("src/main.py")
Observation 1: [file contents...]

Thought 2: The bug is on line 42 — the comparison uses == instead of >=.
            I need to fix this and also update the test.
Action 2:  edit_file("src/main.py", line=42, old="==", new=">=")
Observation 2: File edited successfully.

Thought 3: Now let me run the tests to verify the fix.
Action 3:  run_command("pytest tests/ -v")
Observation 3: 15 passed, 0 failed.

Thought 4: All tests pass. The fix is correct.
Action 4:  [Final response to user]


Why ReAct Works for Coding

| Component | Role in Coding |
|---|---|
| Thought | Plan the approach, diagnose errors, consider alternatives |
| Action | Read files, edit code, run commands, search |
| Observation | Receive file contents, command output, test results, error messages |

The key insight is that reasoning without actions leads to hallucination (the model invents plausible but wrong code), while actions without reasoning lead to random trial-and-error. ReAct combines both.

The Agentic Loop in Practice

Every coding agent implements some variant of this loop:

def agent_loop(user_request: str, tools: list, llm: "LLM") -> str:
    # LLM, SYSTEM_PROMPT, and execute_tool are placeholders for the agent's own components.
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_request}
    ]

    while True:
        # Ask the LLM what to do next
        response = llm.generate(messages, tools=tools)

        if response.has_tool_calls():
            # Record the assistant turn so each tool result follows the call that produced it
            messages.append({
                "role": "assistant",
                "content": response.text,
                "tool_calls": response.tool_calls
            })
            # Execute each tool call
            for tool_call in response.tool_calls:
                result = execute_tool(tool_call.name, tool_call.arguments)
                messages.append({
                    "role": "tool",
                    "content": str(result),
                    "tool_call_id": tool_call.id
                })
        else:
            # No tool calls = agent is done
            return response.text

This is the fundamental building block of all coding agents. Claude Code calls it the “nO loop.” SWE-Agent calls it the forward() method. GitHub Copilot’s agent mode describes it as “iterating until reaching a final state.” The names differ, but the pattern is universal.
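
The loop can be exercised without a real model. In the sketch below, `ScriptedLLM`, `Response`, `ToolCall`, and `run_agent` are hypothetical names: the scripted client replays canned responses so the control flow is observable. A turn limit is added as an assumption, since an unbounded `while True` can spin forever if the model never stops calling tools.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    id: str
    name: str
    arguments: dict

@dataclass
class Response:
    text: str = ""
    tool_calls: list[ToolCall] = field(default_factory=list)

class ScriptedLLM:
    """Hypothetical stand-in for a model client: replays canned responses in order."""
    def __init__(self, script: list[Response]):
        self._script = iter(script)

    def generate(self, messages: list[dict], tools: dict) -> Response:
        return next(self._script)

def run_agent(request: str, tools: dict, llm, max_turns: int = 20) -> str:
    """Think/act/observe loop with a hard cap on the number of model turns."""
    messages = [{"role": "user", "content": request}]
    for _ in range(max_turns):
        response = llm.generate(messages, tools)
        if not response.tool_calls:
            return response.text  # no tool calls means the agent is done
        messages.append({"role": "assistant", "content": response.text,
                         "tool_calls": response.tool_calls})
        for call in response.tool_calls:
            try:
                result = tools[call.name](**call.arguments)
            except Exception as exc:  # surface failures to the model instead of crashing
                result = f"Error: {exc}"
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    raise RuntimeError(f"No final answer after {max_turns} turns")

tools = {"run_tests": lambda: "15 passed, 0 failed"}
llm = ScriptedLLM([
    Response(tool_calls=[ToolCall("call_1", "run_tests", {})]),
    Response(text="All tests pass."),
])
print(run_agent("Fix the failing comparison", tools, llm))
```

Swapping `ScriptedLLM` for a real client changes only the `generate` call; the loop itself stays the same.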

Variations on the Loop

| Pattern | Description | Used By |
|---|---|---|
| Simple ReAct | Single loop, one model | Claude Code, SWE-Agent |
| Plan-then-Execute | Generate a full plan first, then execute steps | Devin, MetaGPT |
| Architect/Editor | One model reasons, another edits code | Aider |
| Hierarchical | Lead agent delegates subtasks to sub-agents | Claude Research, OpenHands |
| State Machine | Finite-state transitions between defined phases | LangGraph agents |

Sources: Yao et al., 2022 · Prompt Engineering Guide


Layer 6: Scaffolding, Memory, and Context Management

What is Agent Scaffolding?

Scaffolding is the software architecture built around an LLM to enable it to perform complex, goal-driven tasks. It includes prompt templates, memory systems, tool interfaces, control flow, safety guardrails, and feedback mechanisms. The metaphor is apt: just as construction scaffolding supports a building under construction, agent scaffolding supports the LLM as it works.

Memory Systems

Coding agents need memory that extends beyond the current context window:

Short-Term Memory (The Context Window)

The conversation history — all messages, tool calls, tool results, and reasoning traces — within the current context window. This is the agent’s “working memory.”

Challenge: Context Bloat. As the agent iterates, the context fills with irrelevant past errors, verbose file contents, and failed approaches. This increases cost and latency and degrades reasoning quality.

Long-Term Memory

Mechanisms to persist information across and beyond context windows:

| Approach | Description | Used By |
|---|---|---|
| Markdown files | Simple text files storing project notes, conventions, and learned patterns | Claude Code (CLAUDE.md files) |
| Vector databases | Embedding-based retrieval of past interactions and code snippets | Cursor, some RAG-based agents |
| Conversation summaries | Compressed versions of past interactions | Claude Code’s “Compressor wU2” |
| Plan persistence | Saving TODO lists and plans to disk | Claude Code (TodoWrite) |
| Repository indexing | Pre-computed index of codebase structure | Devin, Cursor |

Context Management Strategies

The most critical engineering challenge in coding agents is managing the finite context window:

Claude Code’s approach: “Compressor wU2” triggers at ~92% context utilization. It summarizes the conversation, preserves critical information (current task state, file locations, key decisions), and moves details to Markdown-based long-term storage. This is a pragmatic choice — simple files over complex vector databases — prioritizing reliability and debuggability.

SWE-Agent’s approach: A HistoryProcessor compresses conversation history to fit context windows. It uses configurable retention policies to keep recent and relevant history while truncating older turns.

Multi-agent approach (Anthropic Research): Subagents operate in separate context windows, effectively multiplying the available context. Completed work is summarized before handoff. Fresh subagents with clean contexts can be spawned while maintaining continuity through structured summaries.
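
The threshold-triggered summarization idea can be sketched generically. This is not the implementation used by any of the agents above: the 4-characters-per-token estimate, the `keep_recent` parameter, and the `summarize` callback (which in practice would be another model call) are all assumptions made for illustration.

```python
from typing import Callable

Message = dict  # shape assumed here: {"role": str, "content": str}

def estimate_tokens(messages: list[Message]) -> int:
    """Rough heuristic (an assumption): about 4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4

def compact(messages: list[Message], context_limit: int,
            summarize: Callable[[list[Message]], str],
            threshold: float = 0.92, keep_recent: int = 4) -> list[Message]:
    """Once usage crosses the threshold, replace older turns with a single summary message."""
    if estimate_tokens(messages) < threshold * context_limit:
        return messages
    system, rest = messages[:1], messages[1:]  # assumes messages[0] is the system prompt
    if len(rest) <= keep_recent:
        return messages
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary = {"role": "user", "content": "Summary of earlier work:\n" + summarize(old)}
    return system + [summary] + recent

history = [{"role": "system", "content": "sys"}]
history += [{"role": "user", "content": "x" * 400} for _ in range(10)]
compacted = compact(history, context_limit=1000,
                    summarize=lambda old: f"{len(old)} earlier messages")
print(len(history), len(compacted))
```

A real implementation would also need to avoid cutting between a tool call and its result, which this sketch ignores.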

Planning and Task Tracking

Complex coding tasks require explicit planning:

// Claude Code's TodoWrite format
[
  { "id": "1", "content": "Read and understand the failing test", "status": "completed" },
  { "id": "2", "content": "Find the root cause in auth.py", "status": "in_progress" },
  { "id": "3", "content": "Implement the fix", "status": "pending" },
  { "id": "4", "content": "Run test suite to verify", "status": "pending" },
  { "id": "5", "content": "Check for regressions", "status": "pending" }
]

The current plan is injected as a system message after each tool use, keeping the agent oriented even as the context grows.
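
A sketch of how such a plan might be rendered and re-injected follows. The checkbox markers and the `render_plan` and `set_status` helpers are illustrative assumptions; only the `id`, `content`, and `status` fields come from the format shown above.

```python
VALID_STATUSES = {"pending", "in_progress", "completed"}

def set_status(todos: list[dict], todo_id: str, status: str) -> list[dict]:
    """Return a new task list with one item's status changed (the input is not mutated)."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    return [dict(t, status=status) if t["id"] == todo_id else t for t in todos]

def render_plan(todos: list[dict]) -> dict:
    """Format the task list as a message that can be appended after a tool result."""
    marks = {"completed": "[x]", "in_progress": "[~]", "pending": "[ ]"}
    lines = [f'{marks[t["status"]]} {t["id"]}. {t["content"]}' for t in todos]
    return {"role": "system", "content": "Current plan:\n" + "\n".join(lines)}

todos = [
    {"id": "1", "content": "Read the failing test", "status": "completed"},
    {"id": "2", "content": "Implement the fix", "status": "pending"},
]
reminder = render_plan(set_status(todos, "2", "in_progress"))
print(reminder["content"])
```

Because the plan is re-rendered from structured data on every turn, the reminder stays short and current no matter how long the conversation history becomes.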

Safety Guardrails

Production coding agents implement multiple safety layers:

  • Permission systems: Separate approval tiers for read, write, and execute operations.
  • Command sanitization: Risk-level classification for shell commands (safe: ls, cat; risky: rm, sudo).
  • Max iteration limits: Prevent infinite loops (typically 20–50 turns).
  • Diff-first workflow: Show proposed changes before applying them.
  • Sandboxing: Run code in isolated Docker containers (SWE-Agent, OpenHands).
  • Kill switches: Allow humans to abort runaway agents.
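
As an illustration of command risk classification, the sketch below sorts shell commands into three tiers. The allow and deny lists and the tier names are assumptions chosen for the example; a production agent would need far more thorough policies, and string inspection alone is not a substitute for sandboxing.

```python
import shlex

SAFE = {"ls", "cat", "pwd", "grep", "head", "tail"}
RISKY = {"rm", "sudo", "chmod", "chown", "dd", "mkfs"}
SHELL_METACHARS = set("|;&><`$")

def classify_command(command: str) -> str:
    """Return 'safe', 'risky', or 'needs_approval' for a shell command string."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse failures
        return "risky"
    if not tokens:
        return "needs_approval"
    # Compare the basename of every token, so "/bin/rm" and "ls; rm x" are both caught.
    # This is deliberately conservative: a file that happens to be named "rm" is flagged too.
    names = {tok.rsplit("/", 1)[-1] for tok in tokens}
    if names & RISKY:
        return "risky"
    # Pipes, redirects, and substitutions can hide arbitrary behavior: never auto-approve.
    if any(ch in command for ch in SHELL_METACHARS):
        return "needs_approval"
    return "safe" if tokens[0] in SAFE else "needs_approval"

for cmd in ["ls -la", "sudo rm -rf /tmp/x", "cat a.txt | sh", "pytest tests/"]:
    print(f"{cmd!r}: {classify_command(cmd)}")
```

Anything not explicitly recognized falls into `needs_approval`, so the default is to ask a human instead of guessing.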

Sources: ZBrain · ZenML


Real-World Coding Agent Architectures

Claude Code (Anthropic)

Architecture: Single-threaded master loop — deliberately simple.

User Input → System Prompt + Tools → Claude API → Tool Calls? 
    ├── Yes → Execute Tools → Append Results → Loop Back ↑
    └── No  → Return Text Response to User

Design philosophy: One flat message history, one main loop, no competing agent personas. Simplicity yields controllability.

Key components:

  • Tools: View, Edit, Write, Bash, Glob, GrepTool, LS, TodoWrite, Task (sub-agent)
  • Context management: Compressor at 92% utilization → Markdown long-term memory
  • Planning: TodoWrite with JSON task lists injected as system messages
  • Sub-agents: Task tool spawns independent agents with their own context windows for parallel work
  • Safety: Tiered permission system, command risk classification

Source: ZenML · PromptLayer


GitHub Copilot Agent Mode

Architecture: Orchestrator with deep IDE integration.

How it works:

  1. User provides a natural-language prompt in VS Code.
  2. The prompt is augmented with workspace context (file structure, open files, diagnostics).
  3. A system prompt instructs Copilot to keep iterating until reaching a final state.
  4. Copilot reads files, edits code, runs terminal commands, and detects syntax errors, test failures, and build errors.
  5. It course-corrects automatically based on feedback.

Two modes:

  • Agent Mode (in-IDE): Synchronous pair programming — edits, runs, debugs in real time.
  • Coding Agent (GitHub Actions): Asynchronous teammate — assign an issue, it creates a PR with full implementation.

Extensibility: Supports MCP (Model Context Protocol) servers for additional tool integration.

Source: GitHub Blog


SWE-Agent (Princeton University)

Architecture: LLM + Agent-Computer Interface (ACI) + Docker sandbox.

Key innovation — the ACI: An interface designed specifically for LLMs (not humans) to interact with computers:

  • Windowed file viewing: Shows ~100 lines at a time. Research showed agents get overwhelmed by more, just as humans do.
  • Built-in linter: Catches formatting errors immediately. 51.7% of SWE-Agent’s trajectories had at least one failed edit caught by the linter before submission.
  • Explicit feedback: “Command ran successfully with no output” instead of empty responses.
  • Context indicators: Current file, line number, and working directory shown with every command response.
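
The windowed viewer and explicit-feedback ideas are simple to sketch. The output format below is an illustrative assumption, not SWE-Agent's exact format, and `view_window` and `describe_output` are hypothetical names.

```python
def view_window(lines: list[str], path: str, start: int, window: int = 100) -> str:
    """Render `window` lines from 1-indexed line `start`, with position indicators."""
    total = len(lines)
    start = max(1, min(start, max(total, 1)))  # clamp into the file
    end = min(start + window - 1, total)
    header = f"[File: {path} ({total} lines total)]"
    above = f"({start - 1} more lines above)"
    below = f"({total - end} more lines below)"
    body = [f"{n}:{lines[n - 1]}" for n in range(start, end + 1)]
    return "\n".join([header, above, *body, below])

def describe_output(output: str) -> str:
    """Replace empty command output with an explicit confirmation."""
    return output if output.strip() else "Command ran successfully with no output."

source = [f"line {i}" for i in range(1, 251)]
view = view_window(source, "src/main.py", start=101)
print("\n".join(view.split("\n")[:3]))
print(describe_output(""))
```

Line numbers in the body give the model stable coordinates for a follow-up edit, and the "more lines" indicators tell it whether scrolling is worthwhile.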

Components: SWEEnv (Docker environment) → Agent (LLM orchestrator with forward() method) → HistoryProcessor (context compression) → Parser (action extraction)

Performance: Solved 12.5% of SWE-bench tickets (4× better than raw LLM prompting).

Source: SWE-Agent Docs · Yang et al., 2024


Aider

Architecture: Terminal-based AI pair programmer with Architect/Editor separation.

Key insight: Separate code reasoning from code editing using two different models:

User Request → Architect Model (e.g., Claude Sonnet)
                  ↓ (high-level solution plan)
               Editor Model (e.g., GPT-4)
                  ↓ (precise code edits)
               Git Commit
  1. Architect model: Focuses on problem-solving, algorithm selection, and solution design. Uses the full reasoning capabilities of frontier models.
  2. Editor model: Translates the architect’s plan into specific, correct code edits in the repository’s existing style.

Works directly in git repositories — edits files and commits changes with meaningful messages. Over 80% of Aider’s own codebase was written by Aider itself.

Source: Aider Chat · GitHub: Aider


Devin (Cognition Labs)

Architecture: Fully autonomous AI software engineer with its own managed cloud environment.

  • Has its own code editor, web browser, and terminal in a cloud sandbox.
  • Interactive planning: Creates and updates plans as it works.
  • Repository indexing: Automatically indexes codebases, generating architecture diagrams and docs.
  • Handles multi-file changes, debugging, testing, and deployment end-to-end.

Source: Devin.ai · Cognition Blog


OpenHands (formerly OpenDevin)

Architecture: Open-source platform with modular agent types and Docker sandboxing.

  • CodeAct Agent (default): Uses code-based actions.
  • Browsing Agent: Specialized for web research.
  • Isolated Docker containers for all code execution.
  • Multi-LLM support: OpenAI, Anthropic, open-source models via Ollama.

Source: Wang et al., 2024


Cursor

Architecture: AI-native code editor (VS Code fork) with deep IDE integration.

  • Tab completion: Inline suggestions with codebase awareness.
  • Chat: Conversational coding with full context.
  • Agent mode: Autonomous multi-step execution — reads files, edits code, runs terminals, uses web search, detects and fixes errors iteratively.

Comparative Summary

| Agent | Loop Type | Sandboxing | Sub-agents | IDE Integration | Planning |
|---|---|---|---|---|---|
| Claude Code | Simple while-loop | Optional | Yes (Task tool) | Terminal | TodoWrite JSON |
| GitHub Copilot | Orchestrator | VS Code terminal | No | Deep (VS Code) | Implicit |
| SWE-Agent | forward() + ACI | Docker | No | None (CLI) | Implicit in prompts |
| Aider | Architect/Editor | None | No | Terminal/git | Architect model |
| Devin | Plan-Execute | Cloud sandbox | Yes | Own IDE | Interactive plans |
| OpenHands | CodeAct loop | Docker | Multiple agent types | Web UI | Implicit |
| Cursor | IDE-integrated loop | Terminal | No | Deep (VS Code fork) | Implicit |

Multi-Agent Orchestration

Why Multi-Agent?

Single-agent systems hit fundamental limits:

  • Context window overflow: Complex tasks produce more history than fits in one context.
  • Parallelism: Many subtasks can be explored simultaneously.
  • Specialization: Different agents optimized for different roles (coding, testing, reviewing, researching).
  • Information compression: Each sub-agent distills findings into concise summaries.

The Orchestrator-Worker Pattern

The most proven production pattern, used by Anthropic’s Claude Research system:

                    ┌─────────────────┐
                    │   Lead Agent    │
                    │  (Orchestrator) │
                    │   Claude Opus   │
                    └────────┬────────┘
                             │ spawns
              ┌──────────────┼──────────────┐
              │              │              │
      ┌───────▼──────┐ ┌────▼─────┐ ┌──────▼───────┐
      │  Sub-agent 1 │ │ Sub-agent │ │  Sub-agent N │
      │  (Sonnet)    │ │ 2 (Sonnet)│ │  (Sonnet)    │
      │  Research    │ │ Code impl │ │  Testing     │
      └───────┬──────┘ └────┬─────┘ └──────┬───────┘
              │              │              │
              └──────────────┼──────────────┘
                             │ results
                    ┌────────▼────────┐
                    │   Lead Agent    │
                    │  (Synthesizes)  │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │  Final Output   │
                    └─────────────────┘

Results from Anthropic: The multi-agent system with Opus lead + Sonnet workers outperformed single-agent Opus by 90.2% on internal research evaluations. Token usage explained 80% of performance variance — multi-agent architectures effectively scale token usage by giving each sub-agent its own context window.
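
The control flow of the orchestrator-worker pattern can be sketched with a thread pool. Everything here is a stand-in: `run_subagent` returns a canned summary where a real worker would run its own agent loop in a fresh context, and `plan` is a hypothetical callback for the lead agent's task decomposition.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_subagent(task: dict) -> dict:
    """Hypothetical worker: a real one would run a full agent loop in its own context."""
    return {"objective": task["objective"], "summary": f"Findings for: {task['objective']}"}

def orchestrate(request: str, plan: Callable[[str], list[dict]],
                worker: Callable[[dict], dict] = run_subagent, max_workers: int = 4) -> str:
    """Lead agent: split the request, run workers in parallel, then synthesize summaries."""
    tasks = plan(request)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, tasks))  # map preserves task order
    return "\n".join(f"- {r['objective']}: {r['summary']}" for r in results)

def simple_plan(request: str) -> list[dict]:
    """Stand-in for the lead agent's decomposition step."""
    return [{"objective": f"{request} / research"}, {"objective": f"{request} / tests"}]

report = orchestrate("add login", simple_plan)
print(report)
```

Only the short summaries flow back to the lead agent, which is the compression effect described above: each worker's full working history stays in its own context.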

Key Lessons from Production Multi-Agent Systems

  1. Teach the orchestrator to delegate well: Each sub-agent needs a clear objective, output format, tool guidance, and task boundaries.
  2. Scale effort to query complexity: A simple task needs 1 agent with 3–10 tool calls. A complex task may need 10+ sub-agents running in parallel.
  3. Tool design is critical: Agents given poorly described tools fail at a basic level. An Anthropic tool-testing agent that rewrote tool descriptions achieved a 40% decrease in task completion time.
  4. Start wide, then narrow: Explore broadly before drilling into specifics.
  5. Parallelize tool calls: Running tool calls in parallel cut research time by up to 90% for complex queries.
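
Lessons 1 and 2 translate naturally into data. The sketch below is one possible encoding, with hypothetical names (`SubAgentBrief`, `plan_effort`); the "simple" budget echoes the figures above, while the per-agent tool-call cap for "complex" tasks is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubAgentBrief:
    """The four things a delegated task should spell out (lesson 1)."""
    objective: str
    output_format: str
    tool_guidance: str
    boundaries: str

    def to_prompt(self) -> str:
        return (
            f"Objective: {self.objective}\n"
            f"Output format: {self.output_format}\n"
            f"Tools: {self.tool_guidance}\n"
            f"Boundaries: {self.boundaries}"
        )


def plan_effort(complexity: str) -> dict[str, int]:
    """Map a coarse complexity label to an effort budget (lesson 2)."""
    budgets = {
        # 1 agent, upper end of the 3-10 tool-call range quoted above.
        "simple": {"sub_agents": 1, "max_tool_calls": 10},
        # 10 sub-agents; the per-agent cap here is an illustrative assumption.
        "complex": {"sub_agents": 10, "max_tool_calls": 10},
    }
    if complexity not in budgets:
        raise ValueError(f"unknown complexity label: {complexity!r}")
    return budgets[complexity]


brief = SubAgentBrief(
    objective="Find where session tokens are validated",
    output_format="Bullet list of file paths with one-line notes",
    tool_guidance="Prefer code search over reading whole files",
    boundaries="Read-only; do not edit files",
)
print(brief.to_prompt())
print(plan_effort("simple"))
```

Making the brief a typed object rather than free text makes it harder for the orchestrator to omit one of the four fields when it delegates.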

Multi-Agent Frameworks

| Framework | Creator | Architecture |
|---|---|---|
| AutoGen | Microsoft | Chatbot-style multi-agent conversations |
| LangGraph | LangChain | Graph-based stateful orchestration |
| MetaGPT | Academic | Agents simulate a software team (PM, Architect, Engineer, QA) |
| CrewAI | CrewAI | Task-focused role-based coordination |
| Agent Development Kit | Google | Formalized stateful orchestration |

Source: Anthropic Engineering


The Edit-Test-Debug Feedback Loop

The defining capability that separates a coding agent from a raw LLM is the ability to iteratively refine its output. The edit-test-debug loop is where theory meets practice:

┌──────┐     ┌──────┐     ┌──────┐     ┌─────────┐
│ PLAN │────>│ EDIT │────>│ TEST │────>│ OBSERVE │
└──────┘     └──────┘     └──────┘     └────┬────┘
   ▲                                        │
   │              ┌───────┐                 │
   └──────────────│ DEBUG │<────────────────┘
                  └───────┘
                (if errors)
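
The loop in the diagram can be written as a small driver function. This is a minimal sketch under stated assumptions: `edit_test_debug`, `fake_model`, and `run_tests` are hypothetical names, and `fake_model` stands in for an LLM by returning a buggy attempt first and a corrected one once it receives an error.

```python
from typing import Callable, Optional


def edit_test_debug(
    propose_edit: Callable[[Optional[str]], str],
    run_tests: Callable[[str], Optional[str]],
    max_turns: int = 10,
) -> tuple[Optional[str], int]:
    """Iterate edit -> test -> observe until tests pass or turns run out.

    propose_edit receives the last error (None on the first turn).
    run_tests returns None on success, otherwise an error message.
    Returns (passing_code_or_None, turns_used).
    """
    error: Optional[str] = None
    for turn in range(1, max_turns + 1):
        code = propose_edit(error)   # PLAN + EDIT (DEBUG once error is set)
        error = run_tests(code)      # TEST + OBSERVE
        if error is None:
            return code, turn
    return None, max_turns


def fake_model(error: Optional[str]) -> str:
    # Stand-in for an LLM: buggy first attempt, fixed after feedback.
    if error is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def run_tests(code: str) -> Optional[str]:
    namespace: dict = {}
    try:
        exec(code, namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


code, turns = edit_test_debug(fake_model, run_tests)
print(turns)  # 2
```

The `max_turns` cap matters as much as the loop itself: without it, an agent that cannot fix the failure keeps iterating indefinitely.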

Evidence from Real Agent Behavior

Research on SWE-Agent’s behavior across SWE-bench reveals a clear pattern:

| Phase | Turns | Dominant Actions |
|---|---|---|
| Understanding | 1–2 | Search files, read code, navigate directory structure |
| Initial implementation | 2–5 | Edit files, run code to check changes |
| Refinement | 5–10 | Edit + run in tight loops, fixing errors |
| Submission | ~10 | Submit solution (if successful) |

Key finding: Successful runs tend to finish in few turns, and the longer an agent keeps iterating, the less likely it is to succeed. Getting stuck in long iteration loops is the most common failure mode.

Self-Correction Strategies

Modern coding agents employ multiple self-correction mechanisms:

| Strategy | Description |
|---|---|
| Linter feedback | Immediate syntactic error correction (in SWE-Agent experiments, 51.7% of trajectories had at least one edit rejected by the linter) |
| Test-driven development | Run the test suite after each edit; use failures to guide the next edit |
| Error message parsing | Read compiler/runtime errors diagnostically to identify root causes |
| Bug reproduction first | Reproduce the bug before attempting a fix, then verify the fix resolves it |
| Regression checking | Ensure fixes don’t break existing functionality |
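
Linter feedback is the easiest of these strategies to illustrate. The sketch below gates an edit on a syntax check using Python's standard `ast` module; `lint` and `apply_edit` are hypothetical helpers, and this is not SWE-Agent's actual linter integration, only the same idea in miniature.

```python
import ast
from typing import Optional


def lint(source: str) -> Optional[str]:
    """Return None if the source parses, else a short error description."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def apply_edit(
    lines: list[str], start: int, end: int, replacement: list[str]
) -> tuple[list[str], Optional[str]]:
    """Replace lines[start:end] (0-indexed, end exclusive) only if the result parses.

    On a syntax error the original lines come back unchanged together
    with the error message, so the agent can retry with that feedback.
    """
    candidate = lines[:start] + replacement + lines[end:]
    error = lint("\n".join(candidate) + "\n")
    if error is not None:
        return lines, error
    return candidate, None


original = ["def greet(name):", "    return 'hi ' + name"]

# An edit with an unclosed parenthesis is rejected and the file stays intact.
unchanged, err = apply_edit(original, 1, 2, ["    return 'hi ' + (name"])
print(err is not None, unchanged == original)

# A valid edit is applied.
updated, err = apply_edit(original, 1, 2, ["    return f'hi {name}'"])
print(updated, err)
```

Rejecting the edit before it reaches disk keeps a syntax error from contaminating later turns, where it would otherwise show up as a confusing test failure.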

The Instruction Layer

Agents are guided by explicit instructions in their system prompts, mirroring advice a senior developer would give a junior:

  • “Always start by trying to replicate the bug.”
  • “If you run a command and it doesn’t work, try a different command.”
  • “When you think you’ve fixed the bug, re-run the bug reproduction script.”
  • “When editing files, it is easy to accidentally specify a wrong line number.”

These instructions encode software engineering wisdom into the agent’s behavior loop.
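
The first and third instructions describe a reproduce-then-verify workflow. A minimal sketch of that control flow follows, with hypothetical names throughout (`fix_with_reproduction`, `reproduce`, `apply_fix`) and a toy division bug standing in for a real issue.

```python
from typing import Callable


def fix_with_reproduction(
    reproduce: Callable[[], bool],
    apply_fix: Callable[[], None],
) -> str:
    """Reproduce the bug, apply a fix, then re-run the same reproduction.

    reproduce() returns True while the bug is still observable.
    """
    if not reproduce():
        return "not reproduced"   # no evidence of the bug, so do not edit blindly
    apply_fix()
    if reproduce():
        return "fix failed"       # same script still fails, back to debugging
    return "fixed"


# Toy system under test: division that crashes on a zero divisor.
state = {"divide": lambda a, b: a / b}


def reproduce() -> bool:
    try:
        state["divide"](1, 0)
    except ZeroDivisionError:
        return True
    return False


def apply_fix() -> None:
    state["divide"] = lambda a, b: a / b if b else float("inf")


print(fix_with_reproduction(reproduce, apply_fix))  # fixed
```

Using the same `reproduce` callable before and after the fix is the point of the pattern: the agent verifies against the original failure rather than against its own belief that the fix works.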

Source: Pragmatic Engineer


Key Takeaways and the Road Ahead

The Six-Layer Stack

Coding agents, from the simplest to the most sophisticated, build on the same foundational stack; simpler agents just stop at a lower layer:

  1. Raw LLM: Next-token prediction provides the base capability to generate syntactically correct, semantically meaningful code.
  2. Prompt Engineering: System prompts, few-shot examples, and structured output constraints make the LLM’s output more reliable and predictable.
  3. Chain-of-Thought: Explicit reasoning steps improve the model’s ability to solve complex, multi-step coding problems.
  4. Tool Use: Function calling transforms the model from a text generator into an agent that can read files, execute commands, and interact with the real world.
  5. Agentic Loop: The ReAct pattern — think, act, observe, repeat — enables iterative problem-solving with self-correction.
  6. Multi-Agent Orchestration: Coordinating multiple agents with separate contexts enables tackling problems too complex for any single agent.

What Makes a Coding Agent Effective

The research reveals several consistent patterns across successful coding agents:

| Factor | Evidence |
|---|---|
| Simple architecture | Claude Code’s single while-loop outperforms complex frameworks |
| Well-designed tools | SWE-Agent’s ACI (100-line windows, linter integration) dramatically improves performance |
| Iterative refinement | The edit-test-debug loop, not one-shot generation, is what produces working code |
| Context management | Agents that manage their context well (compression, summarization, sub-agents) handle larger tasks |
| Planning | Explicit task tracking (TODO lists, plans) prevents agents from losing their way |
| Token scaling | More tokens ≈ better results; multi-agent systems scale token usage effectively |
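
Context management is the least visible factor in the table, so a small illustration may help. The sketch below collapses older messages into a single summary line once a token budget is exceeded. Everything here is an assumption made for illustration: the names `estimate_tokens` and `compact`, the rough four-characters-per-token estimate, and the truncation that stands in for a model-written summary.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def compact(messages: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """Collapse older messages into one summary line when over budget.

    The most recent `keep_recent` messages are kept verbatim. The result
    is shorter but is not guaranteed to fit within the budget.
    """
    total = sum(estimate_tokens(m) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return messages
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]
    # A real agent would ask the model to summarize `old`; this stub truncates.
    summary = "SUMMARY: " + " | ".join(m[:20] for m in old)
    return [summary] + recent


history = [
    "user: the login test fails intermittently on CI",
    "tool: grep found 3 usages of create_session in auth/",
    "tool: pytest output shows a timeout in test_login",
    "assistant: increasing the retry window in session.py",
]
for line in compact(history, budget=20):
    print(line)
```

Keeping the latest messages verbatim preserves the detail the agent needs for its next action, while the summary retains only the gist of earlier steps.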

Open Challenges

  • Long-horizon tasks: Agents still struggle with tasks requiring sustained coherence over hundreds of steps.
  • Codebase-scale reasoning: Understanding the full architecture of a large codebase remains beyond current capabilities.
  • Specification ambiguity: Agents often solve the wrong problem when requirements are vague.
  • Cost: Multi-agent systems use about 15× more tokens than chat interactions, which makes them effective but expensive.
  • Evaluation: Benchmarks (SWE-bench, HumanEval) don’t fully capture real-world software engineering complexity.
  • Safety: Autonomous code execution in production environments requires robust guardrails that don’t exist yet at scale.

The Trajectory

The trajectory is clear: coding agents are evolving from assistants (human-in-the-loop, single-turn) to collaborators (multi-turn, iterative) to autonomous teammates (asynchronous, task-to-PR). Each layer of the stack enables the next level of autonomy. The raw LLM provides the intelligence; everything built on top channels that intelligence into reliable, safe, and effective software engineering.


References

  1. Chen, M., et al. (2021). “Evaluating Large Language Models Trained on Code.” arXiv:2107.03374. Link
  2. Rozière, B., et al. (2023). “Code Llama: Open Foundation Models for Code.” Meta AI. Link
  3. Li, J., et al. (2023). “Structured Chain-of-Thought Prompting for Code Generation.” arXiv:2305.06599. Link
  4. Yao, S., et al. (2022). “ReAct: Synergizing Reasoning and Acting in Language Models.” arXiv:2210.03629. Link
  5. Yang, J., et al. (2024). “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering.” arXiv:2405.15793. Link
  6. Wang, X., et al. (2024). “OpenHands: An Open Platform for AI Software Developers as Generalist Agents.” arXiv:2407.16741. Link
  7. OpenAI. “Function Calling Guide.” OpenAI API Documentation. Link
  8. Anthropic. “How we built our multi-agent research system.” Anthropic Engineering Blog. Link
  9. GitHub. “Agent mode 101: All about GitHub Copilot’s agentic coding experience.” GitHub Blog. Link
  10. ZenML. “Claude Code Agent Architecture: Single-Threaded Master Loop for Autonomous Coding.” Link
  11. PromptLayer. “Claude Code: Behind the Scenes of the Master Agent Loop.” Link
  12. Pragmatic Engineer. “How do AI software engineering agents work?” Link
  13. ZBrain. “Agent Scaffolding Explained.” Link
  14. Prompt Engineering Guide. “ReAct Prompting.” Link
  15. Aider. “Separating code reasoning and editing.” Link
  16. Towards Data Science. “Cracking the Code LLMs.” Link

Report generated February 2026. All citations verified at time of research.

This post is licensed under CC BY 4.0 by the author.