AI-Assisted Development · Brownfield

The Codebase Archaeology Phase Most AI Workflows Skip

An AI agent reads the files. It does not recover the theory behind them. On a brownfield codebase, that gap is where the confidently wrong PRs come from.

Three layers of archaeology The comprehension memo Where AI cannot go alone

How we engineer AI-assisted development Read the previous post in this series

Michael Feathers, 2004: "Legacy code is code without tests." That definition has held for 22 years because it pointed at the real problem. The hard part of changing old code is not the change, it is knowing what you broke.

In 2026, with AI coding agents writing or editing a growing share of every legacy codebase, the definition needs one extension. Legacy code is also code without a recoverable understanding of why it behaves the way it does. Point an agent at it and you get a plausible plan and a confidently wrong PR, written against assumptions the agent invented because the file did not say.

The discipline this post is about

Codebase archaeology

Recovering a system's lost understanding, why the code behaves the way it does, before you change it. Most AI workflows skip it.

Reading is not a step. On a system someone has depended on for a decade, it is a deliverable.

Robert Glass's Facts and Fallacies of Software Engineering documented that understanding existing code consumes roughly 30% of total maintenance time on its own, with broader comprehension activities pushing past 60%. Your team is already spending most of its legacy hours on reading, before any AI is in the loop.

AI coding agents have the opposite bias. They are tuned to generate, so without intervention they read just enough to produce confident output and stop.

Anthropic's published Explore to Plan to Code to Commit workflow puts exploration first, and Claude Code's Plan Mode makes it explicit. The published guidance treats exploration as a session activity. On a brownfield engagement, that is not enough.

Exploration has to produce an artefact that survives the session, the agent, and the engineer who ran it. That is the difference between a step and a phase. A phase produces something the next person on your team can read.

Archaeology has three layers. AI does well at two of them and is fundamentally limited on the third.

Three layers, three things you are reading

Layer 1: The code as it stands now

What you can read when you browse the repository: the architecture, the layer boundaries, the dependency direction, the call graph touching the change site.

This is where AI is strongest. Claude Code with LSP, Cursor with workspace context, and GitHub Copilot's @workspace produce a usable map of the current structure in minutes. Anthropic's enterprise guide names LSP as the infrastructure that makes large-codebase AI work usable in C, C++, and Java repositories where naive grep is unusable.

The failure mode is the simplest one. The agent reads what is in front of it and assumes that is the whole picture.

Layer 2: The history of how the code got there

The commit history carries signal the current snapshot does not: files that always change together but live three folders apart, modules that have been rewritten four times in two years, dead code nobody dares delete because the original author left.

This layer is invisible to a static read. The agent will not infer it from the files alone. It has to be fed the history, either through git-log tooling or behavioral analysis pipelines that surface the coupling.

AI can interpret this layer well when given the data. It cannot generate the data on its own.

Layer 3: The understanding that never made it to the file

Examples that come up on every engagement: the invariant a regulator depends on, the audit-log write that has to happen before the database commit, the reason the validator lives in a weird place, the bug that turned out to be load-bearing.

Feathers' observation, written for human engineers in 2004, applies here without modification. This is the layer that survives in the heads of the people who built the system, and is lost when they leave. The code is a partial record of it, not a complete one.

His answer was characterization tests: tests written not to assert what the code should do, but to capture what it currently does, so a future engineer can change it safely. A characterization test is archaeology in executable form. You run the code, observe the output, and write it down.

The act of writing it down is what makes the understanding survive.

AI accelerates this work. Hand an agent a function and ask it to propose a characterization test suite, and it will produce a plausible draft in minutes. What it cannot do alone is decide which behaviors are the load-bearing ones, the ones that must be preserved versus the accidents of history that should not.

Layer 1

The code as it stands now

What you are reading: architecture, dependencies, call graph

Where AI helps: fast and accurate with an LSP

Where AI cannot go alone: misses what is not in front of it

Layer 2

The history of how the code got there

What you are reading: coupling, churn, dead code

Where AI helps: capable when paired with git-log data

Where AI cannot go alone: cannot generate the data itself

Layer 3

The understanding that never made it to the file

What you are reading: invariants, audit obligations, load-bearing bugs

Where AI helps: drafts characterization tests at speed

Where AI cannot go alone: cannot decide what is load-bearing

The comprehension memo: what it contains and what it is not

The output of archaeology is a written artefact, 1 to 3 pages, the comprehension memo. It names what the system does, where the change site lives, the invariants that must hold after the change, the historical signal pulled from commit data, and the open questions nobody on the current team can answer.

What the comprehension memo names

The system: layer structure, dependency direction, what the change site does.
The change site: what this file is, every place it is invoked from.
The invariants: what must remain true after the change, the ones the tests cover and the ones they do not.
The historical signal: which files always change together with this one, and what commit history says about why.
The open questions: what nobody on the current team can answer, and the smallest characterization test suite that would pin the behavior anyway.

The agent's context window is ephemeral. The memo is durable. The next session, the next engineer on your team, the next agent reads the memo first.

This is different from a CLAUDE.md or copilot-instructions.md file, which the previous post in this cluster covered in depth. Those are repository-wide and standing. The memo is engagement-specific and change-area-specific.

The repo file is the standing brief. The memo is the working brief for a specific change. Both are durable, both are written down, both outlive the engineer who created them.

Depth scales to risk. A typo fix gets one paragraph. A payments refactor gets three pages plus a characterization test suite for the hot files.

The comprehension memo is the written form of what would otherwise live in one engineer's head and leave when they do.

AI accelerates the visible layers. It exposes the invisible one.

For Layer 1, the current state of the code, AI is fast and accurate when given an LSP and a clear scope. Claude Code, Claude Opus 4.x in extended-thinking mode, and Cursor's agent mode all produce useful structural reads. The acceleration is real and worth taking.

For Layer 2, the historical signal, AI is mediocre alone and capable when paired with git-log tooling it can interpret. Hand it the right query output and it can name patterns the static read missed.

For Layer 3, the unwritten understanding, AI is fundamentally limited, and bigger context windows do not solve the limitation. Research on long-context recall (Liu et al., "Lost in the Middle," TACL 2024) found that language models retrieve information from the middle of long contexts significantly worse than information at the beginning or end, even on models explicitly designed for long context. Information that was never in the artefact in the first place cannot be retrieved by reading more of the artefact.

The honest framing: AI does not remove the archaeology phase. It accelerates the visible layers and exposes the invisible one as the residual your team still has to handle. That is a real improvement, and it is also not a substitute for engineer time on the question of what was true that nobody wrote down.

Honest limitations

Skip archaeology on throwaway prototypes, isolated greenfield modules, and 30-minute typo fixes. The overhead is not earned when the price of being wrong is low and the surface area is small.

Where the discipline pays off is brownfield code multiplied by the price of being wrong. Veracode's 2025 GenAI Code Security Report, analyzing the code of more than 100 large language models across Java, JavaScript, Python, and C#, found that 45% of AI-generated code contained security vulnerabilities, with AI-written code producing flaws at 2.74 times the human rate. That is the price of skipping archaeology in exactly the conditions where the bar should be production-grade.

The rule: archaeology depth scales with codebase age multiplied by the consequence of getting it wrong.

Both high, full memo plus characterization tests. Both low, skip it. One high and one low, calibrate.

The first deliverable, before a line changes

The first deliverable on every AnAr legacy modernization engagement is a written comprehension memo, before any code changes. Not a kickoff doc. An artefact the client can audit and the next engineer can read cold.

Characterization tests follow for the hot files. The two together are what our team hands over before we touch the production code.

This is the first move of the Understand phase in the four-phase process we run on client codebases. It is the move the previous post in this cluster hinted at when it described what an agent must know to edit a 12-year-old codebase. This post is how a team produces that knowledge in the first place.

Feathers wrote about humans changing legacy code in 2004. AI agents are now the dominant author of legacy-code changes. His discipline matters more, not less.

AnAr Solutions builds and modernizes software with AI in the loop on every engagement. Our engineers run the comprehension memo and characterization tests as the first deliverables on every legacy modernization engagement, before a line changes.

The Codebase Archaeology Phase Most AI Workflows Skip

Codebase archaeology

Reading is not a step. On a system someone has depended on for a decade, it is a deliverable.

Three layers, three things you are reading

Layer 1: The code as it stands now

Layer 2: The history of how the code got there

Layer 3: The understanding that never made it to the file

The code as it stands now

The history of how the code got there

The understanding that never made it to the file

The comprehension memo: what it contains and what it is not

AI accelerates the visible layers. It exposes the invisible one.

Honest limitations

The first deliverable, before a line changes

Recent Posts