Does AGENTS.md Really Help? - Analysis of a Paper Verifying the Impact of Context Files on Coding Agents

TL;DR — Key Takeaways
  • While context files like AGENTS.md and CLAUDE.md have been adopted by over 60,000 repositories, LLM-generated context files can actually degrade performance and increase costs by more than 20%.
  • The primary cause appears to be redundancy with existing documentation: once that documentation was removed, LLM-generated context files yielded a +2.7% performance improvement, which suggests the duplicated information is what hurts.
  • Only minimal context files directly written by developers show a slight positive effect, and specific tooling requirements are more useful than general codebase overviews.

Introduction

When using Claude Code, writing a CLAUDE.md file has become a near-default workflow. OpenAI’s Codex recommends AGENTS.md, and Anthropic’s Claude Code suggests CLAUDE.md. These context files have now been adopted by over 60,000 open-source repositories.

But do these files really improve agent performance? Researchers from ETH Zurich and Anthropic conducted the first large-scale empirical study on this question. The results are quite different from industry conventional wisdom.

Paper: “Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?”
Authors: Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, Martin Vechev (ETH Zurich / Anthropic)
arXiv: 2602.11988


1. Background: What are Context Files?

Context files for coding agents are Markdown files located at the repository root. They serve to inform the agent about the project structure, build methods, coding rules, and more.

| Agent | Filename | Developer |
|---|---|---|
| Claude Code | CLAUDE.md | Anthropic |
| Codex | AGENTS.md | OpenAI |
| Cursor | .cursorrules | Cursor |
| Windsurf | .windsurfrules | Codeium |

Common components of these files:

  • Codebase Overview: Directory structure, descriptions of key modules.
  • Build/Test Commands: npm test, bundle exec jekyll serve, etc.
  • Coding Conventions: Naming rules, style guides.
  • Project-Specific Requirements: Use of specific tools, precautions.

The industry has expected that providing these files would help agents understand projects better and generate more accurate code. But is that actually the case?


2. Experimental Design

2-1. Evaluation Pipeline

Figure 1. Overview of the evaluation pipeline. The agent solves tasks under three conditions (Developer-written context / No context / LLM-generated context), and the generated patches are evaluated with unit tests.

The researchers compared three conditions:

  1. None — Working without a context file.
  2. LLM — Providing a context file automatically generated by an LLM.
  3. Human — Providing a context file directly written by a developer.

2-2. Benchmarks

SWE-bench Lite (Existing Benchmark)

  • 300 tasks from 11 popular Python repositories.
  • Well-known projects like Django, scikit-learn, and sympy.
  • Since original repositories lacked context files, only None vs. LLM could be compared.

AGENTbench (Newly created for this paper)

  • 138 instances from 12 niche Python repositories where developer-written context files actually exist.
  • Allows comparison across all three conditions: None, LLM, and Human.
  • Carefully curated from 5,694 PRs.

Figure 2. Distribution of AGENTbench instances across 12 open-source repositories. Compared to SWE-bench, it is more evenly distributed across repositories rather than being biased toward specific ones.

AGENTbench Statistics

| Metric | Average | Minimum | Maximum |
|---|---|---|---|
| PR Body Length (words) | 415.3 | 5 | 4,961 |
| Issue Description (words) | 211.6 | 96 | 500 |
| Codebase File Count | 3,337 | 151 | 26,602 |
| PR Patch Line Count | 118.9 | 12 | 1,973 |
| Files Modified | 2.5 | 1 | 23 |
| Test Coverage | 75% | 2.5% | 100% |
| Context File Length (words) | 641.0 | 24 | 2,003 |
| Context File Section Count | 9.7 | 1 | 29 |

2-3. Agents Tested

| Agent | Model | Developer |
|---|---|---|
| Claude Code | Sonnet-4.5 | Anthropic |
| Codex | GPT-5.2 | OpenAI |
| Codex | GPT-5.1 mini | OpenAI |
| Qwen Code | Qwen3-30b-coder | Alibaba |

They evaluated 4 major coding agents on 2 benchmarks under up to 3 conditions (two on SWE-bench Lite, three on AGENTbench), for a total of 20 experimental combinations.
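
The figure of 20 follows from the benchmark setup: SWE-bench Lite only supports two conditions, while AGENTbench supports all three. Here is a minimal sketch of that count. The variable names are my own, not from the paper.

```python
from itertools import product

# Agent/model pairs as listed in the table above.
agents = [
    ("Claude Code", "Sonnet-4.5"),
    ("Codex", "GPT-5.2"),
    ("Codex", "GPT-5.1 mini"),
    ("Qwen Code", "Qwen3-30b-coder"),
]

# Conditions available per benchmark. SWE-bench Lite repositories have no
# developer-written context files, so the Human condition is missing there.
conditions = {
    "SWE-bench Lite": ["None", "LLM"],
    "AGENTbench": ["None", "LLM", "Human"],
}

combinations = [
    (benchmark, condition, agent, model)
    for benchmark, available in conditions.items()
    for condition, (agent, model) in product(available, agents)
]

print(len(combinations))  # 4 * 2 + 4 * 3 = 20
```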


3. Key Results: The Context File Paradox

3-1. Impact on Performance

Figure 3a. Resolve rate on SWE-bench Lite. Across all four models, the LLM-generated context file (orange) shows an equal or lower success rate compared to no context (blue).

Figure 3b. Resolve rate on AGENTbench. Developer-written context (green) shows slight improvement, but LLM-generated context (orange) still degrades performance.

Summary of Results:

| Condition | SWE-bench Lite | AGENTbench |
|---|---|---|
| LLM-Generated Context | -0.5% Success Rate ↓ | -2% Success Rate ↓ |
| Developer-Written Context | (Not testable) | +4% Success Rate ↑ |

Context files automatically generated by LLMs degraded performance on both benchmarks. Even files written directly by developers showed only a marginal improvement. These results directly conflict with the industry recommendation to “always write context files.”

3-2. Costs Certainly Increase

On the other hand, costs definitely increased.

| Model | Dataset | Cost (None) | Cost (LLM) | Cost (Human) |
|---|---|---|---|---|
| Sonnet-4.5 | SWE-bench | $1.30 | $1.51 (+16%) | n/a |
| GPT-5.2 | SWE-bench | $0.32 | $0.43 (+34%) | n/a |
| Sonnet-4.5 | AGENTbench | $1.15 | $1.33 (+16%) | $1.30 (+13%) |
| GPT-5.2 | AGENTbench | $0.38 | $0.57 (+50%) | $0.54 (+42%) |
| GPT-5.1 mini | AGENTbench | $0.18 | $0.20 (+11%) | $0.19 (+6%) |
| Qwen3-30b | AGENTbench | $0.13 | $0.15 (+15%) | $0.15 (+15%) |

Costs rose by more than 20% on average, with no corresponding performance gain. In other words, context files have a negative return on investment (ROI).
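
To sanity-check the percentages, the relative increase for the LLM condition can be recomputed from the dollar figures in the table. This is a rough sketch: the unweighted mean over the six rows is my own assumption about how to aggregate, and the paper may average differently.

```python
# Costs in USD copied from the table above: (no context, LLM-generated context).
costs = {
    ("Sonnet-4.5", "SWE-bench"): (1.30, 1.51),
    ("GPT-5.2", "SWE-bench"): (0.32, 0.43),
    ("Sonnet-4.5", "AGENTbench"): (1.15, 1.33),
    ("GPT-5.2", "AGENTbench"): (0.38, 0.57),
    ("GPT-5.1 mini", "AGENTbench"): (0.18, 0.20),
    ("Qwen3-30b", "AGENTbench"): (0.13, 0.15),
}


def relative_increase(baseline: float, with_context: float) -> float:
    """Return the cost increase over the baseline, in percent."""
    return (with_context / baseline - 1.0) * 100.0


increases = {
    key: relative_increase(none_cost, llm_cost)
    for key, (none_cost, llm_cost) in costs.items()
}

# Unweighted mean across rows (an assumption, see the note above).
mean_increase = sum(increases.values()) / len(increases)

for (model, dataset), pct in increases.items():
    print(f"{model:<13} {dataset:<11} +{pct:.0f}%")
print(f"mean: +{mean_increase:.1f}%")
```

The per-row values match the table after rounding, and the unweighted mean lands a little under 24%, consistent with the "over 20%" figure.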


4. Why are Context Files Harmful?

4-1. Information Redundancy with Existing Docs

This is the most critical discovery of the paper.

Figure 5. AGENTbench results after removing existing documentation. When documents are absent, the LLM-generated context file (orange) shows an average +2.7% performance improvement, outperforming even the developer-written file (green).

The researchers re-ran the experiment after removing all .md files, example code, and docs/ folders from the repositories. The results:

  • With docs: LLM context file causes a -2% performance drop.
  • Without docs: LLM context file provides a +2.7% performance improvement.

This suggests that context files generated by LLMs contain nearly the same information as the READMEs and documentation already present in the repo, and that this redundant information confuses the agent.

To use a game development analogy, it’s like loading the same texture twice under different names. It only wastes memory without improving rendering quality.
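
The documentation-removal condition can be approximated with a small script that lists what would be deleted from a repository checkout. This is only a sketch, not the paper's tooling. The `examples` directory name and the `keep` parameter (so the context file under test survives) are my own assumptions.

```python
from __future__ import annotations

from pathlib import Path

# Directory names treated as documentation or example code. The paper summary
# only says ".md files, example code, and docs/ folders"; "examples" is a guess.
DOC_DIR_NAMES = {"docs", "examples"}


def find_documentation(
    repo_root: Path, keep: frozenset[str] = frozenset()
) -> list[Path]:
    """List repo-relative files that a documentation-removal pass would delete.

    `keep` holds file names (for example "AGENTS.md") that must survive,
    because the context file under test is itself a Markdown file.
    """
    doomed = []
    for path in sorted(repo_root.rglob("*")):
        if not path.is_file() or path.name in keep:
            continue
        relative = path.relative_to(repo_root)
        # Only look at parent directories, not the file name itself.
        in_doc_dir = any(part in DOC_DIR_NAMES for part in relative.parts[:-1])
        if in_doc_dir or path.suffix.lower() == ".md":
            doomed.append(relative)
    return doomed
```

Running `find_documentation(Path("."), keep=frozenset({"AGENTS.md"}))` on a local checkout gives a quick sense of how much material an agent can already read without any context file.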

4-2. No Improvement in File Discovery Speed

Figure 4a. Number of steps taken by the agent to first interact with a relevant file on SWE-bench Lite.

Figure 4b. The same pattern on AGENTbench. Having a context file does not help find relevant files any faster.

One might expect that a codebase overview in a context file would help an agent find relevant files faster, but in reality, it took 15 to 25 steps regardless of the presence of a context file. This means codebase overviews do not contribute to file exploration efficiency.
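
A "steps to first relevant file" metric of this kind could be computed roughly as follows. The trajectory format here is hypothetical (one dict per agent step with a `files` list). The real logs differ per agent, and I am assuming "relevant" means files touched by the reference patch.

```python
from __future__ import annotations

from typing import Optional


def steps_to_first_relevant_file(
    trajectory: list[dict], relevant_files: set[str]
) -> Optional[int]:
    """Return the 1-based step at which the agent first touches a relevant file.

    `trajectory` is a hypothetical log format: one dict per agent step, with a
    "files" key listing the paths read or edited in that step.
    """
    for step_number, step in enumerate(trajectory, start=1):
        if relevant_files.intersection(step.get("files", [])):
            return step_number
    return None  # the agent never reached a relevant file


# Toy trajectory: the agent greps, reads the README, then opens the parser.
trajectory = [
    {"tool": "grep", "files": []},
    {"tool": "read", "files": ["README.md"]},
    {"tool": "read", "files": ["src/parser.py"]},
    {"tool": "edit", "files": ["src/parser.py"]},
]
print(steps_to_first_relevant_file(trajectory, {"src/parser.py"}))  # 3
```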

4-3. Increased Tool Use but Static Accuracy

Figure 6. Average increase in tool use when context files are included. Use of almost all tools, including grep, tests, and file reading, increases, but the success rate does not.

Agent behavioral changes when context files are present:

  • grep use: +30-50% increase.
  • test execution: +40-100% increase.
  • repo-specific tools: Used 2.5x more when mentioned in the context file.
  • package managers (e.g., uv): Used 1.6x more when mentioned.

Agents faithfully follow the instructions in the context file. The problem is that following those instructions does not lead to better results. It’s a form of “diligent inefficiency” — more searching and more testing, but without getting any closer to the correct answer.

4-4. Increased Reasoning Token Consumption

Figure 7. Average reasoning token usage for GPT-5.2 and GPT-5.1 mini. When a context file is present, more tokens are consumed for reasoning.

| Model | Reasoning Token Increase (LLM Context) | Reasoning Token Increase (Human Context) |
|---|---|---|
| GPT-5.2 | +22% | +20% |
| GPT-5.1 mini | +14% | +2% |

Agents consume more tokens as they process the context file and reflect its instructions in their reasoning. Since this extra reasoning doesn’t lead to better results, it is pure waste.


5. Why Did This Happen?

Synthesizing the findings of the paper reveals the following mechanism:

Context File Provided
   ├── Agent faithfully follows instructions ✅
   │    ├── More tool use (grep, test, etc.)
   │    ├── Wider exploration range
   │    └── More reasoning tokens consumed
   │
   ├── But information is redundant ❌
   │    ├── Same content as README, docs/
   │    └── Information the agent can already access
   │
   └── Result: Cost ↑, Performance ↔ or ↓

Key Insight: The issue isn’t a lack of “instruction-following ability” in agents, but rather the information quality of the instructions themselves. Repeatedly telling an agent information it can already discover on its own only broadens the exploration scope and increases costs.


6. So, What Should We Do?

Recommendations from the Paper

For Repository Maintainers:

| What NOT to do | What to do |
|---|---|
| Auto-generate context files with LLMs | Manually write only minimal essential information |
| Describe the entire codebase structure | Describe only project-specific unique requirements |
| Write long and detailed context files | Provide short and specific instructions |

For Agent Developers:

  • Reconsider recommending automatic context file generation.
  • Focus on improving the utilization of existing context rather than generating more context.

How to Write Effective Context Files

Based on the paper’s conclusions, here is a summary of what is worth including in a context file and what is not:

Information worth including (unique to the repo, not in docs):

# Build Commands (Project-specific)
bundle exec jekyll serve  # Local development
JEKYLL_ENV=production bundle exec jekyll build  # Production

# Special Tool Requirements
- Use 'uv' as the package manager
- Always run tests with 'pytest-asyncio'

# Project-specific Rules
- Use '.en.md' and '.ja.md' suffixes for translation files
- Store images under 'assets/img/post/{category}/'

Information to exclude (already in documentation):

# ❌ Codebase Structure Overview (Agent can explore directly)
src/
  components/  — React components
  utils/       — Utility functions
  ...

# ❌ General Coding Rules (LLM has already learned these)
- Use camelCase for variable names
- Functions should follow the Single Responsibility Principle
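
One rough way to spot redundant lines before committing a context file is to compare each line against the existing docs. The sketch below uses Jaccard overlap of token sets. This is my own heuristic, not the paper's method, and the 0.8 threshold is an arbitrary starting point.

```python
from __future__ import annotations

import re


def tokens(line: str) -> frozenset[str]:
    """Lowercase a line and split it into alphanumeric tokens."""
    return frozenset(re.findall(r"[a-z0-9_]+", line.lower()))


def redundant_lines(
    context_text: str, docs_text: str, threshold: float = 0.8
) -> list[str]:
    """Return context-file lines that closely match some line in the docs.

    Similarity is the Jaccard overlap of token sets, a crude lexical proxy
    that will miss paraphrases.
    """
    doc_token_sets = [t for t in map(tokens, docs_text.splitlines()) if t]
    flagged = []
    for line in context_text.splitlines():
        line_tokens = tokens(line)
        if len(line_tokens) < 3:  # skip headings and very short lines
            continue
        for doc_tokens in doc_token_sets:
            overlap = len(line_tokens & doc_tokens) / len(line_tokens | doc_tokens)
            if overlap >= threshold:
                flagged.append(line.strip())
                break
    return flagged


readme = "Run the test suite with pytest before opening a pull request."
context = """# Testing
- Run the test suite with pytest before opening a pull request
- Use 'uv' as the package manager
"""
print(redundant_lines(context, readme))
```

In this toy input only the pytest line is flagged, since the README already says the same thing. The `uv` line stays because nothing in the docs covers it.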

7. Limitations and Future Research

This paper is not the final word. Some limitations include:

  1. Focused on Python projects: The results are from Python repos well-represented in training data. Context files might be more useful in niche languages like Rust or Zig, or in specialized domains like game engines.
  2. Measured only task resolve rate: Other metrics like code quality, security, and style consistency were not evaluated.
  3. Single-task basis: The cumulative effect of repeatedly working on the same repo in a long-term project was not verified.
  4. Limitations of SWE-bench: Since it focuses on popular repos, the LLM may have already encountered the code in its training data.

Future Research Directions

  • The impact of context files in niche programming languages.
  • Evaluation of impact on code efficiency and security.
  • Methods for automatically generating a truly useful minimal context file.

8. Applying This to This Blog

This blog (Epheria) is a Jekyll-based static site, and I have a fairly detailed CLAUDE.md file. Applying the paper’s conclusions:

Information worth keeping:

  • Build commands like bundle exec jekyll serve — Project-specific.
  • jekyll-polyglot multilingual rules (.en.md, .ja.md suffixes) — Unique project rules not in standard documentation.
  • Post front matter format — Non-standard settings due to Chirpy theme customization.
  • assets/img/post/{category}/ image rules — Project-specific convention.

Information to reconsider:

  • Listing the entire directory structure — Agent can explore this directly.
  • Statistics on the number of posts by category — Informative, but may not contribute to agent performance.
  • General coding style rules — Already defined in .editorconfig.

However, since this blog is a niche project with many Jekyll + Chirpy customizations rather than a “popular Python repo,” it’s more reasonable to refine the context file with project-specific information rather than following the “omit context file” recommendation literally.


Conclusion

The conclusion of this paper can be summarized in one sentence:

“Don’t tell the agent what it already knows.”

Context files themselves are not bad. Including redundant information is what’s bad. Minimal context files containing only project-specific information, written directly by developers rather than auto-generated by LLMs, still showed a slight positive effect.

When writing a context file, ask yourself: “Can the agent figure this out on its own by exploring the repo?” If the answer is “yes,” it’s better not to include it in the context file.


This post is licensed under CC BY 4.0 by the author.