Does AGENTS.md Really Help? - Analysis of a Paper Verifying the Impact of Context Files on Coding Agents

Posted Feb 25, 2026

Intermediate

By Sehyup

10 min read

Does AGENTS.md Really Help? - Analysis of a Paper Verifying the Impact of Context Files on Coding Agents

Prerequisites — Read these first

AI in Game Development: Insights from 2,165 Messages - A Developer's 47-Day Log
Claude Opus 4.5 → 4.6 Transition: Performance, Tokens, Workflow Changes Experienced by a Game Developer
Does AGENTS.md Really Help? - Analysis of a Paper Verifying the Impact of Context Files on Coding Agents
Claude Memory Goes Free, /simplify & /batch — and the Hidden Cost of CLAUDE.md
Complete Guide to Installing Claude Code on Windows — With Real-World Troubleshooting
게임 기획자를 위한 Claude Code 완전 활용 가이드 — 사양서부터 밸런싱까지

Claude Opus 4.5 → 4.6 Transition... Claude Memory Goes Free, /simpli...

TL;DR — Key Takeaways

While context files like AGENTS.md and CLAUDE.md have been adopted by over 60,000 repositories, LLM-generated context files can actually degrade performance and increase costs by more than 20%.
The primary cause is redundancy with existing documentation — when documents were removed, context files showed a +2.7% performance improvement, proving that information overload is the issue.
Only minimal context files directly written by developers show a slight positive effect, and specific tooling requirements are more useful than general codebase overviews.

Introduction

When using Claude Code, writing a CLAUDE.md file has become a near-default workflow. OpenAI’s Codex recommends AGENTS.md, and Anthropic’s Claude Code suggests CLAUDE.md. These context files have now been adopted by over 60,000 open-source repositories.

But do these files really improve agent performance? Researchers from ETH Zurich and Anthropic conducted the first large-scale empirical study on this question. The results are quite different from industry conventional wisdom.

Paper: “Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?” Authors: Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, Martin Vechev (ETH Zurich / Anthropic) arXiv: 2602.11988

1. Background: What are Context Files?

Context files for coding agents are Markdown files located at the repository root. They serve to inform the agent about the project structure, build methods, coding rules, and more.

Agent	Filename	Developer
Claude Code	`CLAUDE.md`	Anthropic
Codex	`AGENTS.md`	OpenAI
Cursor	`.cursorrules`	Cursor
Windsurf	`.windsurfrules`	Codeium

Common components of these files:

Codebase Overview: Directory structure, descriptions of key modules.
Build/Test Commands: npm test, bundle exec jekyll serve, etc.
Coding Conventions: Naming rules, style guides.
Project-Specific Requirements: Use of specific tools, precautions.

The industry has expected that providing these files would help agents understand projects better and generate more accurate code. But is that actually the case?

2. Experimental Design

2-1. Evaluation Pipeline

Figure 1. Overview of the evaluation pipeline. The agent solves tasks under three conditions (Developer-written context / No context / LLM-generated context), and the generated patches are evaluated with unit tests.

The researchers compared three conditions:

None — Working without a context file.
LLM — Providing a context file automatically generated by an LLM.
Human — Providing a context file directly written by a developer.

2-2. Benchmarks

SWE-bench Lite (Existing Benchmark)

300 tasks from 11 popular Python repositories.
Well-known projects like Django, scikit-learn, and sympy.
Since original repositories lacked context files, only None vs. LLM could be compared.

AGENTbench (Newly created for this paper)

138 instances from 12 niche Python repositories where developer-written context files actually exist.
Allows comparison across all three conditions: None, LLM, and Human.
Carefully curated from 5,694 PRs.

Figure 2. Distribution of AGENTbench instances across 12 open-source repositories. Compared to SWE-bench, it is more evenly distributed across repositories rather than being biased toward specific ones.

AGENTbench Statistics

Metric	Average	Minimum	Maximum
PR Body Length (words)	415.3	5	4,961
Issue Description (words)	211.6	96	500
Codebase File Count	3,337	151	26,602
PR Patch Line Count	118.9	12	1,973
Files Modified	2.5	1	23
Test Coverage	75%	2.5%	100%
Context File Length (words)	641.0	24	2,003
Context File Section Count	9.7	1	29

2-3. Agents Tested

Agent	Model	Developer
Claude Code	Sonnet-4.5	Anthropic
Codex	GPT-5.2	OpenAI
Codex	GPT-5.1 mini	OpenAI
Qwen Code	Qwen3-30b-coder	Alibaba

They evaluated 4 major coding agents across 2 benchmarks and 3 conditions — testing a total of 20 experimental combinations.

3. Key Results: The Context File Paradox

3-1. Impact on Performance

Figure 3a. Resolve rate on SWE-bench Lite. Across all four models, the LLM-generated context file (orange) shows an equal or lower success rate compared to no context (blue).

Figure 3b. Resolve rate on AGENTbench. Developer-written context (green) shows slight improvement, but LLM-generated context (orange) still degrades performance.

Summary of Results:

Condition	SWE-bench Lite	AGENTbench
LLM-Generated Context	-0.5% Success Rate ↓	-2% Success Rate ↓
Developer-Written Context	(Not testable)	+4% Success Rate ↑

Context files automatically generated by LLMs degraded performance on both benchmarks. Even files written directly by developers showed only a marginal improvement. These results directly conflict with the industry recommendation to “always write context files.”

3-2. Costs Certainly Increase

On the other hand, costs definitely increased.

Model	Dataset	Cost (None)	Cost (LLM)	Cost (Human)
Sonnet-4.5	SWE-bench	$1.30	$1.51 (+16%)	—
GPT-5.2	SWE-bench	$0.32	$0.43 (+34%)	—
Sonnet-4.5	AGENTbench	$1.15	$1.33 (+16%)	$1.30 (+13%)
GPT-5.2	AGENTbench	$0.38	$0.57 (+50%)	$0.54 (+42%)
GPT-5.1 mini	AGENTbench	$0.18	$0.20 (+11%)	$0.19 (+6%)
Qwen3-30b	AGENTbench	$0.13	$0.15 (+15%)	$0.15 (+15%)

An average cost increase of over 20%, but without a corresponding performance gain. In other words, context files have a negative return on investment (ROI).

4. Why are Context Files Harmful?

4-1. Information Redundancy with Existing Docs

This is the most critical discovery of the paper.

Figure 5. AGENTbench results after removing existing documentation. When documents are absent, the LLM-generated context file (orange) shows an average +2.7% performance improvement, outperforming even the developer-written file (green).

The researchers re-ran the experiment after removing all .md files, example code, and docs/ folders from the repositories. The results:

With docs: LLM context file causes a -2% performance drop.
Without docs: LLM context file provides a +2.7% performance improvement.

This proves that context files generated by LLMs contain nearly the same information as READMEs and documentation already present in the repo. Redundant information confuses the agent.

To use a game development analogy, it’s like loading the same texture twice under different names. It only wastes memory without improving rendering quality.

4-2. No Improvement in File Discovery Speed

Figure 4a. Number of steps taken by the agent to first interact with a relevant file on SWE-bench Lite.

Figure 4b. The same pattern on AGENTbench. Having a context file does not help find relevant files any faster.

One might expect that a codebase overview in a context file would help an agent find relevant files faster, but in reality, it took 15 to 25 steps regardless of the presence of a context file. This means codebase overviews do not contribute to file exploration efficiency.

4-3. Increased Tool Use but Static Accuracy

Figure 6. Average increase in tool use when context files are included. Use of almost all tools, including grep, tests, and file reading, increases, but the success rate does not.

Agent behavioral changes when context files are present:

grep use: +30-50% increase.
test execution: +40-100% increase.
repo-specific tools: Used 2.5x more when mentioned in the context file.
package managers (e.g., uv): Used 1.6x more when mentioned.

Agents faithfully follow the instructions in the context file. The problem is that following those instructions does not lead to better results. It’s a form of “diligent inefficiency” — more searching and more testing, but without getting any closer to the correct answer.

4-4. Increased Reasoning Token Consumption

Figure 7. Average reasoning token usage for GPT-5.2 and GPT-5.1 mini. When a context file is present, more tokens are consumed for reasoning.

Model	Reasoning Token Increase (LLM Context)	Reasoning Token Increase (Human Context)
GPT-5.2	+22%	+20%
GPT-5.1 mini	+14%	+2%

Agents consume more tokens as they process the context file and reflect its instructions in their reasoning. Since this extra reasoning doesn’t lead to better results, it is pure waste.

5. Why Did This Happen?

Synthesizing the findings of the paper reveals the following mechanism:

Context File Provided
   ├── Agent faithfully follows instructions ✅
   │    ├── More tool use (grep, test, etc.)
   │    ├── Wider exploration range
   │    └── More reasoning tokens consumed
   │
   ├── But information is redundant ❌
   │    ├── Same content as README, docs/
   │    └── Information the agent can already access
   │
   └── Result: Cost ↑, Performance ↔ or ↓

Key Insight: The issue isn’t a lack of “instruction-following ability” in agents, but rather the information quality of the instructions themselves. Repeatedly telling an agent information it can already discover on its own only broadens the exploration scope and increases costs.

6. So, What Should We Do?

Recommendations from the Paper

For Repository Maintainers:

What NOT to do	What to do
Auto-generate context files with LLMs	Manually write only minimal essential information
Describe the entire codebase structure	Describe only project-specific unique requirements
Write long and detailed context files	Provide short and specific instructions

For Agent Developers:

Reconsider recommending automatic context file generation.
Focus on improving the utilization of existing context rather than generating more context.

How to Write Effective Context Files

Based on the paper’s conclusions, here is a summary of what is worth including in a context file and what is not:

Information worth including (unique to the repo, not in docs):

  
# Build Commands (Project-specific)
bundle exec jekyll serve  # Local development
JEKYLL_ENV=production bundle exec jekyll build  # Production

# Special Tool Requirements
- Use 'uv' as the package manager
- Always run tests with 'pytest-asyncio'

# Project-specific Rules
- Use '.en.md' and '.ja.md' suffixes for translation files
- Store images under 'assets/img/post/{category}/'

Information to exclude (already in documentation):

  
# ❌ Codebase Structure Overview (Agent can explore directly)
src/
  components/  — React components
  utils/       — Utility functions
  ...

# ❌ General Coding Rules (LLM has already learned these)
- Use camelCase for variable names
- Functions should follow the Single Responsibility Principle

7. Limitations and Future Research

This paper is not the final word. Some limitations include:

Focused on Python projects: The results are from Python repos well-represented in training data. Context files might be more useful in niche languages like Rust or Zig, or in specialized domains like game engines.
Measured only task resolve rate: Other metrics like code quality, security, and style consistency were not evaluated.
Single-task basis: The cumulative effect of repeatedly working on the same repo in a long-term project was not verified.
Limitations of SWE-bench: Since it focuses on popular repos, the LLM may have already encountered the code in its training data.

Future Research Directions

The impact of context files in niche programming languages.
Evaluation of impact on code efficiency and security.
Methods for automatically generating a truly useful minimal context file.

8. Applying This to This Blog

This blog (Epheria) is a Jekyll-based static site, and I have a fairly detailed CLAUDE.md file. Applying the paper’s conclusions:

Information worth keeping:

Build commands like bundle exec jekyll serve — Project-specific.
jekyll-polyglot multilingual rules (.en.md, .ja.md suffixes) — Unique project rules not in standard documentation.
Post front matter format — Non-standard settings due to Chirpy theme customization.
assets/img/post/{category}/ image rules — Project-specific convention.

Information to reconsider:

Listing the entire directory structure — Agent can explore this directly.
Statistics on the number of posts by category — Informative, but may not contribute to agent performance.
General coding style rules — Already defined in .editorconfig.

However, since this blog is a niche project with many Jekyll + Chirpy customizations rather than a “popular Python repo,” it’s more reasonable to refine the context file with project-specific information rather than following the “omit context file” recommendation literally.

Conclusion

The conclusion of this paper can be summarized in one sentence:

“Don’t tell the agent what it already knows.”

Context files themselves are not bad. Including redundant information is what’s bad. Minimal context files containing only project-specific information, written directly by developers rather than auto-generated by LLMs, still showed a slight positive effect.

When writing a context file, ask yourself: “Can the agent figure this out on its own by exploring the repo?” If the answer is “yes,” it’s better not to include it in the context file.

References

Gloaguen, T., Mündler, N., Müller, M., Raychev, V., & Vechev, M. (2026). Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? arXiv:2602.11988. https://arxiv.org/abs/2602.11988
OpenAI. (2026). Custom instructions with AGENTS.md. https://developers.openai.com/codex/guides/agents-md/
Anthropic. (2026). Claude Code Documentation — CLAUDE.md. https://docs.anthropic.com/en/docs/claude-code

AI, Claude

This post is licensed under CC BY 4.0 by the author.