Grapuco Benchmark
Does providing business context to AI coding agents actually produce better code? We tested 6 different context strategies on a real-world e-commerce project.
All results below are averaged across 100 independent runs per configuration to ensure statistical reliability.
The Problem
AI coding agents today can generate code impressively fast - but they code "blind". Without understanding the business domain, they produce code that compiles but misses critical business rules, validation logic, and domain-specific constraints. The result? Code that works syntactically but fails semantically.
Research Question
If we provide structured business specifications - Use Cases, Data Entities, Business Rules, and Workflow Flows - to an AI coding agent via MCP (Model Context Protocol), does the generated code quality improve compared to having no context, static documentation, or code-structure-only awareness?
Experimental Setup
A controlled benchmark testing 6 context strategies on the same e-commerce project.
The Subject - BabyShop E-commerce
Full-stack e-commerce platform with 4 modules and 8 use cases
NestJS + TypeORM
PostgreSQL database, class-validator DTOs, modular architecture
Next.js 14 App Router
TypeScript, Tailwind CSS, React hooks state management
4 Modules, 8 Use Cases
Dashboard, Orders, Products, Shopping & Checkout
5 Progressive Tasks
Each task builds on the previous, from project init to admin dashboard
T1 - Project Init + Schema
Initialize NestJS + Next.js monorepo with PostgreSQL. Create Category and Product entities with all fields and relationships.
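As a rough sketch of what T1 asks the agent to produce, the two entities might take shapes like the interfaces below. The field names here are illustrative assumptions; the benchmark's business specs define the real schema.

```typescript
// Hypothetical shapes for the T1 Category and Product entities.
// Field names are assumptions, not the benchmark's actual spec.
interface Category {
  id: number;
  name: string;
  slug: string;
  products?: Product[]; // one-to-many side of the relationship
}

interface Product {
  id: number;
  name: string;
  price: number;
  stock: number;
  categoryId: number; // many-to-one back to Category
}

const demo: Product = { id: 1, name: "Baby Bottle", price: 9.99, stock: 20, categoryId: 1 };
console.log(demo.name);
```

In the actual benchmark these would be TypeORM entity classes with decorators; plain interfaces are used here only to keep the sketch self-contained.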
T2 - Product CRUD + Storefront
Full CRUD for Categories and Products. Build storefront with homepage, category listing, product detail pages, and pagination.
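The pagination requirement in T2 can be sketched as a small helper; the page/limit convention and the `Page` shape are assumptions, since the benchmark's API contract is not spelled out here.

```typescript
// Minimal pagination sketch for the T2 product listing.
// The page/limit convention is an assumption about the benchmark's API.
interface Page<T> {
  items: T[];
  page: number;
  totalPages: number;
}

function paginate<T>(rows: T[], page: number, limit: number): Page<T> {
  const totalPages = Math.max(1, Math.ceil(rows.length / limit));
  const clamped = Math.min(Math.max(1, page), totalPages); // keep page in range
  const start = (clamped - 1) * limit;
  return { items: rows.slice(start, start + limit), page: clamped, totalPages };
}

const products = Array.from({ length: 25 }, (_, i) => `product-${i + 1}`);
console.log(paginate(products, 3, 10)); // third page holds the last 5 items
```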
T3 - Cart + Checkout
Shopping cart with add/update/remove + stock validation. Checkout flow with customer info, shipping fee, and order creation.
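The stock-validation rule in T3 is the kind of semantic constraint a "blind" agent tends to miss. A minimal sketch, with illustrative names (`CartLine`, `addToCart`) not taken from the benchmark:

```typescript
// Sketch of the T3 rule: cart quantity must never exceed available stock.
// Names and error wording are illustrative, not the benchmark's spec.
interface CartLine { productId: number; qty: number; }

function addToCart(cart: CartLine[], productId: number, qty: number, stock: number): CartLine[] {
  const existing = cart.find((l) => l.productId === productId);
  const requested = (existing?.qty ?? 0) + qty; // total requested after this add
  if (requested > stock) {
    throw new Error(`Insufficient stock: requested ${requested}, available ${stock}`);
  }
  return existing
    ? cart.map((l) => (l.productId === productId ? { ...l, qty: requested } : l))
    : [...cart, { productId, qty }];
}

let cart = addToCart([], 1, 2, 5); // ok: 2 of 5 in stock
cart = addToCart(cart, 1, 3, 5);   // ok: total is exactly 5 of 5
console.log(cart);
```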
T4 - Order Management
Order status workflow (PENDING → CONFIRMED → SHIPPING → DELIVERED/CANCELLED). Admin order list, detail, and status transition with history.
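The T4 workflow can be expressed as a transition table. The forward chain follows the task text; which states may be cancelled is an assumption made for this sketch.

```typescript
// The T4 order-status workflow as a transition table. The forward chain
// matches the task; the cancellation rules here are an assumption.
type OrderStatus = "PENDING" | "CONFIRMED" | "SHIPPING" | "DELIVERED" | "CANCELLED";

const transitions: Record<OrderStatus, OrderStatus[]> = {
  PENDING: ["CONFIRMED", "CANCELLED"],
  CONFIRMED: ["SHIPPING", "CANCELLED"],
  SHIPPING: ["DELIVERED", "CANCELLED"],
  DELIVERED: [], // terminal
  CANCELLED: [], // terminal
};

function canTransition(from: OrderStatus, to: OrderStatus): boolean {
  return transitions[from].includes(to);
}

console.log(canTransition("PENDING", "CONFIRMED")); // true
console.log(canTransition("DELIVERED", "SHIPPING")); // false
```

Encoding the workflow as data rather than scattered `if` checks is exactly the kind of business rule the benchmark's specs hand to the agent explicitly.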
T5 - Admin Dashboard
Admin dashboard with revenue charts, top-selling products, recent orders, and low-stock alerts. Summary stat widgets.
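The "top-selling products" widget in T5 is a per-product aggregation over order lines. A sketch under an assumed input shape:

```typescript
// Sketch of the T5 top-selling aggregation: sum quantities per product
// across order lines, then rank. The OrderLine shape is an assumption.
interface OrderLine { productId: number; qty: number; }

function topSelling(lines: OrderLine[], n: number): { productId: number; sold: number }[] {
  const totals = new Map<number, number>();
  for (const l of lines) totals.set(l.productId, (totals.get(l.productId) ?? 0) + l.qty);
  return [...totals.entries()]
    .map(([productId, sold]) => ({ productId, sold }))
    .sort((a, b) => b.sold - a.sold)
    .slice(0, n);
}

const lines = [
  { productId: 1, qty: 3 }, { productId: 2, qty: 5 }, { productId: 1, qty: 4 },
];
console.log(topSelling(lines, 2)); // product 1 (7 sold) ranks first
```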
6 Scenarios
From zero context to full business + code awareness via MCP
Naked (No Context)
None
Zero business context. The AI agent receives only the task prompt and must infer all requirements from the prompt alone.
Markdown Spec
Static file (119 lines)
A static 119-line Markdown specification file injected as a system prompt. Contains data models, business rules, API routes, and code conventions.
Spec Agent (MCP)
MCP - Business specs only
Business specs fetched via Grapuco MCP Server. The agent calls get_context and get_active_task_context to retrieve Use Cases, Data Entities, Business Rules, and Flows.
Graph Only (MCP)
MCP - Code graph only
Architecture graph via Grapuco MCP. The agent uses get_architecture and get_dependencies to understand existing code structure, but has no business specs.
Full Grapuco (MCP)
MCP - Spec + Graph
Full Grapuco stack: Spec Agent + Architecture Graph combined. The agent has access to both business context AND code structure awareness via MCP.
GitNexus (Local)
MCP - Local indexing
Local code indexing competitor. Uses stdio-based MCP to analyze the workspace after each task. Provides code search but no business context.
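The MCP-based scenarios differ mainly in which server, and therefore which tools, Claude Code is pointed at. A project-level `.mcp.json` entry might look like the sketch below; the server command and arguments are hypothetical placeholders, not Grapuco's actual distribution.

```json
{
  "mcpServers": {
    "grapuco": {
      "command": "grapuco-mcp",
      "args": ["serve"]
    }
  }
}
```

Swapping this entry (or removing it) is what distinguishes one scenario's tool surface from another's while the task prompts stay identical.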
Methodology
Automated, reproducible, and isolated
AI Model
Claude Sonnet 4.6 via Claude Code CLI with --dangerously-skip-permissions for full automation
Workspace Isolation
Each run wipes the workspace completely. Zero contamination between configurations
Build Validation
npm run build for both backend and frontend. Build must pass or errors are fed back for rework (max 5 retries)
Timeout
15-minute timeout per task. Tasks exceeding this limit are marked as failed
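The build-validate-rework loop described above can be sketched as follows. The real harness shells out to `npm run build`; here the build step is injected as a function so the control flow runs standalone, and the fake build is purely illustrative.

```typescript
// Sketch of the benchmark's build-validation loop (max 5 retries).
// In the harness, `build` runs `npm run build` and captures compiler errors.
interface BuildResult { ok: boolean; errors: string; }

function buildWithRework(
  build: (feedback: string) => BuildResult,
  maxRetries = 5,
): { passed: boolean; attempts: number } {
  let feedback = "";
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const result = build(feedback);
    if (result.ok) return { passed: true, attempts: attempt };
    feedback = result.errors; // errors are fed back to the agent for rework
  }
  return { passed: false, attempts: maxRetries };
}

// Fake build that succeeds on the third attempt.
let calls = 0;
const fake = () => (++calls >= 3 ? { ok: true, errors: "" } : { ok: false, errors: "TS2304" });
console.log(buildWithRework(fake)); // { passed: true, attempts: 3 }
```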
Overall Results
Average project completion time across all scenarios (lower is better).
Project Completion Time
Total time to complete all 5 tasks (averaged across 100 runs)
Scenario      Total Time   Total Tokens
Naked         31m40s       119.1K
Markdown      52m37s       130.7K
Spec Agent    40m24s       144.6K
Graph Only    41m41s       159.8K
Grapuco       33m09s       121.9K
GitNexus      34m55s       132.8K
Shorter time = faster; lower is better in both columns.
Scenario Breakdown
Deep dive into each scenario - real prompts, MCP configuration, and per-task performance averaged across 100 runs.
Scenarios
Full Grapuco (MCP)
MCP - Spec + Graph
Scenario Description
The AI agent has FULL access to both Spec Agent and Architecture Graph via Grapuco MCP. Before each task, it MUST fetch both business specs (Use Cases, Business Rules) AND code structure (architecture map, dependencies). This represents the complete Grapuco experience.
Resources Used
Task Prompt
Same task prompt as Naked arm
System Prompt
Full instructions with CRITICAL INSTRUCTION enforcing both spec + graph queries
Grapuco MCP (Full)
All 13 tools: get_context, get_architecture, get_dependencies, get_data_flows, get_impact_analysis, search_code, list_projects, and more
8 Use Cases
Full structured business flows from Spec Agent
Architecture Graph
Live code structure - entities, services, controllers, relationships, call chains
Project Time: 33m09s
Avg / Task: 6m38s
Total Tokens: 121.9K
Pass Rate: 100%
Analysis & Discussion
Interpreting the benchmark results - what the data suggests and where further investigation is needed.
Key Observations
Combined context achieves the fastest completion among context strategies
The Full Grapuco scenario (Spec + Graph) completed all 5 tasks in 33m09s on average - 37% faster than Markdown (52m37s) and the fastest of any context-augmented scenario, landing within roughly 5% of the Naked baseline (31m40s). When an AI agent has simultaneous access to both structured business specifications and live code architecture, it spends less time iterating and produces buildable code more efficiently than with any partial-context strategy.
Static documentation introduces overhead without proportional benefit
The Markdown scenario recorded the longest total time at 52m37s - 66% slower than the Naked baseline (31m40s). This suggests that injecting a large static specification file as a system prompt may cause the model to over-process context rather than act on it. The additional tokens consumed (130.7K vs 119.1K for Naked) did not translate into time savings.
Code structure awareness alone increases token usage
Graph Only consumed the highest token count (159.8K) across all scenarios while completing in 41m41s. Having access to architecture graph tools without business specifications led the agent to perform extensive graph traversals, increasing token consumption without a corresponding reduction in completion time. This implies structural context is most effective when paired with domain knowledge.
Full Grapuco delivers the best token efficiency among context strategies
Grapuco achieved 121.9K total tokens across all tasks - the second lowest after Naked (119.1K) - while maintaining the fastest completion time. This is 24% fewer tokens than Graph Only and 7% fewer than Markdown, suggesting that combining business specs with code structure helps the model generate more targeted, concise output without excessive exploration.
Conclusion
The benchmark data indicates that, among context-provisioning strategies, combining business context with code structure awareness via MCP leads to the fastest project completion and the lowest token consumption - approaching the no-context baseline on both measures. Notably, partial context strategies (business specs only, or code graph only) did not outperform the no-context baseline in time. This suggests that the synergy between domain knowledge and architectural awareness is what drives the improvement - neither alone is sufficient to reliably accelerate AI-assisted development.
Limitations & Future Work
- All runs used a single AI model (Claude Sonnet 4.6). Results may differ across model families, sizes, and providers.
- The benchmark project (BabyShop) represents a specific e-commerce domain. Generalizability to other domains (fintech, healthcare, infrastructure) has not been tested.
- Build pass/fail is a binary quality metric. A deeper qualitative analysis of generated code (adherence to business rules, code maintainability, test coverage) would provide additional insight.
- The benchmark measures time and token efficiency. Production readiness factors such as security, scalability, and edge case handling require separate evaluation.
Ready to supercharge your AI coding agent?
Give your AI the business context it needs. Grapuco transforms your codebase into a Knowledge Graph and serves it via MCP - so your AI agent codes with understanding, not guessing.