Grapuco Benchmark
Does providing business context to AI coding agents actually produce better code? We tested 6 different context strategies on a real-world e-commerce project.
All results below are averaged across 100 independent runs per configuration to ensure statistical reliability.
The Problem
AI coding agents today can generate code impressively fast - but they code "blind". Without understanding the business domain, they produce code that compiles but misses critical business rules, validation logic, and domain-specific constraints. The result? Code that works syntactically but fails semantically.
Research Question
If we provide structured business specifications - Use Cases, Data Entities, Business Rules, and Workflow Flows - to an AI coding agent via MCP (Model Context Protocol), does the generated code quality improve compared to having no context, static documentation, or code-structure-only awareness?
Experimental Setup
A controlled benchmark testing 6 context strategies on the same e-commerce project.
The Subject - BabyShop E-commerce
Full-stack e-commerce platform with 4 modules and 8 use cases
NestJS + TypeORM
PostgreSQL database, class-validator DTOs, modular architecture
Next.js 14 App Router
TypeScript, Tailwind CSS, React hooks state management
4 Modules, 8 Use Cases
Dashboard, Orders, Products, Shopping & Checkout
5 Progressive Tasks
Each task builds on the previous, from project init to admin dashboard
T1 - Project Init + Schema
Initialize NestJS + Next.js monorepo with PostgreSQL. Create Category and Product entities with all fields and relationships.
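As a rough sketch of what T1 asks the agent to produce, the two entities might take shapes like the interfaces below. The field names here are illustrative assumptions; the benchmark's business specs define the real schema.

```typescript
// Hypothetical shapes for the T1 Category and Product entities.
// Field names are assumptions, not the benchmark's actual spec.
interface Category {
  id: number;
  name: string;
  slug: string;
  products?: Product[]; // one-to-many side of the relationship
}

interface Product {
  id: number;
  name: string;
  price: number;
  stock: number;
  categoryId: number; // many-to-one back to Category
}

const demo: Product = { id: 1, name: "Baby Bottle", price: 9.99, stock: 20, categoryId: 1 };
console.log(demo.name);
```

In the actual benchmark these would be TypeORM entity classes with decorators; plain interfaces are used here only to keep the sketch self-contained.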
T2 - Product CRUD + Storefront
Full CRUD for Categories and Products. Build storefront with homepage, category listing, product detail pages, and pagination.
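The pagination requirement in T2 can be sketched as a small helper; the page/limit convention and the `Page` shape are assumptions, since the benchmark's API contract is not spelled out here.

```typescript
// Minimal pagination sketch for the T2 product listing.
// The page/limit convention is an assumption about the benchmark's API.
interface Page<T> {
  items: T[];
  page: number;
  totalPages: number;
}

function paginate<T>(rows: T[], page: number, limit: number): Page<T> {
  const totalPages = Math.max(1, Math.ceil(rows.length / limit));
  const clamped = Math.min(Math.max(1, page), totalPages); // keep page in range
  const start = (clamped - 1) * limit;
  return { items: rows.slice(start, start + limit), page: clamped, totalPages };
}

const products = Array.from({ length: 25 }, (_, i) => `product-${i + 1}`);
console.log(paginate(products, 3, 10)); // third page holds the last 5 items
```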
T3 - Cart + Checkout
Shopping cart with add/update/remove + stock validation. Checkout flow with customer info, shipping fee, and order creation.
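The stock-validation rule in T3 is the kind of semantic constraint a "blind" agent tends to miss. A minimal sketch, with illustrative names (`CartLine`, `addToCart`) not taken from the benchmark:

```typescript
// Sketch of the T3 rule: cart quantity must never exceed available stock.
// Names and error wording are illustrative, not the benchmark's spec.
interface CartLine { productId: number; qty: number; }

function addToCart(cart: CartLine[], productId: number, qty: number, stock: number): CartLine[] {
  const existing = cart.find((l) => l.productId === productId);
  const requested = (existing?.qty ?? 0) + qty; // total requested after this add
  if (requested > stock) {
    throw new Error(`Insufficient stock: requested ${requested}, available ${stock}`);
  }
  return existing
    ? cart.map((l) => (l.productId === productId ? { ...l, qty: requested } : l))
    : [...cart, { productId, qty }];
}

let cart = addToCart([], 1, 2, 5); // ok: 2 of 5 in stock
cart = addToCart(cart, 1, 3, 5);   // ok: total is exactly 5 of 5
console.log(cart);
```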
T4 - Order Management
Order status workflow (PENDING → CONFIRMED → SHIPPING → DELIVERED/CANCELLED). Admin order list, detail, and status transition with history.
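The T4 workflow can be expressed as a transition table. The forward chain follows the task text; which states may be cancelled is an assumption made for this sketch.

```typescript
// The T4 order-status workflow as a transition table. The forward chain
// matches the task; the cancellation rules here are an assumption.
type OrderStatus = "PENDING" | "CONFIRMED" | "SHIPPING" | "DELIVERED" | "CANCELLED";

const transitions: Record<OrderStatus, OrderStatus[]> = {
  PENDING: ["CONFIRMED", "CANCELLED"],
  CONFIRMED: ["SHIPPING", "CANCELLED"],
  SHIPPING: ["DELIVERED", "CANCELLED"],
  DELIVERED: [], // terminal
  CANCELLED: [], // terminal
};

function canTransition(from: OrderStatus, to: OrderStatus): boolean {
  return transitions[from].includes(to);
}

console.log(canTransition("PENDING", "CONFIRMED")); // true
console.log(canTransition("DELIVERED", "SHIPPING")); // false
```

Encoding the workflow as data rather than scattered `if` checks is exactly the kind of business rule the benchmark's specs hand to the agent explicitly.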
T5 - Admin Dashboard
Admin dashboard with revenue charts, top-selling products, recent orders, and low-stock alerts. Summary stat widgets.
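The "top-selling products" widget in T5 is a per-product aggregation over order lines. A sketch under an assumed input shape:

```typescript
// Sketch of the T5 top-selling aggregation: sum quantities per product
// across order lines, then rank. The OrderLine shape is an assumption.
interface OrderLine { productId: number; qty: number; }

function topSelling(lines: OrderLine[], n: number): { productId: number; sold: number }[] {
  const totals = new Map<number, number>();
  for (const l of lines) totals.set(l.productId, (totals.get(l.productId) ?? 0) + l.qty);
  return [...totals.entries()]
    .map(([productId, sold]) => ({ productId, sold }))
    .sort((a, b) => b.sold - a.sold)
    .slice(0, n);
}

const lines = [
  { productId: 1, qty: 3 }, { productId: 2, qty: 5 }, { productId: 1, qty: 4 },
];
console.log(topSelling(lines, 2)); // product 1 (7 sold) ranks first
```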
6 Scenarios
From zero context to full business + code awareness via MCP
Naked (No Context)
None
Zero business context. The AI agent receives only the task prompt and must infer all requirements from the prompt alone.
Markdown Spec
Static file (119 lines)
A static 119-line Markdown specification file injected as a system prompt. Contains data models, business rules, API routes, and code conventions.
Spec Agent (MCP)
MCP - Business specs only
Business specs fetched via Grapuco MCP Server. The agent calls get_context and get_active_task_context to retrieve Use Cases, Data Entities, Business Rules, and Flows.
Graph Only (MCP)
MCP - Code graph only
Architecture graph via Grapuco MCP. The agent uses get_architecture and get_dependencies to understand existing code structure, but has no business specs.
Full Grapuco (MCP)
MCP - Spec + Graph
Full Grapuco stack: Spec Agent + Architecture Graph combined. The agent has access to both business context AND code structure awareness via MCP.
GitNexus (Local)
MCP - Local indexing
Local code indexing competitor. Uses stdio-based MCP to analyze the workspace after each task. Provides code search but no business context.
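The MCP-based scenarios differ mainly in which server, and therefore which tools, Claude Code is pointed at. A project-level `.mcp.json` entry might look like the sketch below; the server command and arguments are hypothetical placeholders, not Grapuco's actual distribution.

```json
{
  "mcpServers": {
    "grapuco": {
      "command": "grapuco-mcp",
      "args": ["serve"]
    }
  }
}
```

Swapping this entry (or removing it) is what distinguishes one scenario's tool surface from another's while the task prompts stay identical.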
Methodology
Automated, reproducible, and isolated
AI Model
Claude Sonnet 4.6 via Claude Code CLI with --dangerously-skip-permissions for full automation
Workspace Isolation
Each run wipes the workspace completely. Zero contamination between configurations
Build Validation
npm run build for both backend and frontend. Build must pass or errors are fed back for rework (max 5 retries)
Timeout
15-minute timeout per task. Tasks exceeding this limit are marked as failed
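The build-validate-rework loop described above can be sketched as follows. The real harness shells out to `npm run build`; here the build step is injected as a function so the control flow runs standalone, and the fake build is purely illustrative.

```typescript
// Sketch of the benchmark's build-validation loop (max 5 retries).
// In the harness, `build` runs `npm run build` and captures compiler errors.
interface BuildResult { ok: boolean; errors: string; }

function buildWithRework(
  build: (feedback: string) => BuildResult,
  maxRetries = 5,
): { passed: boolean; attempts: number } {
  let feedback = "";
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const result = build(feedback);
    if (result.ok) return { passed: true, attempts: attempt };
    feedback = result.errors; // errors are fed back to the agent for rework
  }
  return { passed: false, attempts: maxRetries };
}

// Fake build that succeeds on the third attempt.
let calls = 0;
const fake = () => (++calls >= 3 ? { ok: true, errors: "" } : { ok: false, errors: "TS2304" });
console.log(buildWithRework(fake)); // { passed: true, attempts: 3 }
```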
Overall Results
Average project completion time across all scenarios (lower is better).
Project Completion Time
Total time to complete all 5 tasks (averaged across 100 runs)
Scenario      Total Time   Total Tokens
Naked         31m40s       119.1K
Markdown      52m37s       130.7K
Spec Agent    40m24s       144.6K
Graph Only    41m41s       159.8K
Grapuco       33m09s       121.9K
GitNexus      34m55s       132.8K
Shorter time = faster; lower is better in both columns.
Scenario Breakdown
Deep dive into each scenario - real prompts, MCP configuration, and per-task performance averaged across 100 runs.
Scenarios
Full Grapuco (MCP)
MCP - Spec + Graph
Scenario Description
The AI agent has FULL access to both Spec Agent and Architecture Graph via Grapuco MCP. Before each task, it MUST fetch both business specs (Use Cases, Business Rules) AND code structure (architecture map, dependencies). This represents the complete Grapuco experience.
Resources Used
Task Prompt
Same task prompt as Naked arm
System Prompt
Full instructions with CRITICAL INSTRUCTION enforcing both spec + graph queries
Grapuco MCP (Full)
All 13 tools: get_context, get_architecture, get_dependencies, get_data_flows, get_impact_analysis, search_code, list_projects, and more
8 Use Cases
Full structured business flows from Spec Agent
Architecture Graph
Live code structure - entities, services, controllers, relationships, call chains
Project Time: 33m09s
Avg / Task: 6m38s
Total Tokens: 121.9K
Pass Rate: 100%
Analysis & Discussion
Interpreting the benchmark results - what the data suggests and where further investigation is needed.
Key Observations
Combined context achieves the fastest completion among context strategies
The Full Grapuco scenario (Spec + Graph) completed all 5 tasks in 33m09s on average - 37% faster than Markdown (52m37s) and the fastest of any context-augmented scenario, landing within roughly 5% of the Naked baseline (31m40s). When an AI agent has simultaneous access to both structured business specifications and live code architecture, it spends less time iterating and produces buildable code more efficiently than with any partial-context strategy.
Static documentation introduces overhead without proportional benefit
The Markdown scenario recorded the longest total time at 52m37s - 66% slower than the Naked baseline (31m40s). This suggests that injecting a large static specification file as a system prompt may cause the model to over-process context rather than act on it. The additional tokens consumed (130.7K vs 119.1K for Naked) did not translate into time savings.
Code structure awareness alone increases token usage
Graph Only consumed the highest token count (159.8K) across all scenarios while completing in 41m41s. Having access to architecture graph tools without business specifications led the agent to perform extensive graph traversals, increasing token consumption without a corresponding reduction in completion time. This implies structural context is most effective when paired with domain knowledge.
Full Grapuco delivers the best token efficiency among context strategies
Grapuco achieved 121.9K total tokens across all tasks - the second lowest after Naked (119.1K) - while maintaining the fastest completion time. This is 24% fewer tokens than Graph Only and 7% fewer than Markdown, suggesting that combining business specs with code structure helps the model generate more targeted, concise output without excessive exploration.
Conclusion
The benchmark data indicates that, among context-provisioning strategies, combining business context with code structure awareness via MCP leads to the fastest project completion and the lowest token consumption - approaching the no-context baseline on both measures. Notably, partial context strategies (business specs only, or code graph only) did not outperform the no-context baseline in time. This suggests that the synergy between domain knowledge and architectural awareness is what drives the improvement - neither alone is sufficient to reliably accelerate AI-assisted development.
Limitations & Future Work
- All runs used a single AI model (Claude Sonnet 4.6). Results may differ across model families, sizes, and providers.
- The benchmark project (BabyShop) represents a specific e-commerce domain. Generalizability to other domains (fintech, healthcare, infrastructure) has not been tested.
- Build pass/fail is a binary quality metric. A deeper qualitative analysis of generated code (adherence to business rules, code maintainability, test coverage) would provide additional insight.
- The benchmark measures time and token efficiency. Production readiness factors such as security, scalability, and edge case handling require separate evaluation.
Ready to supercharge your AI coding agent?
Give your AI the business context it needs. Grapuco transforms your codebase into a Knowledge Graph and serves it via MCP - so your AI agent codes with understanding, not guessing.