AI isn't hard. Understanding your codebase is.

Anyone can point an LLM at a repository and generate code. What’s hard is knowing what that code should do — and whether it’s correct.

We use GraphRAG the same way we use compilers and databases — as a tool grounded in structure, not guesswork.

You get AI that works with your system, not against it.

What are GraphRAG solutions?

GraphRAG combines a knowledge graph grounded in your own data with an LLM to:

  1. Deliver context-aware answers centered around your own data
  2. Follow patterns specified in your knowledge graph
  3. Modify or redirect LLM responses whenever your knowledge graph changes
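
The loop described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration: the knowledge graph is a plain dict and the node names are invented, but it shows how answers stay grounded in — and change with — your own data:

```python
# Minimal GraphRAG loop: retrieve facts from a knowledge graph,
# then ground the LLM prompt in those facts. All names are illustrative.

knowledge_graph = {
    "PaymentService": {"calls": ["LedgerDB"], "language": "Java"},
    "LedgerDB": {"calls": [], "language": "SQL"},
}

def retrieve_context(entity: str) -> str:
    """Pull the facts about one entity out of the graph."""
    node = knowledge_graph.get(entity, {})
    return f"{entity}: written in {node.get('language', '?')}, calls {node.get('calls', [])}"

def build_prompt(question: str, entity: str) -> str:
    """Ground the question in graph facts instead of letting the LLM guess."""
    return f"Context:\n{retrieve_context(entity)}\n\nQuestion: {question}"

prompt = build_prompt("What does PaymentService depend on?", "PaymentService")
```

Because the context is queried from the graph at prompt time, updating the graph redirects the LLM's response — point 3 above — with no retraining or re-prompt-engineering.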

Although GraphRAG can be applied to a diverse range of problems, we initially focus on:

"Large, complex codebases with multiple languages, integrations, and poor documentation"

Large repositories contain:

  • Multiple languages
  • Build systems
  • Framework glue
  • Test harnesses
  • Configuration files
  • Generated assets

If you don’t know how these pieces connect, manually browsing the repo becomes guesswork.

    flowchart LR

    %% Core Codebase
    subgraph Core_Codebase["Large & Complex Codebase"]
        A("Legacy Module (C++)")
        B("Backend Services (Java)")
        C("Data Processing (Python)")
        D("Frontend API (JavaScript)")
        E("Automation Scripts (Bash)")
    end

    %% Databases
    subgraph Data_Layer["Data Layer"]
        DB1[(PostgreSQL)]
        DB2[(MongoDB)]
        DB3[(Redis Cache)]
    end

    %% External Integrations
    subgraph Integrations["External Integrations"]
        API1["Payment API"]
        API2["Analytics API"]
        API3["Partner System"]
        API4["Cloud Storage"]
    end

    %% Internal Connections
    A --> B
    B -->|Anti-pattern| C
    C -->|Brittle REST APIs| DB1
    C --> DB2
    B -->|Unknown dependencies| DB3
    D --> B
    E --> C

    %% Integrations
    B --> API1
    C --> API2
    D -->|Undocumented hooks| API3
    E -->|Silent failure points| API4

    %% Documentation
    DOC("Incomplete / Outdated Documentation")
    DOC -. Missing .- A
    DOC -. Outdated .- B
    DOC -. Incomplete .- C
    DOC -. Inaccurate .- D

Why use GraphRAG for coding challenges?

Refactoring code with LLMs isn’t new — but standard prompting quickly runs into hard questions:

  • How do you fit an entire codebase into an LLM prompt? Every model has a token limit.
  • How do you stop an LLM from hallucinating? Left unchecked, it drifts — and doom loops begin.
  • How do you enforce architecture and coding standards — and can you really trust an LLM to follow them?
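
The first problem — fitting a codebase into a finite context window — is exactly where a graph helps: instead of pasting the whole repo, walk the dependency graph outward from the code you care about and stop when the token budget is spent. A hypothetical sketch (file names and costs are invented):

```python
from collections import deque

dependency_graph = {          # node -> nodes it depends on (illustrative)
    "checkout.py": ["cart.py", "payment.py"],
    "cart.py": ["models.py"],
    "payment.py": ["models.py", "gateway.py"],
    "models.py": [], "gateway.py": [],
}
token_cost = {n: 500 for n in dependency_graph}   # pretend each file costs 500 tokens

def select_context(start: str, budget: int) -> list[str]:
    """Breadth-first walk; include a file only while it fits the budget."""
    selected, queue, seen = [], deque([start]), {start}
    while queue and budget > 0:
        node = queue.popleft()
        if token_cost[node] > budget:
            continue
        budget -= token_cost[node]
        selected.append(node)
        for dep in dependency_graph[node]:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return selected

files = select_context("checkout.py", budget=1500)   # nearest 3 of 5 files fit
```

Breadth-first order means the files closest to the change land in the prompt first, which is usually what the LLM needs most.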

GraphRAG workflow

There are two parts to a GraphRAG workflow:

  1. Create two codebase knowledge graphs (KGs) deterministically (i.e. without LLMs):

    • Data KG - ingest database schema metadata directly into a graph model

      flowchart TD
          subgraph Data KG
          A(Database)-->|Extract|I(Schema)
          I-->|Metadata|J(Graph model)
          J-->K(Data KG) 
      end
    • Application KG - use language-specific parsers to map connections between functions, parameters, services, etc., across files and repos

      flowchart TD
          subgraph Application KG
          B(Codebase repo)
          B-->|Parse syntax|L(Abstract Syntax Tree)
          B-->|Full-text search|M(Lexical search)
          O(Graph model)
          N(Application KG)
          L-->O
          M-->O
          O-->N
      end
  2. When the entire codebase has been mapped, we can:

    • Select only the relevant parts of the codebase to fit into an LLM's context window
    • Prompt an LLM to explain or refactor code
    • Check adherence to code architecture, patterns and standards by parsing the LLM's responses and comparing them against the codebase KGs

      flowchart LR
          subgraph Deterministic
          A(Data KG)
          B(Application KG)
          end
          A-->|Graph query|C
          B-->|Graph query|C
          subgraph Probabilistic
          C(Prompt Interface)
          D(LLM)
          end
          C-->D
          D-->|Parse syntax & compare|Deterministic
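
The final arrow — parse the LLM's response and compare it against the KGs — is the deterministic check that keeps the probabilistic side honest. A hypothetical sketch using Python's built-in `ast` module, with invented function names standing in for the Application KG:

```python
import ast

# Function names the Application KG says exist in the codebase (illustrative).
kg_known_functions = {"load_orders", "save_order", "audit_log"}

llm_response = """
def process(order):
    audit_log(order)
    save_ordr(order)
"""  # note the typo the LLM introduced

def hallucinated_calls(source: str, known: set[str]) -> set[str]:
    """Return function calls in generated code that the KG has never seen."""
    tree = ast.parse(source)
    calls = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    return calls - known

unknown = hallucinated_calls(llm_response, kg_known_functions)  # {'save_ordr'}
```

Any name in `unknown` is a candidate hallucination: the response can be rejected or re-prompted before it ever reaches a human reviewer.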
Codebase Knowledge Graphs (KGs) are software ontologies

A codebase KG is not a new concept: it is simply a way of developing and storing a software ontology that can be used by both humans and machines.

Traditional approaches to ontology specification—using languages such as OWL, RDFS, or SPARQL—do not lend themselves well to highly structured artefacts like source code. Codebase KG schemas must instead be developed iteratively alongside the underlying codebase. As new questions arise, the KG schema needs to evolve to support them. Likewise, when the codebase changes, the schema may also need to be updated. The process therefore forms a continuous feedback loop.
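
That feedback loop is easiest to see when the schema is held as plain data rather than fixed up front in an ontology language. An illustrative sketch (labels and relationship types are invented):

```python
# A codebase-KG schema held as plain data so it can evolve with the
# codebase rather than being specified once in OWL/RDFS.

schema = {
    "nodes": {"File", "Function", "Class", "Table"},
    "relationships": {
        ("Function", "CALLS", "Function"),
        ("File", "DEFINES", "Function"),
    },
}

# A new question arrives: "which functions read which tables?"
# The schema grows to support it -- no ontology language required.
schema["relationships"].add(("Function", "READS_FROM", "Table"))
```

When the codebase changes in turn (say, a new storage layer), the same mechanism runs in the other direction: labels and relationship types are added or retired alongside the code.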

The use of KG schemas to store software ontologies is explored more fully in the following 2025 AWS re:Invent paper: Symbolic AI in the Age of LLMs

Project layout: Codebase teardown engine

WHY: When you need to understand and refactor an unfamiliar and complex codebase quickly

HOW: A deterministic, modular build system that maps an entire codebase into a graph. Graph queries can then be used to prompt LLMs to refactor syntax, which is then checked against a new, pre-specified graph schema.

STEPS: A systematic approach to understanding any unfamiliar repository:

  1. Extract repository metadata to analyse file types and directory structure
  2. Traverse the Abstract Syntax Tree (AST) of each code file to extract all classes, functions and variables
  3. Store everything in a KG for structural reasoning
  4. Use the KG to re-design / re-architect the codebase
  5. Use KG queries as prompts for an LLM to refactor code
  6. Auto-check the LLM's syntax against the revised KG schema
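
Step 2 can be sketched with Python's built-in `ast` module for Python files (multi-language repos would need language-specific parsers, as the `parsers/` directory below suggests). The source snippet and names here are illustrative:

```python
import ast

source = """
class OrderService:
    def place_order(self, order_id):
        total = 0
        return total
"""

def extract_entities(src: str) -> dict[str, list[str]]:
    """Walk one file's AST and collect the entities that become KG nodes."""
    tree = ast.parse(src)
    entities = {"classes": [], "functions": [], "variables": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            entities["classes"].append(node.name)
        elif isinstance(node, ast.FunctionDef):
            entities["functions"].append(node.name)
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    entities["variables"].append(target.id)
    return entities

entities = extract_entities(source)
# classes: ['OrderService'], functions: ['place_order'], variables: ['total']
```

Each extracted entity becomes a node, and the containment and call relationships between them become the edges that later graph queries traverse.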

TOOLING REPO:

graphrag-code-kg/
│
├── pyproject.toml
├── README.md
├── .env
│
├── notebooks/                             # Experimental code + presentations
│
├── src/
│   └── codebase_kg/
│       ├──__init__.py
│       │
│       ├── services/                      # Backend integration APIs    
│       │   ├── neo4j_services.py          # Neo4j connection function
│       │   ├── neo4j_cypher.py            # Neo4j cypher query function
│       │   └── github_repo_service.py     # Repo metadata via GitHub API
│       │
│       ├── cli/                           # Thin API interface layer
│       │   ├── cli.py                     # Auto-scan ./commands folder
│       │   │
│       │   └── commands/
│       │       ├── neo4j.py               # CLI syntax - imports Neo4j service
│       │       └── repo.py                # CLI syntax - imports GitHub service
│       │
│       ├── schemas/                       # Software architecture schemas 
│       │
│       ├── data_kg/                       # Database metadata ingestion
│       │   ├── extract/                   # Connectors + foreign key logic
│       │   ├── transform/                 # Convert raw metadata → graph model
│       │   ├── load/                      # Bulk CSV → Neo4j
│       │   └── pipeline.py                # Orchestrates full Data KG build
│       │
│       ├── app_kg/                        # Business logic metadata ingestion
│       │   ├── repo_metadata/             # Directory structure + file types
│       │   ├── parsers/                   # Language-specific AST extraction
│       │   ├── codesearch/                # Cross-file usage + invocations
│       │   ├── transform/                 # Convert raw metadata → graph model 
│       │   ├── load/                      # Bulk CSV → Neo4j
│       │   └── pipeline.py                # Orchestrates full App KG build
│       │
│       ├── neo4jImports/                  # CSV files to import into Neo4j 
│       │
│       └── graph_queries/                 # Deterministic graph intelligence
│           ├── shortest_join_paths.cypher
│           ├── application_logic.cypher
│           └── dependency_paths.cypher
│
├── tests/                                 # Test-driven development repo 
│
├── graphrag/                              # Graph-LLM interaction
│   ├── embeddings/
│   ├── retrievers/
│   ├── prompt_builders/
│   └── response_synthesis/
│
└── docs/                                  # MkDocs code documentation site