Security code review with Large Language Models (LLMs) has a structural problem that most people wave away: an LLM reads code one file at a time, but design flaws do not live in one file. They live in the relationships between components, and more often, in what is absent. The auth middleware that was never applied, the validation that does not exist, the rate limiting that covers four routes out of a hundred and nine.
Pattern-matching tools like Semgrep see the whole codebase but cannot express “find all routes that do NOT have auth middleware.” LLMs can reason about that question but cannot see the whole codebase. Both approaches find local bugs well enough, and neither finds design flaws.
I wanted to know if there was a middle path: give the LLM a structural graph of the codebase that it can query, rather than source code to read. The graph would carry the cross-file relationships while the LLM provides the reasoning, and the question was whether the combination could surface design-level findings that neither tool reaches alone.
The graph: Code Property Graphs
A Code Property Graph (CPG) overlays three views of the same codebase: the abstract syntax tree (structure), the control flow graph (execution paths), and the program dependence graph (data flow). Joern, the open-source CPG engine, builds all three into a single queryable graph and exposes a Scala-based query language called CPGQL.
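To make the three layers concrete, here is a rough sketch of what a query against each one looks like in the Joern shell. The method name (`login`) and the exact steps are illustrative assumptions, not queries from the analysis, and node naming varies by language frontend:

```scala
// Run inside the Joern shell against a loaded CPG (sketch, not from the analysis).

// AST layer (structure): every method paired with its parameter names.
cpg.method.map(m => (m.name, m.parameter.name.l)).l

// CFG layer (execution paths): the first statement of a method and its successors.
cpg.method.name("login").cfgFirst.cfgNext.code.l

// Data-flow layer: do values from any parameter reach an "exec"-like call?
def src  = cpg.method.parameter
def sink = cpg.call.name(".*exec.*").argument
sink.reachableByFlows(src).p
```

The point is that one graph answers all three kinds of question through one traversal language, which is what makes it a useful target for an LLM to query.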
For the target application, an Express/TypeScript app with custom JWT auth and no input validation, Joern produced a graph of 443,000 nodes. The LLM never touched those nodes directly; instead, it formulated CPGQL queries and worked with the small, structured results that came back.
Two passes: learn the architecture, then interrogate the design
The approach has two phases. In Pass 1, the LLM knows only the framework name (“this is an Express app with Sequelize”) and asks nine generic structural questions: how many routes? what middleware exists? is there input validation? what database operations are exposed? These questions require no knowledge of the specific application, and the answers produce a structural profile of the codebase.
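A few of the Pass 1 questions could translate to CPGQL along these lines. This is a hedged sketch: the call names and the `app|router` receiver pattern are assumptions about how an Express app registers routes, and the real queries live in the repo:

```scala
// Pass 1 sketch: framework-generic structural questions (Express assumed).

// How many routes? Registrations look like app.get(...) / router.post(...).
cpg.call.name("get|post|put|delete|patch")
  .where(_.argument(0).code("app|router")).size

// What middleware exists? app.use(...) registrations.
cpg.call.name("use").argument.code.l

// What database operations are exposed? Sequelize-style model calls.
cpg.call.name("findAll|findOne|create|update|destroy").code.l
```

None of these require knowing anything about the specific application, which is the property that makes Pass 1 reusable across targets.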
In Pass 2, the LLM reads the Pass 1 results (still not source code) and writes targeted queries for eight CWE categories under OWASP A06: Insecure Design. This is where it gets interesting, because the LLM can now ask questions like “which routes lack any security middleware” using negative sub-traversals, a query pattern that filters for absence rather than presence. Pattern matchers cannot express this, and LLMs reading source files cannot hold the full route inventory in context. The CPG makes it tractable.
What I found
The analysis produced thirty confirmed findings across eight CWE categories, with zero false positives when validated against the target application’s 107 documented security challenges. Some highlights:
Of the 109 routes, 58 (53%) had no authentication middleware at all, including password change endpoints, admin panels, and every file upload route. The finding came from a single negative sub-traversal: enumerate all route handlers, filter out those with a security.* argument, report the rest.
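The repo documents the exact query; as a sketch, the negative sub-traversal could look like the following, where the receiver and `security\..*` argument patterns are assumptions about this app's conventions:

```scala
// Sketch: route registrations with no security.* middleware among their arguments.
cpg.call
  .name("get|post|put|delete|patch")
  .where(_.argument(0).code("app|router"))   // only route registrations
  .whereNot(_.argument.code("security\\..*")) // filter for ABSENCE of middleware
  .map(c => c.argument(1).code)               // argument 1 is the route path string
  .l
```

The `whereNot` step is the part a pattern matcher cannot express: it conditions on what a call site lacks, not on what it contains.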
The crypto implementation used MD5 with no salt for password hashing, a hardcoded HMAC key visible in source, and plaintext storage for TOTP secrets and credit card numbers. Joern’s taint analysis traced data flows from user input through to storage to confirm the findings.
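A flow query for this class of finding could be sketched as below. The source and sink patterns (`req.body` access, an `md5`-named call, Node's `createHmac`) are assumptions for illustration, not the queries from the analysis:

```scala
// Sketch: does request data reach the hash call, and what key does the HMAC use?

// Taint: user input flowing into an MD5-style hash.
def userInput = cpg.call.code("req\\.(body|params|query).*")
def hashSink  = cpg.call.name("md5|hash").argument
hashSink.reachableByFlows(userInput).p

// Hardcoded key: string literals passed directly to an HMAC constructor.
cpg.call.name("createHmac").argument.isLiteral.code.l
```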
Four file upload routes had validators that existed but did nothing: checkFileType and checkUploadSize both called next() unconditionally. The validators were present in the code, so a quick grep would see “validation exists,” but the CPG revealed that the validation functions never rejected anything.
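Detecting that shape structurally could be sketched as a query for validator methods that call next() but contain no conditional at all; the `check.*` name pattern is an assumption matching this app's validators:

```scala
// Sketch: validators that are present but inert — they invoke next()
// and have no control structure (no if/branch) that could ever reject.
cpg.method
  .name("check.*")
  .where(_.call.name("next"))
  .whereNot(_.controlStructure)  // no conditional anywhere in the body
  .name
  .l
```

This is the same absence-oriented pattern as the middleware query, applied inside a function body instead of at a call site.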
The full results, including all 20 CPGQL queries and their outputs, are documented in the source repo.
What did not work
The approach has real limits. Joern’s JavaScript/TypeScript frontend is less mature than its Java and C support, and some queries that should have been straightforward required workarounds. GitNexus, a complementary code knowledge graph tool, contributed useful architectural context on eight of the 20 questions but could not match Joern’s expression-level precision for the CWE-targeted queries.
More fundamentally, the two-pass method still requires a human to decide which CWE categories to target and to evaluate whether the query results constitute genuine findings. The LLM formulated the CPGQL queries and reasoned about the results, but the analytical framework came from outside. This is closer to “LLM-assisted analysis” than “automated vulnerability detection,” and I think that distinction matters.
Where this goes
This was an exploration, not a finished tool, and there are open questions worth pursuing. The immediate one is whether the two-pass pattern generalizes beyond Express to other frameworks and languages where Joern has stronger frontend support (Java, C/C++). The longer-term question is whether the structural query step can be made more autonomous, with the LLM selecting which CWE categories to investigate based on the Pass 1 profile rather than being told.
I have some ideas on both fronts. More to come when I get to it.
Resources
The full interactive presentation walks through the approach slide by slide, including the CPG layer visualizations, CPGQL query examples, and the complete findings table: cpg.blocksec.ca
The source repo contains all documentation, from initial tool evaluation through final validation, organized in reading order: BlockSecCA/llm-cpg-exploration
Tools and references from the project:
- Joern, the open-source CPG engine used for all structural queries
- joern-mcp, the MCP server that wraps Joern’s HTTP API and made it possible for the LLM to call queries as tool invocations
- vulnerable-app, the target application: a debranded fork of an intentionally vulnerable Express app with 107 documented challenges, stripped of identifying markers to prevent LLM training data leakage during analysis
- Yamaguchi, Golde, Arp, and Rieck, Modeling and Discovering Vulnerabilities with Code Property Graphs (2014), the original paper that introduced the CPG concept
- Lekssays et al., LLMxCPG: A Framework for LLM-Driven Code Vulnerability Detection using Code Property Graphs (2025), which demonstrated that LLMs can learn CPGQL but left architectural discovery and negative queries unexplored
- OWASP A06:2025 Insecure Design, the category that framed the entire analysis