07 — LLM-assisted security review with 'ask'

See: story index | Audit findings: 05-security | Tool: ask

What 'ask' is

A local Python command-line tool at /home/john/py/ask/ask_with_files_structured.py. It takes a list of files (source code, docs, PDFs, etc.), embeds them with their filenames as context, and sends the whole package to an OpenAI model with a system prompt and question. Output goes to stdout or a file.

Basic invocation for a security audit:

cd /home/john/git/pwsafe
python3 /home/john/py/ask/ask_with_files_structured.py \
    -m o3 \
    -s "You are a security expert auditing C++ code for vulnerabilities. \
         Be thorough. Report every finding with severity (Critical/High/Medium/Low/Info), \
         the exact location, the impact, and a concrete fix." \
    -q "Audit this code for security vulnerabilities." \
    src/os/transport.h \
    src/os/unix/transport.cpp \
    src/os/unix/transport_lockd.cpp \
    src/os/plugins/webdav/transport-webdav.cpp \
    src/os/plugins/file/transport-file.cpp \
    src/core/file.cpp \
    src/os/unix/file.h \
    | tee /tmp/audit-o3.txt

The same file list was then sent to gpt-5.2 for an independent second opinion, and the developer (Claude) made a third, manual pass looking for anything the models had missed.

What the models found

The two model runs produced 35 findings between them before deduplication, spanning the full severity range from Critical to Info. Many were genuine, actionable issues:

• o3 found the four Critical issues: newline injection in the IPC protocol (C1), world-readable cache directory (C2), TOCTOU race in plugin loading (C3), and unrestricted curl redirect (C4). These were the most dangerous bugs in the codebase — any one of them could have caused data loss or arbitrary file overwrite on a user's machine.

• gpt-5.2 found the unbounded allocation in recv_string (H9) — a different class of bug (a child-process OOM crash that leaves the server lock held) that o3 had not highlighted.

• Both models independently flagged missing SSL verification and lock response buffer sizing.

• The manual self-review found SOCK_CLOEXEC missing — a file-descriptor inheritance bug that both models missed.
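To make the C1 finding concrete, here is a toy Python sketch of the bug class — not the actual pwsafe IPC code, and all function names are hypothetical. A line-delimited protocol that embeds an attacker-controlled URL lets one field smuggle in a second command:

```python
# Toy illustration of newline injection in a line-delimited IPC protocol.
# Hypothetical names; the real pwsafe code is C++ and differs in detail.

def frame_request(verb: str, url: str) -> bytes:
    # Naive text framing: one "verb url" command per line.
    return f"{verb} {url}\n".encode()

def parse_requests(stream: bytes) -> list:
    # The "server" splits on newlines and trusts each line as one command.
    cmds = []
    for line in stream.decode().splitlines():
        if line:
            verb, _, url = line.partition(" ")
            cmds.append((verb, url))
    return cmds

# A URL containing an embedded '\n' injects an extra command into the stream.
evil = "https://example.com/vault.psafe3\nDELETE /home/victim/other.psafe3"
cmds = parse_requests(frame_request("GET", evil))
assert len(cmds) == 2           # one request became two
assert cmds[1][0] == "DELETE"   # attacker-chosen verb
```

This is exactly the reasoning-from-untrusted-input the models applied: any delimiter that can appear inside a field breaks text framing.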

Full details: security_audits

Pros

Fresh eyes, no familiarity bias. The models had never seen this codebase before. A developer who wrote the IPC protocol might never think to feed it a URL containing \n — the model spots it immediately because it reasons from first principles about untrusted input.

Systematic coverage. The model reads every line. A human reviewer doing a time-boxed review naturally skims. The model found C3 (TOCTOU) in the plugin loader, which sits in a section of code that 'looks boring' — it's just file opens and stat calls.

Multiple independent models. Using o3 and gpt-5.2 as separate reviewers reduced the chance of systematic gaps. They caught overlapping issues (validating the finding) and complementary issues (extending coverage). Cross-model agreement on a finding was a strong signal that it was real.

Speed and cost. Both audits ran in minutes and cost a few dollars of API credit. A comparable human security review of ~1000 lines of security-sensitive C++ would take hours and cost significantly more.

Concrete output. The models didn't just say 'this looks bad' — they named the exact function, explained the attack vector with a concrete exploit example, and proposed a specific fix. The fix for C1 (switch to binary frames) was exactly what was implemented.
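The shape of that binary-frame fix can be sketched in a few lines of Python (a hedged illustration only — the real implementation is C++, and the cap constant here is invented for the example). A length prefix makes the payload opaque, so an embedded newline is just data:

```python
import struct

MAX_FRAME = 1 << 20  # hypothetical cap for the sketch

def pack_frame(payload: bytes) -> bytes:
    # 4-byte big-endian length prefix; payload bytes pass through opaquely,
    # so embedded newlines can no longer alter framing.
    return struct.pack(">I", len(payload)) + payload

def unpack_frame(buf: bytes) -> bytes:
    (n,) = struct.unpack(">I", buf[:4])
    if n > MAX_FRAME:
        raise ValueError("frame too large")
    return buf[4:4 + n]

url = b"https://example.com/a.psafe3\nDELETE /etc/passwd"
assert unpack_frame(pack_frame(url)) == url  # newline survives as data, not as a frame break
```

The design choice is the standard one: never let content bytes double as protocol control bytes.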

Raised the bar for self-review. After reading the model findings, the subsequent manual self-review was sharper. Knowing the models had focussed on IPC protocol and memory safety, the self-review deliberately looked at a different layer — fd lifecycle — which is where SOCK_CLOEXEC was found.

Cons

Hallucination and false positives require triage. Not every finding was real or actionable. Several Low/Info findings were technically correct observations but not practically exploitable, or described 'risks' that were ruled out by other constraints (e.g., world-readable .so in dev builds — the threat model for a dev machine is different). Expert triage is required before acting on findings.

Context blindness. The model cannot run the code, read /proc, or observe runtime behaviour. It missed the SOCK_CLOEXEC issue (fd inheritance) likely because that requires reasoning about the runtime process tree — child processes spawned by the GUI after the socketpair is created — rather than just reading the source. Dynamic vulnerabilities (race conditions, timing windows) are similarly hard for a static-only reviewer.

Systematic blind spots. Both models independently missed SOCK_CLOEXEC. This suggests a shared gap in how LLMs reason about Unix fd inheritance across fork/exec boundaries — a well-known footgun in systems programming. LLMs appear to review code more like a static analyser than like a systems programmer who asks 'what happens to this fd in every possible process state?'
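The fd-inheritance footgun is easy to demonstrate. In C the fix is passing SOCK_CLOEXEC to socketpair(); the Python sketch below mirrors the idea (Python already sets close-on-exec by default per PEP 446, so we deliberately toggle the flag both ways to show the buggy and fixed states):

```python
import fcntl
import os
import socket

# An fd created without close-on-exec leaks into every child process the
# GUI later spawns — the class of bug both models missed.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Simulate the buggy state: fd inheritable across exec (FD_CLOEXEC clear).
os.set_inheritable(a.fileno(), True)
assert not (fcntl.fcntl(a.fileno(), fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

# The fix (SOCK_CLOEXEC in C): the kernel closes the fd automatically in
# any exec'd child, so it never leaks into unrelated processes.
os.set_inheritable(a.fileno(), False)
assert fcntl.fcntl(a.fileno(), fcntl.F_GETFD) & fcntl.FD_CLOEXEC
```

Reasoning about which flag is set requires picturing the whole process tree at exec time — precisely the runtime view a static-only reviewer lacks.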

No verification of fixes. The models propose fixes but cannot verify them. The recv_string fix (MAX_IPC_URL check before allocation) was suggested correctly, but a model suggesting the fix has no way to confirm it doesn't break the protocol in edge cases. The developer must own the fix.
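For reference, the essence of that fix — check the claimed length against a cap before allocating — looks like this Python sketch (hypothetical names and cap value; the real recv_string is C++):

```python
import io
import struct

MAX_IPC_URL = 8192  # hypothetical cap mirroring the fix

def recv_string(stream) -> bytes:
    # Read the 4-byte length prefix and reject oversized claims BEFORE
    # allocating or reading the payload, so a hostile peer cannot force
    # a huge allocation (the H9 OOM-crash-with-lock-held scenario).
    (n,) = struct.unpack(">I", stream.read(4))
    if n > MAX_IPC_URL:
        raise ValueError("claimed length exceeds MAX_IPC_URL")
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("short read")
    return data

assert recv_string(io.BytesIO(struct.pack(">I", 5) + b"hello")) == b"hello"

# A peer claiming a 2 GiB string is rejected without any allocation.
try:
    recv_string(io.BytesIO(struct.pack(">I", 2**31)))
    assert False
except ValueError:
    pass
```

Whether such a fix breaks the protocol in edge cases (legitimate long URLs, partial reads) is exactly what the model cannot verify and the developer must.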

Volume management. 19 findings from one model, 16 from another. Many were Low or Info. Without discipline in triage, the volume of output can be overwhelming and lead to important issues being lost in noise. We adopted a tiered approach: fix all Critical and High immediately, triage Medium, document known-limitation Low/Info.

Models do not know your threat model. Several findings were technically correct but misaligned with the actual deployment scenario (single-user personal password manager on a trusted machine). The model correctly noted that world-readable plugin .so files in dev builds are a risk — but the threat model for a dev machine is not the same as a production server. Contextualising findings requires human judgement.

Verdict

LLM-assisted review with 'ask' is a high-value addition to a security process — but as a first pass, not a final pass. It found real Critical issues that would have been embarrassing (or worse) to ship. The cost-to-value ratio was excellent.

The right model is: LLM review → triage → fix → targeted self-review looking for what LLMs typically miss → consider formal verification or fuzzing for the highest-risk paths. The self-review step is not optional — SOCK_CLOEXEC would have been missed entirely otherwise.

version: 1
created: 2026-02-27