240 runs. 4 vendors. Facts an agent can't recall.
We gave four models the same dependency decisions twice: once blind, and once with authoritative capability facts they couldn't recall from training. Across all four models, correct decisions rose from 20% to 78%, and the lift held for every vendor, so it is not a single-model artifact.
| Model | Control | With facts | Lift |
|---|---|---|---|
| Anthropic Opus 4.8 | 20% | 80% | +60pp |
| OpenAI GPT-5.5 | 30% | 80% | +50pp |
| Google Gemini 3 Pro | 13% | 80% | +67pp |
| DeepSeek V4 Pro | 17% | 70% | +53pp |
The lift holds across Anthropic, OpenAI, Google, and a cheap open-weight model (DeepSeek), at +50 to +67pp each. It is a property of the facts, not of any one vendor.
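As a quick check on the table, the per-model lift and the headline averages can be recomputed from the published percentages. This is a minimal sketch, not the benchmark's own scoring code: the variable names are illustrative, the inputs are the rounded figures from the table, and the plain average assumes every model had the same number of runs per condition.

```python
# Accuracy in percent, copied from the table: (control, with facts).
results = {
    "Anthropic Opus 4.8": (20, 80),
    "OpenAI GPT-5.5": (30, 80),
    "Google Gemini 3 Pro": (13, 80),
    "DeepSeek V4 Pro": (17, 70),
}

# Lift in percentage points is the treatment rate minus the control rate.
lifts = {model: treated - control for model, (control, treated) in results.items()}

# Unweighted means; valid as pooled rates only if run counts per model are equal.
mean_control = sum(control for control, _ in results.values()) / len(results)
mean_treated = sum(treated for _, treated in results.values()) / len(results)

for model, lift in lifts.items():
    print(f"{model}: +{lift}pp")
print(f"mean control {mean_control:.1f}%, mean with facts {mean_treated:.1f}%")
```

The means come out at 20.0% and 77.5%, which is consistent with the 20% to 78% headline after rounding.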
Ten synthetic packages, each with a known-correct dependency choice, were put to four models (Anthropic Opus 4.8, OpenAI GPT-5.5, Google Gemini 3 Pro, and DeepSeek V4 Pro) under two conditions: no facts (control) and with Starlog's capability facts (treatment). With three repetitions each, that gives 10 packages × 4 models × 2 conditions × 3 repetitions = 240 runs in total, graded GO with zero classifier fallbacks.
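The run count follows directly from the design described above. The sketch below enumerates the grid to make the arithmetic concrete; the package names are placeholders (the real synthetic packages are not named here), and the condition labels are illustrative.

```python
from itertools import product

# Placeholder names: the ten synthetic packages are not named in the write-up.
packages = [f"pkg-{i:02d}" for i in range(1, 11)]
models = [
    "Anthropic Opus 4.8",
    "OpenAI GPT-5.5",
    "Google Gemini 3 Pro",
    "DeepSeek V4 Pro",
]
conditions = ["control", "facts"]  # no facts vs. capability facts supplied
repetitions = range(1, 4)          # three repetitions per cell

# One run per (package, model, condition, repetition) combination.
runs = list(product(packages, models, conditions, repetitions))

# Runs behind each percentage in the per-model table.
runs_per_model_condition = len(runs) // (len(models) * len(conditions))

print(len(runs))                 # 240
print(runs_per_model_condition)  # 30
```

Each percentage in the per-model table therefore rests on 30 runs, so a single run moves a model's score by roughly 3 percentage points.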
Caveat one: the packages are synthetic by construction, so no model can have memorized them. That makes the 20% → 78% swing a best-case ceiling: a clean measurement of what authoritative facts add when recall is impossible, not a deployment estimate.
Caveat two: in the treatment condition the agent is told to call the facts tool. This is a value test by design. It measures what the facts are worth once delivered, not how often an agent chooses to reach for them. On clean control packages the agent stayed correct 36/36 with facts versus 1/36 without, which is why we read the swing as signal rather than as the facts acting as an answer key.
20% → 78%: correct decisions, control vs facts
240: runs across 4 vendors
+50–67pp: lift, every vendor
npx starloghq init