Benchmarks — How Facts Move Agent Decisions

Proof

240 runs. 4 vendors. Facts an agent can't recall.

We gave four models the same dependency decisions twice: once blind, once with authoritative capability facts they couldn't recall from training. Pooled across the four models, correct decisions moved from 20% to 78%, and the lift held for every vendor, so it is not a single-model artifact.

Model	Control	With facts	Lift
Anthropic Opus 4.8	20%	80%	+60pp
OpenAI GPT-5.5	30%	80%	+50pp
Google Gemini 3 Pro	13%	80%	+67pp
DeepSeek V4 Pro	17%	70%	+53pp

The lift holds across Anthropic, OpenAI, Google, and a cheap open-weight model, at +50 to +67pp each. It is a property of the facts, not of any one vendor.
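As a quick check of the arithmetic, the sketch below recomputes the per-vendor lift and the headline rates from the rounded percentages in the table. The variable names are illustrative, and the unweighted mean assumes every model ran the same number of trials per condition; because the inputs are already rounded, the results are approximate.

```python
# Correct-decision rates in percent, copied from the table: (control, with facts).
RESULTS = {
    "Anthropic Opus 4.8": (20, 80),
    "OpenAI GPT-5.5": (30, 80),
    "Google Gemini 3 Pro": (13, 80),
    "DeepSeek V4 Pro": (17, 70),
}


def lift_pp(control: float, with_facts: float) -> float:
    """Lift in percentage points: a plain difference, not a ratio."""
    return with_facts - control


lifts = {model: lift_pp(c, t) for model, (c, t) in RESULTS.items()}

# Unweighted means across models (assumes equal run counts per model).
mean_control = sum(c for c, _ in RESULTS.values()) / len(RESULTS)
mean_with_facts = sum(t for _, t in RESULTS.values()) / len(RESULTS)

for model, lift in lifts.items():
    print(f"{model}: +{lift}pp")
print(f"control {mean_control:.1f}% -> with facts {mean_with_facts:.1f}%")
```

The treatment mean comes out at 77.5%, which the page reports as 78%.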

Ten synthetic packages, each with a known-correct dependency choice, were put to four models (Anthropic Opus 4.8, OpenAI GPT-5.5, Google Gemini 3 Pro, and DeepSeek V4 Pro) under two conditions: no facts (control) and with Starlog's capability facts (treatment). Each package, model, and condition combination was run three times, for 10 × 4 × 2 × 3 = 240 runs in total, graded GO with zero classifier fallbacks.
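The run count follows directly from the design. The sketch below enumerates the full grid to show where 240 comes from; the package names are hypothetical placeholders, since the actual synthetic package names are not given here.

```python
from itertools import product

# Hypothetical placeholder names; the real synthetic packages are not listed.
PACKAGES = [f"pkg-{i:02d}" for i in range(1, 11)]
MODELS = [
    "Anthropic Opus 4.8",
    "OpenAI GPT-5.5",
    "Google Gemini 3 Pro",
    "DeepSeek V4 Pro",
]
CONDITIONS = ["control", "with_facts"]
REPETITIONS = 3

# One entry per (package, model, condition, repetition) combination.
runs = [
    {"package": p, "model": m, "condition": c, "repetition": r}
    for p, m, c, r in product(PACKAGES, MODELS, CONDITIONS, range(1, REPETITIONS + 1))
]

# Runs behind each cell of the results table (one model, one condition).
runs_per_cell = len(runs) // (len(MODELS) * len(CONDITIONS))

print(len(runs), runs_per_cell)
```

Each percentage in the results table therefore rests on 30 runs, so a single run moves a cell by roughly 3 percentage points.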

Caveat one: the packages are synthetic by construction, so the model cannot have memorized them. That makes the 20%→78% swing a best-case ceiling — a clean measurement of what authoritative facts add when recall is impossible, not a deployment estimate.

Caveat two: in the treatment condition the agent is told to call the facts tool. This is a value test by design — it measures what the facts are worth once delivered, not how often an agent chooses to reach for them. On clean control packages the agent stayed correct 36/36 with facts versus 1/36 without, which is why we read the swing as signal, not an answer key.

20% → 78% correct decisions, control vs facts

240 runs across 4 vendors

+50–67pp lift, every vendor

Opus 4.8, GPT-5.5, Gemini 3 Pro, DeepSeek V4 Pro

$ npx starloghq init
