2026-05-09

The Source Verification Playbook: Stopping AI Hallucinations Before Publish

Three deterministic checks plus an LLM tiebreak. The 0.8 confidence floor that keeps the bundle from shipping with a fabricated stat.

#ai
#source verification
#quality


TL;DR

LLM-drafted content hallucinates citations roughly 8% of the time on our internal eval. The most common failure mode is a confident-sounding "according to a 2024 Stanford study" with a real-looking URL that returns a 404. We catch it before publish with three deterministic checks (HEAD, domain reputation, freshness) plus an LLM judgment tiebreak, gated at a 0.8 confidence floor. Below that floor, the bundle stops.

The full code lives in packages/agents/qyndex_agents/source_ledger.py. This post is the playbook version — the four checks, in order, with the trade-offs.

Section 1 — HEAD checks (the 30-second filter)

The cheapest check kills the most failures. Before anything else, the SourceVerifier issues an HTTP HEAD to every claimed URL.

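from dataclasses import dataclass
import httpx

@dataclass
class HeadResult:  # assumed shape, inferred from the fields used below
    ok: bool
    final_url: str
    status: int
    error: str | None = None

async def head_check(url: str) -> HeadResult:
    try:
        async with httpx.AsyncClient(timeout=5.0) as c:
            resp = await c.head(url, follow_redirects=True)
        return HeadResult(
            ok=resp.status_code < 400,
            final_url=str(resp.url),
            status=resp.status_code,
        )
    except httpx.RequestError as e:
        return HeadResult(ok=False, final_url=url, status=0, error=str(e))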

What this catches:

404s on fabricated URLs. The most common LLM failure mode. Models confabulate plausible-looking URLs; HEAD calls them out.
Redirect chains that land somewhere unexpected. The final_url field is what we cite, not the model's claimed URL — if the redirect goes to a paywall or a parked domain, the check fails.
Domain-wide outages. A bundle that cites three sources, one of which is briefly down, fails the gate. We retry once after 30s to absorb transient blips, then drop the source.

Performance: ~50ms per URL, 5 URLs/bundle on average, parallelised. Negligible cost vs. the LLM tiebreak that comes later.
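The retry-once policy and the parallel fan-out could be sketched roughly as below. This is a minimal sketch, not the code in source_ledger.py: check_with_retry, check_all, and the injectable checker parameter are hypothetical names, and the checker is only assumed to return an object with an ok attribute, as HeadResult does.

```python
import asyncio
from typing import Awaitable, Callable, Protocol


class HasOk(Protocol):
    ok: bool


# Hypothetical type: any async function that checks a single URL.
Checker = Callable[[str], Awaitable[HasOk]]


async def check_with_retry(
    url: str, checker: Checker, retry_delay: float = 30.0
) -> HasOk:
    """Run the check once; on failure, wait and try exactly one more time."""
    result = await checker(url)
    if result.ok:
        return result
    await asyncio.sleep(retry_delay)  # absorb transient blips
    return await checker(url)


async def check_all(
    urls: list[str], checker: Checker, retry_delay: float = 30.0
) -> dict[str, HasOk]:
    """Check every URL concurrently and map each URL to its final result."""
    results = await asyncio.gather(
        *(check_with_retry(u, checker, retry_delay) for u in urls)
    )
    return dict(zip(urls, results))
```

Passing head_check as the checker would give the behaviour described above: sources still failing after the second attempt come back with ok=False and can be dropped. Injecting the checker also lets the policy be tested without network access.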

Section 2 — Domain reputation

A 200 response from tinyurl.com/abc123 is not the same signal as a 200 from arxiv.org/abs/2401.00001. We score every final URL's domain against a curated reputation list.

DOMAIN_REPUTATION: dict[str, float] = {
    # Tier A: peer-reviewed, primary source
    "arxiv.org": 1.0,
    "doi.org": 1.0,
    "nature.com": 1.0,
    # Tier B: institutional / vendor primary
    "anthropic.com": 0.9,
    "openai.com": 0.9,
    "stanford.edu": 0.9,
    # Tier C: high-quality secondary
    "wikipedia.org": 0.7,
    "github.com": 0.7,
    # Tier D: blogs / unverified
    # ...defaults to 0.5
}

The list is hand-curated, not algorithmic — domain-rep services are a moving target and we would rather audit a 200-row YAML file than chase a vendor's API. Anything not on the list defaults to 0.5 ("unverified, but not blocked"). Anything on a banned-domains list (content farms, known-misinformation sources) scores 0.0 and the bundle stops.

The output is a per-source reputation_score in [0.0, 1.0] that feeds into the final confidence aggregate.
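A lookup against the curated map might look something like the sketch below. Treat the details as assumptions: the post does not say how subdomains are handled, so matching parent domains (en.wikipedia.org falling back to wikipedia.org) is our guess, BANNED_DOMAINS and its single entry are hypothetical, and the map is a three-row subset for illustration.

```python
from urllib.parse import urlsplit

# Illustrative subset of the curated map; the real list is longer.
DOMAIN_REPUTATION: dict[str, float] = {
    "arxiv.org": 1.0,
    "stanford.edu": 0.9,
    "wikipedia.org": 0.7,
}
# Hypothetical banned entry, for illustration only.
BANNED_DOMAINS: set[str] = {"content-farm.example"}
DEFAULT_SCORE = 0.5  # "unverified, but not blocked"


def reputation_score(final_url: str) -> float:
    """Score a URL by its host, matching parent domains as well."""
    host = (urlsplit(final_url).hostname or "").lower()
    labels = host.split(".")
    # For "en.wikipedia.org": try it, then "wikipedia.org", then "org".
    candidates = [".".join(labels[i:]) for i in range(len(labels))]
    if any(c in BANNED_DOMAINS for c in candidates):
        return 0.0  # banned domains stop the bundle
    for c in candidates:
        if c in DOMAIN_REPUTATION:
            return DOMAIN_REPUTATION[c]
    return DEFAULT_SCORE
```

The banned check runs first so that a banned subdomain can never inherit a good score from a listed parent.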

Section 3 — Freshness

A real, reachable, high-reputation URL can still be wrong if the content drifted. We score freshness with two signals:

HTTP Last-Modified header when present. A page last modified in 2018 is not a great citation for "2025 best practices".
Visible date in the page body (parsed via a regex pass over the rendered HTML). Required for blog/article-type sources; optional for "stable reference" sources like RFC documents.

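from datetime import UTC, datetime  # datetime.UTC needs Python 3.11+

def freshness_score(*, last_modified: datetime | None, claim_year: int) -> float:
    # claim_year is accepted but unused in this excerpt: age is measured from today.
    # last_modified must be timezone-aware, or the subtraction below raises TypeError.
    if last_modified is None:
        return 0.6  # neutral; many static refs lack the header
    age_years = (datetime.now(UTC) - last_modified).days / 365
    if age_years < 1:
        return 1.0
    if age_years < 3:
        return 0.8
    if age_years < 5:
        return 0.5
    return 0.2  # stale enough to flag, not auto-reject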

Freshness alone never fails the gate (some claims are about historical events) — it is one input to the aggregate confidence score.
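The two signals have to be reduced to a single date before scoring. One way to do that, as a sketch under stated assumptions: resolve_page_date is a hypothetical helper, preferring the header over the body is our choice rather than something the post specifies, and the regex only recognises ISO-style dates (2024-03-15), which is far narrower than a production pass over rendered HTML would need to be.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Hypothetical pattern: matches ISO-style dates such as 2024-03-15 only.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def resolve_page_date(
    last_modified_header: Optional[str], body_text: str
) -> Optional[datetime]:
    """Prefer the Last-Modified header; fall back to the first ISO date in the body."""
    if last_modified_header:
        try:
            parsed = parsedate_to_datetime(last_modified_header)
            if parsed.tzinfo is None:
                # Treat naive results as UTC so later arithmetic is safe.
                parsed = parsed.replace(tzinfo=timezone.utc)
            return parsed
        except (TypeError, ValueError):
            pass  # malformed header; try the body instead
    match = ISO_DATE.search(body_text)
    if match:
        year, month, day = (int(g) for g in match.groups())
        try:
            return datetime(year, month, day, tzinfo=timezone.utc)
        except ValueError:
            return None  # digits matched but do not form a real date
    return None
```

The result is always timezone-aware or None, so it can be passed straight to freshness_score as last_modified.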

Section 4 — LLM judgment as tiebreak (NOT primary)

This is where most teams get the order wrong. They start by asking an LLM "is this source reliable?" and stop there. We end with that question, and only after the deterministic checks pass.

The tiebreak prompt is intentionally narrow:

You are a source verifier. The claim is: "{claim}".
 
The source URL is: {final_url} (HTTP {status}).
Domain reputation tier: {tier}.
Last modified: {last_modified or "unknown"}.
 
Read the source content and answer:
1. Does the source actually contain the claim? (yes/no)
2. If yes, is the claim attributed correctly?
3. Score your confidence in [0.0, 1.0].
 
Output ONLY the JSON: {"contains_claim": bool, "attribution_correct":
bool, "confidence": float, "reasoning": str}.
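Because the prompt asks for JSON only, the reply can be validated strictly before its confidence is trusted. A minimal sketch, assuming the four keys named in the prompt; Verdict and parse_verdict are hypothetical names, and what a caller does with a None result (for example, scoring the judgment as 0.0) is left open here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    contains_claim: bool
    attribution_correct: bool
    confidence: float
    reasoning: str


def parse_verdict(raw: str) -> Optional[Verdict]:
    """Return a Verdict, or None if the reply is not the JSON we asked for."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    contains = data.get("contains_claim")
    attributed = data.get("attribution_correct")
    confidence = data.get("confidence")
    reasoning = data.get("reasoning")
    if not isinstance(contains, bool) or not isinstance(attributed, bool):
        return None
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0 or not isinstance(reasoning, str):
        return None
    return Verdict(contains, attributed, float(confidence), reasoning)
```

Rejecting anything off-schema keeps a chatty or truncated reply from silently contributing a confidence value to the aggregate.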

Why this is the tiebreak and not the primary check:

LLM judgment is unreliable on its own for citation-checking. The same model that drafted the citation cannot reliably score it.
LLM judgment is expensive — one call per source, and bundles cite 3–5 sources. A 10-bundle/day pipeline is 30–50 calls/day on the verification alone.
LLM judgment costs latency — adding 2s to every source check is fine offline, deadly in user-facing flows.

Routing: the tiebreak runs on Sonnet for most cases, escalating to Opus only when the deterministic signals disagree (HEAD-checked OK + fresh + medium reputation = ambiguous, send to Opus). About 12% of sources hit Opus; the rest stay on Sonnet.
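The routing rule can be written as a small pure function. This is only a sketch: the post does not say where "fresh" or "medium reputation" begin, so FRESH_MIN and MEDIUM_REPUTATION below are assumed cutoffs, and Tier and pick_tier are hypothetical names. The enum values are labels, not model identifiers.

```python
from enum import Enum


class Tier(str, Enum):
    SONNET = "sonnet"
    OPUS = "opus"


# Assumed cutoffs; the real thresholds are not stated in the post.
FRESH_MIN = 0.8
MEDIUM_REPUTATION = (0.5, 0.9)  # inclusive lower bound, exclusive upper bound


def pick_tier(head_ok: bool, freshness: float, reputation: float) -> Tier:
    """Escalate only when the source is reachable and fresh but merely medium-reputation."""
    low, high = MEDIUM_REPUTATION
    ambiguous = head_ok and freshness >= FRESH_MIN and low <= reputation < high
    return Tier.OPUS if ambiguous else Tier.SONNET
```

With these cutoffs, a fresh, reachable wikipedia.org page (0.7) would escalate, while a fresh arxiv.org page (1.0) would stay on the cheaper tier.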

Section 5 — The 0.8 confidence floor

The four checks above produce a per-source confidence score:

confidence = (
    0.30 * head_check_score        # 1.0 if HEAD ok, 0.0 if 404
    + 0.25 * reputation_score      # 0.0 - 1.0 from the curated map
    + 0.20 * freshness_score       # 0.2 - 1.0 from the age formula
    + 0.25 * llm_judgment_score    # 0.0 - 1.0 from the tiebreak
)

The bundle requires every cited source to score >= 0.8. Not the average — the minimum. One bad source kills the bundle.
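To make the minimum-versus-average point concrete, here is a small runnable sketch using the weights from the formula above. SourceScores, confidence, and bundle_passes are hypothetical names, and the example scores are made up for illustration.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.8


@dataclass
class SourceScores:
    head: float        # 1.0 if HEAD ok, else 0.0
    reputation: float  # from the curated map
    freshness: float   # from the age formula
    llm: float         # from the tiebreak


def confidence(s: SourceScores) -> float:
    """Weighted sum using the four weights from the formula above."""
    return 0.30 * s.head + 0.25 * s.reputation + 0.20 * s.freshness + 0.25 * s.llm


def bundle_passes(sources: list[SourceScores]) -> bool:
    """Gate on every source individually: one weak source stops the bundle."""
    return all(confidence(s) >= CONFIDENCE_FLOOR for s in sources)


strong = SourceScores(head=1.0, reputation=1.0, freshness=1.0, llm=0.9)
unlisted = SourceScores(head=1.0, reputation=0.5, freshness=0.8, llm=0.9)
stale = SourceScores(head=1.0, reputation=0.5, freshness=0.5, llm=0.9)
```

Here strong scores about 0.975, unlisted about 0.81, and stale about 0.75. A bundle of strong and stale averages roughly 0.86, which would clear an average-based gate, but bundle_passes rejects it because stale is below the floor. Note also how little margin an unlisted domain has: at the 0.5 default it needs near-perfect scores everywhere else to pass.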

The 0.8 floor is empirical:

Below 0.6: ~30% of sources had outright fabricated content. Reject.
0.6 - 0.8: ~8% had subtle problems (right URL, wrong attribution; outdated; gated paywall). Borderline; we reject to stay safe.
0.8 - 1.0: < 1% had problems. Acceptable.

The floor is a product decision, not a model-tuning hyperparameter: we want zero hallucinated stats in published content, and 0.8 is where the problem rate on our eval drops below 1%.

Section 6 — What this is NOT

It is not a fact-checker. A claim like "the sky is blue" can cite a low-confidence source and the bundle should still ship; the model can verify against world knowledge. The Source Ledger checks citations, not claims.
It is not a plagiarism detector. Different tool, different problem. We trust the writer to attribute; we verify the attribution.
It does not run on every claim. Only on claims tagged [citation needed] by the writer agent — bare statements of reasoning skip the gate.
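Selecting which claims reach the gate can be as simple as a filter over the draft. A minimal sketch, assuming the writer agent leaves the tag as literal text inside the sentence; the post does not describe the actual tag format, and claims_to_verify is a hypothetical name.

```python
CITATION_TAG = "[citation needed]"  # assumed to appear literally in the draft


def claims_to_verify(sentences: list[str]) -> list[str]:
    """Keep only tagged sentences, with the tag stripped for downstream checks."""
    return [
        s.replace(CITATION_TAG, "").strip()
        for s in sentences
        if CITATION_TAG in s
    ]
```

Untagged sentences never enter the verifier, which is what keeps bare statements of reasoning from paying the verification cost.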

This is also the gate that backstops the Reflexion QC loop in our supervisor graph: a critic that scores quality but not factuality is a critic that ships hallucinations on a 90+ rubric. Source Ledger plugs that hole before QC's rubric runs.

Closing

Source verification is not a single LLM call. It is three deterministic checks plus a narrow LLM tiebreak, gated at 0.8 confidence. The deterministic checks catch the cheap failures (fabricated URLs, dead links, low-reputation domains, stale content) in milliseconds; the LLM tiebreak handles the ambiguous middle. The 0.8 floor is what stops the bundle.

For the printable version with the four-check sequence as a copy-paste runbook, grab the gated download.

For how Source Ledger plugs into the agent loop architecture and the cost-aware deployment patterns, see the linked posts — same supervisor graph, three different viewpoints.

Drafted by AI agents, reviewed by Shravan.