
Consensus Verification: Six Models, One Verdict


Every AI verification service uses one model to check claims. We use six.

Single-model verification has a structural problem: when the model is wrong, it tends to be wrong in the same direction at generation and at verification, like a student grading their own exam. The failure modes are correlated: same training data, same blind spots, same confident mistakes.

Consensus verification breaks the correlation. Six models, six sets of training data, six independent evaluations. Plus one source none of them have: real-time ground truth from Yahoo Finance, FRED, EIA, and SEC EDGAR.

Six AI models guess. VEROQ knows.

How It Works

One API call. Six models run in parallel. You get back per-model verdicts, a consensus score, and any dissenting opinions with corrections.

```bash
curl -X POST https://api.veroq.ai/api/v1/verify/consensus \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"claim": "NVIDIA Q4 2024 revenue was $22.1 billion"}'
```

Response

```json
{
  "consensus_verdict": "majority_supported",
  "consensus_score": 0.833,
  "confidence": 0.89,
  "models_queried": 6,
  "verdicts": [
    { "label": "Claude",   "verdict": "supported",    "confidence": 0.94 },
    { "label": "GPT",      "verdict": "supported",    "confidence": 0.91 },
    { "label": "Gemini",   "verdict": "supported",    "confidence": 0.88 },
    { "label": "DeepSeek", "verdict": "supported",    "confidence": 0.85 },
    { "label": "Llama",    "verdict": "supported",    "confidence": 0.79 },
    { "label": "Grok",     "verdict": "contradicted", "confidence": 0.82,
      "correction": "$22.1B is Q4 FY2024 revenue (quarter ended Jan 2024), not calendar Q4 2024" }
  ],
  "dissent": [{ "label": "Grok", "verdict": "contradicted", ... }],
  "receipt_id": "vc_abc123"
}
```
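The consensus fields can be reproduced from the per-model verdicts. Here is a minimal sketch of one plausible aggregation (the post doesn't specify VEROQ's actual scoring): the score is the share of models behind the most common verdict, anything short of unanimity gets a `majority_` prefix, and disagreeing models land in `dissent`.

```python
from collections import Counter

def aggregate(verdicts):
    """Derive a consensus verdict, score, and dissent list
    from per-model verdicts (illustrative aggregation only)."""
    counts = Counter(v["verdict"] for v in verdicts)
    top, n = counts.most_common(1)[0]
    score = n / len(verdicts)
    label = top if score == 1.0 else f"majority_{top}"
    dissent = [v for v in verdicts if v["verdict"] != top]
    return {"consensus_verdict": label,
            "consensus_score": round(score, 3),
            "dissent": dissent}

verdicts = [
    {"label": "Claude",   "verdict": "supported"},
    {"label": "GPT",      "verdict": "supported"},
    {"label": "Gemini",   "verdict": "supported"},
    {"label": "DeepSeek", "verdict": "supported"},
    {"label": "Llama",    "verdict": "supported"},
    {"label": "Grok",     "verdict": "contradicted"},
]
print(aggregate(verdicts)["consensus_score"])  # 0.833
```

With five of six models agreeing, 5/6 rounds to the 0.833 shown in the response above.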

The Seven Sources

| Provider | Model | Why It's There |
|---|---|---|
| Anthropic | Claude | Strong reasoning, conservative on uncertain claims |
| OpenAI | GPT | Broad knowledge, high recall on factual data |
| Google | Gemini | Strong on structured data, dates, numbers |
| DeepSeek | DeepSeek | Different training corpus, catches Western-centric blind spots |
| Meta (Groq) | Llama | Open-weight model, fast inference, different failure modes |
| xAI | Grok | Real-time X/Twitter data, catches stale claims |
| VEROQ | Live Data | Yahoo, FRED, EIA, SEC EDGAR — real-time ground truth |

All six run in parallel, so total latency is that of the slowest model, not the sum. Cost is ~$0.002 per claim. Results are cached for 30 minutes — repeated checks are free.
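The parallel fan-out can be sketched with `asyncio`; the model latencies here are simulated sleeps rather than real API calls, but they show why wall-clock time tracks the slowest model instead of the sum.

```python
import asyncio
import time

async def query_model(name, latency):
    # Stand-in for a real model call; sleep simulates inference time.
    await asyncio.sleep(latency)
    return name, "supported"

async def fan_out():
    # All six calls start at once, so elapsed time is roughly
    # the slowest model's latency, not the sum of all six.
    models = {"Claude": 0.05, "GPT": 0.04, "Gemini": 0.03,
              "DeepSeek": 0.06, "Llama": 0.02, "Grok": 0.10}
    start = time.perf_counter()
    results = await asyncio.gather(
        *(query_model(name, lat) for name, lat in models.items()))
    elapsed = time.perf_counter() - start
    return dict(results), elapsed

verdicts, elapsed = asyncio.run(fan_out())
print(f"{len(verdicts)} verdicts in {elapsed:.2f}s")  # ~0.10s, not the 0.30s sum
```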

Why Dissent Matters

The consensus score tells you how many models agree. But the dissent array is where the real value is. When 5 models say “supported” and one says “contradicted,” that correction is almost always worth reading.

Claim: “NVIDIA beat Q4 estimates by 12%”

5 models: SUPPORTED

Grok (DISSENT): “Beat estimates by 8.6%, not 12%. The 12% figure is the YoY revenue growth rate, not the earnings beat.”

This is a subtle error that a single-model verifier would miss — the claim is close enough to pass a similarity check, but the specific number is wrong. Cross-model consensus catches it because the models are wrong in different ways.

The Ground Truth Advantage

There's a category of error that no amount of multi-model consensus catches: stale data. Every LLM has a knowledge cutoff. Ask six models “What is NVIDIA trading at?” and you get six versions of “I don't have real-time data.”

Claim: “NVIDIA is trading at $280”

Claude, GPT, Gemini, DeepSeek, Llama, Grok: UNVERIFIABLE

VEROQ Live Data: CONTRADICTED — actual price $258.30, checked 2 minutes ago

VEROQ is the 7th source in consensus verification. It checks claims against live data from Yahoo Finance (prices, fundamentals), FRED (economic indicators), EIA (energy data), and SEC EDGAR (filings, insider trades). When the models can't verify, VEROQ often can — because it has the actual number, not a training-data approximation.

This is the difference between “consensus among guesses” and “consensus plus ground truth.”
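Stripped down, the live-data check is a comparison between the number in a claim and a fresh quote. A sketch under stated assumptions: the `check_price_claim` helper and its tolerance are hypothetical, and `live_price` is hand-fed here where a real integration would pull the quote from Yahoo Finance.

```python
import re

def check_price_claim(claim, live_price, tolerance=0.01):
    """Compare a dollar figure in a claim against a live quote.

    A mismatch beyond `tolerance` (1% by default) is a contradiction;
    a claim with no extractable price is unverifiable.
    """
    m = re.search(r"\$([\d,]+(?:\.\d+)?)", claim)
    if not m:
        return {"verdict": "unverifiable"}
    claimed = float(m.group(1).replace(",", ""))
    off_by = abs(claimed - live_price) / live_price
    if off_by <= tolerance:
        return {"verdict": "supported", "claimed": claimed}
    return {"verdict": "contradicted", "claimed": claimed,
            "correction": f"actual price ${live_price:.2f}"}

result = check_price_claim("NVIDIA is trading at $280", live_price=258.30)
print(result["verdict"], "-", result["correction"])
```

With the quote from the example above, $280 is about 8% off the live $258.30, so the claim comes back contradicted with the correction attached.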

Using It

Python SDK

```python
from veroq import Veroq

client = Veroq()

# Consensus verification on a single claim
result = client.verify.consensus(
    claim="Tesla delivered 1.8M vehicles in 2024"
)

print(result.consensus_verdict)  # "majority_supported"
print(result.consensus_score)    # 0.833
print(result.dissent)            # models that disagree

# Shield any LLM output with consensus
from veroq import shield

result = shield(
    llm_output,
    consensus=True  # use 6-model consensus
)
```

As an Agent Guardrail

```python
from agents import Agent, Runner
from veroq_agentmesh import veroq_consensus_guardrail

agent = Agent(
    name="Analyst",
    output_guardrails=[veroq_consensus_guardrail(
        min_models=4,
        block_on_dissent=True
    )],
)

# Blocks if any model contradicts a claim
result = await Runner.run(agent, "Summarize AAPL earnings")
```

When to Use Consensus vs Standard Verification

| Scenario | Recommended |
|---|---|
| Real-time agent responses | Standard `/verify` (faster, cheaper) |
| Financial claims in reports | Consensus (highest accuracy) |
| CI/CD pipeline checks | Consensus (catch subtle errors before deploy) |
| Customer-facing content | Consensus (dissent catches nuance) |
| Internal research notes | Standard `/verify` (good enough) |
| Regulatory/compliance | Consensus (auditable multi-model trail) |
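The table reduces to a small routing rule. A sketch with assumed names: the scenario tags and the plain `/api/v1/verify` path are illustrative (only the consensus path appears earlier in this post).

```python
# Scenarios the table above routes to consensus verification.
NEEDS_CONSENSUS = {
    "financial_report", "cicd_check", "customer_facing", "compliance",
}

def endpoint_for(scenario: str) -> str:
    """Pick a verification endpoint for a scenario tag
    (hypothetical tags and standard-verify path)."""
    if scenario in NEEDS_CONSENSUS:
        return "/api/v1/verify/consensus"
    return "/api/v1/verify"

print(endpoint_for("compliance"))      # /api/v1/verify/consensus
print(endpoint_for("agent_response"))  # /api/v1/verify
```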

Consensus verification is live now. Free tier includes 1,000 credits/month.