
Methodology

Last updated: 2026-05-09 · methodology v2.4

Every number on this site has a verifiable source. This page documents where the data comes from, how often it updates, and what assumptions go into our derived metrics.

Pricing data

Primary source: OpenRouter's public model catalog. We fetch it daily and commit a per-day snapshot to the repo, so every price change is git-traceable.
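
For concreteness, a minimal sketch of that snapshot step. The OpenRouter catalog endpoint is its real public URL; the data/snapshots path and file naming are illustrative, not necessarily our exact repo layout.

// Daily pricing snapshot: fetch the catalog, write one dated JSON file.
import { mkdir, writeFile } from "node:fs/promises";

async function snapshotPricing(): Promise<void> {
  const res = await fetch("https://openrouter.ai/api/v1/models");
  if (!res.ok) throw new Error(`OpenRouter fetch failed: ${res.status}`);
  const catalog = await res.json();

  // One file per day, e.g. data/snapshots/2026-05-09.json (path is
  // illustrative), so `git log -- data/snapshots` replays every change.
  const day = new Date().toISOString().slice(0, 10);
  await mkdir("data/snapshots", { recursive: true });
  await writeFile(`data/snapshots/${day}.json`, JSON.stringify(catalog, null, 2));
}

snapshotPricing().catch((err) => {
  console.error(err);
  process.exit(1);
});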

Cross-check source: LiteLLM's open-source pricing table (MIT license, community-maintained). When OpenRouter and LiteLLM disagree by more than 5% on any model's input or output rate, our build emits a drift report and the CI gate fails until a maintainer reconciles the two; see /data-sources for the live drift status.
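
A sketch of that drift gate, assuming both feeds have already been normalized into the same id → rates map (the normalization itself is the fiddly part and is elided here):

type Rates = { input: number; output: number }; // USD per token

function driftReport(
  openrouter: Record<string, Rates>,
  litellm: Record<string, Rates>,
  threshold = 0.05,
): string[] {
  const report: string[] = [];
  for (const [model, or] of Object.entries(openrouter)) {
    const ll = litellm[model];
    if (!ll) continue; // only cross-check models present in both feeds
    for (const side of ["input", "output"] as const) {
      if (ll[side] === 0) continue; // skip free-tier rates
      const drift = Math.abs(or[side] - ll[side]) / ll[side];
      if (drift > threshold) {
        report.push(`${model} ${side}: drift ${(drift * 100).toFixed(1)}%`);
      }
    }
  }
  return report; // a non-empty report fails the CI gate
}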

Update mechanism: a GitHub Action runs at 06:00 UTC every day, fetches both feeds, and opens a pull request with the diff. A maintainer reviews the PR before merging — no automated direct commits.

Augmentation: when neither feed has an entry (recently released models, Chinese providers, niche frontier releases), we add one manually in data/models/augments.json. All augments are version-controlled and reviewable.
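
As an illustration, an augment entry might carry fields along these lines; the names here are hypothetical, not the actual schema of augments.json:

interface Augment {
  id: string;            // e.g. "provider/model-name"
  inputPerMTok: number;  // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
  source: string;        // URL of the provider's pricing page
  addedBy: string;       // maintainer handle, for review accountability
}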

Quality scores

Source: the “Quality” column shows the Artificial Analysis Intelligence Index — a third-party composite of MMLU, GPQA, HumanEval, MATH, and other benchmarks. The “Arena Elo” field, when present, is sourced from LMArena.

Provenance protocol: every quality number on the site carries one of three states, visible on model-detail pages as a small chip next to the score:

  • measured — sourced from the partner feed with a verifiable date.
  • editorial estimate — our best read of public benchmark threads while waiting on partner data. No measurement claim, no fabricated precision.
  • absent — not yet scored. The model still renders, but its quality column is blank rather than a fabricated zero.
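
One way to model those states in code, as a sketch (the type and field names are illustrative, not our actual component code):

type Provenance =
  | { state: "measured"; source: string; measuredAt: string } // ISO date
  | { state: "editorial-estimate"; note: string }
  | { state: "absent" };

function qualityCell(p: Provenance, score?: number): string {
  switch (p.state) {
    case "measured":
      return `${score} (measured ${p.measuredAt})`;
    case "editorial-estimate":
      return `${score} (editorial estimate)`;
    case "absent":
      return ""; // blank, never a fabricated zero
  }
}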

This protocol exists because the alternative — silently presenting hand-typed numbers as measurements — was the failure pattern an internal audit flagged in May 2026.

Why we cite a third party instead of running our own benchmarks: running representative LLM benchmarks costs thousands of dollars in compute per release cycle, and methodology drift is a real concern. Artificial Analysis and LMArena publish methodologies that are transparent and frequently audited — we trust their stewardship more than we'd trust ours.

Speed & latency

Current state: our continuous latency benchmark (scripts/benchmark-latency.ts) is implemented but not yet running in CI. Until it is, the displayed tokens-per-second and P50 first-token values come from vendor-published throughput tables and Artificial Analysis snapshots, and are marked “vendor-reported” in the UI.

Future state: once the benchmark cron has accumulated a few weeks of independent measurements per model, the chips will flip from vendor-reported to measured. Real-world numbers vary with prompt length, region, server load, and time of day. Treat the displayed values as order-of-magnitude estimates, not SLAs.
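
For readers who want to reproduce a measurement themselves, here is a rough sketch of a time-to-first-token probe against OpenRouter's streaming chat completions endpoint. The endpoint is real; the prompt, error handling, and sampling are simplified relative to scripts/benchmark-latency.ts.

async function firstTokenMs(model: string): Promise<number> {
  const start = performance.now();
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      stream: true,
      messages: [{ role: "user", content: "Say 'ok'." }],
    }),
  });
  if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`);

  // The first streamed chunk approximates time-to-first-token; a real
  // probe would parse SSE events and repeat runs to compute a P50.
  const reader = res.body.getReader();
  await reader.read();
  await reader.cancel();
  return performance.now() - start;
}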

Monthly cost calculations

The calculator and per-scenario cost figures use this formula:

monthly_cost =
  (input_tokens × input_rate × cache_factor × batch_factor) +
  (output_tokens × reasoning_multiplier × output_rate × batch_factor) +
  (image_inputs × image_price)

Where cache_factor = (1 - cache_hit) + cache_hit × 0.10 (a 90% discount on cached reads) and batch_factor = (1 - batch_ratio) + batch_ratio × 0.50 (a 50% discount via the Batch API).
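
The same formula as a runnable TypeScript function; the Usage shape and parameter names are illustrative and simply mirror the variables above. Rate units (per token vs. per million tokens) just need to match the token counts you pass in.

interface Usage {
  inputTokens: number;
  outputTokens: number;
  imageInputs: number;
  cacheHit: number;            // 0..1, fraction of input reads served from cache
  batchRatio: number;          // 0..1, fraction of traffic sent via the Batch API
  reasoningMultiplier: number; // 1 for non-reasoning models
}

function monthlyCost(
  u: Usage,
  inputRate: number,
  outputRate: number,
  imagePrice: number,
): number {
  const cacheFactor = (1 - u.cacheHit) + u.cacheHit * 0.10;     // 90% off cached reads
  const batchFactor = (1 - u.batchRatio) + u.batchRatio * 0.50; // 50% off Batch API
  return (
    u.inputTokens * inputRate * cacheFactor * batchFactor +
    u.outputTokens * u.reasoningMultiplier * outputRate * batchFactor +
    u.imageInputs * imagePrice
  );
}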

These are upper bounds; your actual cost is usually lower, because real workloads tend to have higher cache hit rates than the defaults assume.

“Right now” picks

  • Cheapest: lowest combined input + output price across all models in the selected modality.
  • Strongest: highest Quality (AA Index) score.
  • Fastest: highest sustained tokens-per-second.
  • Best Value: highest quality / cost ratio, excluding the three extremes above.
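
Putting the four rules together, the selection amounts to a handful of argmax passes. This sketch assumes a simplified Model shape (field names are illustrative), proxies cost by the combined input + output rate, and assumes each candidate list is non-empty:

interface Model {
  id: string;
  inputRate: number;  // USD per million input tokens
  outputRate: number; // USD per million output tokens
  quality?: number;   // AA Index; absent when not yet scored
  tps?: number;       // sustained tokens per second
}

const combinedRate = (m: Model) => m.inputRate + m.outputRate;
const argmax = (xs: Model[], score: (m: Model) => number) =>
  xs.reduce((a, b) => (score(a) >= score(b) ? a : b));

function rightNowPicks(models: Model[]) {
  const scored = models.filter((m) => m.quality !== undefined);
  const cheapest = argmax(models, (m) => -combinedRate(m));
  const strongest = argmax(scored, (m) => m.quality!);
  const fastest = argmax(models.filter((m) => m.tps !== undefined), (m) => m.tps!);
  const extremes = new Set([cheapest.id, strongest.id, fastest.id]);
  const bestValue = argmax(
    scored.filter((m) => !extremes.has(m.id)),
    (m) => m.quality! / combinedRate(m),
  );
  return { cheapest, strongest, fastest, bestValue };
}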

What's missing

  • AA Partner Key. We're currently consuming Artificial Analysis data via publicly visible leaderboards. A direct partner feed (with verified dates per model) would let us promote every score from “editorial estimate” to “measured.” In progress.
  • Continuous latency cron. Implementation lives in scripts/benchmark-latency.ts; CI activation requires a token budget for outbound OpenRouter calls.
  • Six-month price history. We started snapshotting daily on the date of our first public commit, so deep history takes time to accumulate. Prices from before that date are not reconstructed.
  • Per-language quality breakdowns (some models excel in EN but struggle in CJK).
  • Real latency from edge locations.

Found an error or have a methodology suggestion? Open an issue on GitHub.