SciRouter public benchmarks

Three standard benchmarks: two small-molecule (TDC ADMET and MoleculeNet) and one for the literature-confirmation engine itself (PubMed Q&A consistency), plus one vet-specific ADMET holdout. Methodology and the full audit trail are published alongside the scores. Failures and gaps are surfaced, not hidden.

v1 shows the benchmark cards with placeholder scores. The actual runners land in the P7-F follow-up: TDC and MoleculeNet require local held-out evaluation, and PubMed Q&A consistency needs the literature-confirmation engine to accumulate enough dossiers to measure against the expert key.

TDC ADMET (Caco-2, BBB, hERG, AMES)

Awaiting v1 run

Four small-molecule ADMET tasks from the Therapeutics Data Commons. Tests the gateway's `predict_admet` primitive against held-out TDC sets.

Metric

Mean AUROC

SciRouter

—

Reference baseline

0.82 (TDC reference)

Benchmark runner lands in P7-F follow-up. Will publish per-task AUROCs + failure cases (where the primitive is wrong) alongside the headline number.

Source: Therapeutics Data Commons ↗
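
To make the headline metric concrete, here is a minimal sketch of how a mean AUROC over the four tasks could be computed once held-out predictions exist. It is not the P7-F runner. The `held_out` dictionary and its values are made up for illustration, and every task is assumed to be framed as binary classification. Caco-2 is commonly distributed as a continuous permeability value, so it would need a binarization threshold that this page does not specify.

```python
from statistics import mean


def auroc(labels, scores):
    """AUROC via pairwise comparison (Mann-Whitney U); ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out data: task -> (true binary labels, predicted scores).
held_out = {
    "Caco-2": ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4]),
    "BBB": ([1, 1, 0, 0], [0.8, 0.3, 0.5, 0.1]),
    "hERG": ([0, 1, 0, 1], [0.3, 0.7, 0.6, 0.9]),
    "AMES": ([1, 0, 0, 1], [0.4, 0.4, 0.2, 0.8]),
}

# Per-task AUROCs are kept so they can be published next to the headline mean.
per_task = {task: auroc(y, s) for task, (y, s) in held_out.items()}
mean_auroc = mean(per_task.values())
```

Keeping `per_task` next to `mean_auroc` matches the stated plan to publish per-task numbers rather than only the average.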

MoleculeNet (BACE / BBBP / ToxCast)

Awaiting v1 run

Three small-molecule activity prediction tasks from MoleculeNet. Lower bound for the `generate_molecules` + `mol_properties` pipeline.

Metric

ROC-AUC (mean over 3 splits)

SciRouter

—

Reference baseline

0.74 (MoleculeNet GNN ref)

Same runner pattern as TDC. Per-task breakdowns and failed predictions are surfaced on the detail page.

Source: MoleculeNet ↗
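
As a rough illustration of "mean over 3 splits" and of surfacing failed predictions, the sketch below averages per-split ROC-AUC values for one task and lists the predictions that disagree with the label. The split scores, the record fields (`id`, `label`, `score`), and the 0.5 decision threshold are all assumptions made for this example, not the runner's actual schema.

```python
from statistics import mean, stdev


def summarize_splits(scores):
    """Mean and sample standard deviation of per-split ROC-AUC values."""
    return {"mean": mean(scores), "std": stdev(scores)}


def failed_predictions(records, threshold=0.5):
    """Records whose thresholded score disagrees with the label,
    ordered so the most confident errors come first."""
    failures = [
        r for r in records
        if (r["score"] >= threshold) != bool(r["label"])
    ]
    return sorted(failures, key=lambda r: abs(r["score"] - threshold), reverse=True)


# Hypothetical ROC-AUC values for one task across three splits.
bace_summary = summarize_splits([0.80, 0.78, 0.82])

# Hypothetical per-compound predictions for one split.
records = [
    {"id": "mol-1", "label": 1, "score": 0.90},
    {"id": "mol-2", "label": 0, "score": 0.95},  # confident false positive
    {"id": "mol-3", "label": 1, "score": 0.40},  # mild false negative
    {"id": "mol-4", "label": 0, "score": 0.10},
]
failures = failed_predictions(records)
```

Sorting by distance from the threshold puts the most confidently wrong predictions at the top of a failure list, which is usually where a reader wants to start.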

PubMed Q&A consistency

Awaiting v1 run

100 cross-species oncology questions. The literature-confirmation engine builds a dossier per question, and each dossier is scored against an expert-curated answer key. Eval-set composition: ≥15 veterinary-oncology questions from COTC + Morris Animal Foundation + VetCompass cohort studies.

Metric

Status agreement (confirmed / contested / unsupported)

SciRouter

—

Reference baseline

—

This benchmarks the literature-triangulation engine itself, not a primitive. Lands in P7-F follow-up after the daily-cron + paper-ingest hook produces enough dossiers to measure.

Source: SciRouter (internal expert-curated)
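
One plausible reading of "status agreement" is the fraction of questions where the engine's dossier status matches the expert key, reported with a confusion breakdown so disagreements stay visible. The sketch below assumes that reading. The function name, question IDs, and statuses are invented for illustration.

```python
from collections import Counter

STATUSES = ("confirmed", "contested", "unsupported")


def status_agreement(engine, expert):
    """Return (agreement rate, confusion counts) over questions present in both maps.

    Confusion keys are (expert_status, engine_status) pairs.
    """
    shared = sorted(engine.keys() & expert.keys())
    if not shared:
        raise ValueError("no questions in common between engine output and expert key")
    for q in shared:
        if engine[q] not in STATUSES or expert[q] not in STATUSES:
            raise ValueError(f"unknown status for question {q!r}")
    confusion = Counter((expert[q], engine[q]) for q in shared)
    agree = sum(n for (exp, eng), n in confusion.items() if exp == eng)
    return agree / len(shared), confusion


# Hypothetical expert key and engine output for four questions.
expert_key = {"q1": "confirmed", "q2": "contested", "q3": "unsupported", "q4": "confirmed"}
engine_out = {"q1": "confirmed", "q2": "confirmed", "q3": "unsupported", "q4": "contested"}

rate, confusion = status_agreement(engine_out, expert_key)
```

The off-diagonal confusion entries (for example, expert says contested but the engine says confirmed) are the failure cases worth publishing alongside the rate.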

Vet-ADMET holdout (canine + feline PK)

Awaiting v1 run (50-compound holdout in P7-L follow-up)

50 compounds with published canine and/or feline PK parameters drawn from Morris Animal Foundation + COTC trial literature + veterinary pharmacology references. Tests how well `predict_admet` (human-trained underneath) predicts canine Vd / Cl / T₁/₂ / oral bioavailability vs. observed vet PK. This is the comparative-oncology moat benchmark: the value proposition is being honest about where human-trained ADMET breaks down for veterinary medicine.

Metric

Per-species R² + bias direction

SciRouter

—

Reference baseline

—

v1 ships the card + the methodology — surfacing per-species PK gaps with P-gp / UGT1A6 / CYP2D15 risk chips per compound. Runner lands in P7-L follow-up alongside the curated 50-compound holdout set.

Source: SciRouter (internal curated)
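
The "per-species R² + bias direction" metric could be computed along the following lines. This is a sketch under stated assumptions: the clearance values are invented, R² is taken on raw values (a log scale may be more appropriate for PK parameters), and bias direction is defined here as the sign of the mean signed error (predicted minus observed).

```python
from statistics import mean


def r_squared(observed, predicted):
    """Coefficient of determination with the observed values as the reference."""
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance; R² is undefined")
    return 1.0 - ss_res / ss_tot


def bias_direction(observed, predicted):
    """Label the sign of the mean signed error (predicted minus observed)."""
    err = mean([p - o for o, p in zip(observed, predicted)])
    if err > 0:
        return "over-prediction"
    if err < 0:
        return "under-prediction"
    return "unbiased"


# Hypothetical clearance values per species: (observed, predicted).
holdout = {
    "canine": ([10.0, 20.0, 30.0], [12.0, 22.0, 32.0]),
    "feline": ([5.0, 10.0, 15.0], [4.0, 8.0, 15.0]),
}

report = {
    species: {"r2": r_squared(obs, pred), "bias": bias_direction(obs, pred)}
    for species, (obs, pred) in holdout.items()
}
```

Reporting the bias label with R² separates two failure modes: a model can rank compounds well within a species while still being systematically too high or too low.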

Benchmarks v1 · 3 standard + 1 vet-specific holdout · placeholders with methodology. Real scores publish as runners land. Both passes and failures are included: no cherry-picking is a hard product rule.

How the engine actually scores.

🐾 Species-specific PK risks the vet-ADMET holdout flags