# Benchmarks
Every number on this page is measured, not claimed. The live specialist accuracies come directly from the backend's /tasks endpoint — test-set evaluation at startup, not training metrics.
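If you want to pull those numbers yourself, the endpoint can be queried directly. A minimal sketch, assuming the backend is reachable on localhost:8000 and that /tasks returns a JSON list whose entries carry name, accuracy, and status fields (the base URL and field names are assumptions, not the documented schema):

```python
# Hedged sketch: the base URL and the field names ("name", "accuracy",
# "status") are assumptions about the /tasks response shape.
import requests

BACKEND = "http://localhost:8000"  # wherever your backend is running

resp = requests.get(f"{BACKEND}/tasks", timeout=10)
resp.raise_for_status()

for task in resp.json():
    # Each entry should carry the live test-set accuracy measured at startup.
    print(task["name"], task["accuracy"], task["status"])
```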
## Live specialist accuracy
Each row is a Soto V8 specialist encoder (7.2M params, 27 MB fp32) fine-tuned on the task's training set and evaluated on its held-out test set. Numbers refresh on every backend restart.
| Task | Category | Dataset | Classes | Test set | Accuracy | Status |
|---|---|---|---|---|---|---|
| Assistant intent | Intent | hwu64 | 54 | 2,570 | 86.3% | production |
| Banking intent | Intent | banking77 | 77 | 1,500 | 86.9% | production |
| General intent | Intent | clinc150 | 150 | 4,500 | 88.8% | production |
| Movie review sentiment | Sentiment | imdb | 2 | 5,000 | 76.2% | production |
| Short sentiment | Sentiment | sst2 | 2 | 872 | 70.5% | production |
| Yelp review sentiment | Sentiment | yelp | 2 | 5,000 | 82.9% | production |
| News topic | Topic | ag_news | 4 | 7,599 | 88.6% | production |
| Question type | Topic | trec | 6 | 500 | 88.2% | production |
| Wikipedia topic | Topic | dbpedia | 14 | 5,000 | 96.9% | production |
| Fine-grained emotion | Emotion | goemotions | 28 | 5,000 | 49.7% | beta |
| Multilingual intent | Intent | massive | 60 | 5,000 | 40.1% | beta |
| Movie critics sentiment | Sentiment | rotten_tomatoes | 2 | 1,067 | 67.7% | beta |
| Poem sentiment | Sentiment | poem | 4 | 104 | 56.7% | beta |
| Yahoo category | Topic | yahoo | 10 | 5,000 | 59.7% | beta |
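Each accuracy figure is plain top-1 accuracy on the held-out split. A minimal sketch of that computation, assuming a hypothetical `specialist.predict(texts)` wrapper that returns one class index per text (illustrative only, not the actual Soto API) and a Hugging Face dataset with `text`/`label` columns:

```python
from datasets import load_dataset

def test_accuracy(specialist, dataset_name: str, split: str = "test") -> float:
    """Top-1 accuracy of a fine-tuned specialist on a held-out split."""
    ds = load_dataset(dataset_name, split=split)
    preds = specialist.predict(ds["text"])  # hypothetical: one class index per example
    correct = sum(int(p == y) for p, y in zip(preds, ds["label"]))
    return correct / len(ds)

# e.g. test_accuracy(banking_specialist, "banking77")  # table reports 86.9%
```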
## Banking77 head-to-head
Banking77 is the canonical classification benchmark we've optimized for. Competitor numbers come from published sources; the Soto number is from our own test run.
| Model | Params | Size | Banking77 | $/1M | Latency |
|---|---|---|---|---|---|
| Soto V8 (ours) | 7.2M | 27 MB | 86.9% | $0.01 | 8 ms |
| DistilBERT (fine-tuned) | 66M | 265 MB | 90.1% | $0.35 | 18 ms |
| BERT-base (fine-tuned) | 110M | 440 MB | 93.1% | $0.55 | 30 ms |
| Claude Sonnet (few-shot) | — | cloud | ~92% | $225 | 1,200 ms |
| GPT-4o (few-shot) | — | cloud | ~93% | $225 | 1,500 ms |
Fine-tuned BERT-base beats us by about 6 points of accuracy; we beat it by roughly 16× on disk size, nearly 4× on latency, and 55× on cost. It's a different product for different deployments.
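Those multipliers fall straight out of the table; a quick sanity check in Python:

```python
# Ratios from the BERT-base vs. Soto V8 rows of the head-to-head table.
soto = {"size_mb": 27, "latency_ms": 8, "cost_per_1m": 0.01, "accuracy": 0.869}
bert = {"size_mb": 440, "latency_ms": 30, "cost_per_1m": 0.55, "accuracy": 0.931}

print(f"accuracy gap: {100 * (bert['accuracy'] - soto['accuracy']):.1f} pp")  # 6.2 pp
print(f"disk size:    {bert['size_mb'] / soto['size_mb']:.1f}x")              # 16.3x
print(f"latency:      {bert['latency_ms'] / soto['latency_ms']:.2f}x")        # 3.75x
print(f"cost:         {bert['cost_per_1m'] / soto['cost_per_1m']:.0f}x")      # 55x
```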
## Coming soon — published benchmarks
- MTEB (Massive Text Embedding Benchmark) — retrieval, classification, clustering, and STS across 56 tasks. Target: top 20 overall after retrieval V2 training (see the reproduction sketch after this list).
- GLUE + SuperGLUE — classic NLU suites. We'll report per-task scores and an honest comparison to RoBERTa-base.
- BEIR — 18 retrieval datasets, after the MS MARCO + NLI + STS retrieval fine-tune.
- BLURB (biomedical) and LexGLUE (legal) — after domain-pretrained V8-bio and V8-legal.
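Once the MTEB run lands, it should be reproducible with the open-source `mteb` harness. A minimal sketch, assuming a `SotoEncoder` wrapper (a hypothetical name, not the real API) that exposes the `encode()` method mteb expects; the random vectors are a stand-in for the real forward pass:

```python
import numpy as np
from mteb import MTEB

class SotoEncoder:
    """Hypothetical adapter around the Soto V8 encoder -- not the real API."""
    def encode(self, sentences, **kwargs):
        # Replace with the actual forward pass; random vectors keep the sketch runnable.
        return np.random.rand(len(sentences), 256)

evaluation = MTEB(tasks=["Banking77Classification"])  # one of the MTEB tasks
evaluation.run(SotoEncoder(), output_folder="results/soto-v8")
```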
Methodology, reproducibility instructions, and further notes are in web/docs/BENCHMARKS.md.