Benchmarks

Every number on this page is measured, not claimed. The live specialist accuracies come directly from the backend's /tasks endpoint — test-set evaluation at startup, not training metrics.
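
If you want to check those numbers yourself, the sketch below pulls them from the backend. The base URL and the response field names used here are assumptions for illustration, not the documented schema.

```python
# Minimal sketch: read the live specialist accuracies from the backend's
# /tasks endpoint. The base URL and field names ("name", "accuracy",
# "status") are assumptions for illustration, not the documented schema.
import requests

BASE_URL = "http://localhost:8000"  # assumed; point at your own backend

resp = requests.get(f"{BASE_URL}/tasks", timeout=10)
resp.raise_for_status()

for task in resp.json():
    # Each entry should carry the test-set accuracy measured at startup.
    print(task.get("name"), task.get("accuracy"), task.get("status"))
```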

Live specialist accuracy

Each row is a Soto V8 specialist encoder (7.2M params, 27 MB fp32) fine-tuned on the task's training set and evaluated on its held-out test set. Numbers refresh on every backend restart.
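
The size figure follows from fp32 storage at 4 bytes per parameter; the quick check below assumes the quoted 27 MB is measured in binary megabytes (MiB).

```python
# Sanity check on the 27 MB figure: fp32 stores 4 bytes per parameter.
params = 7.2e6
size_bytes = params * 4
print(size_bytes / 1e6)    # ~28.8 in decimal megabytes
print(size_bytes / 2**20)  # ~27.5 in MiB, roughly the quoted 27 MB
```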

| Task | Dataset | Classes | Test set | Accuracy | Status |
| --- | --- | --- | --- | --- | --- |
| Assistant intent (54) | Intent · hwu64 | 54 | 2,570 | 86.3% | production |
| Banking intent | Intent · banking77 | 77 | 1,500 | 86.9% | production |
| General intent (150) | Intent · clinc150 | 150 | 4,500 | 88.8% | production |
| Movie review sentiment | Sentiment · imdb | 2 | 5,000 | 76.2% | production |
| Short sentiment | Sentiment · sst2 | 2 | 872 | 70.5% | production |
| Yelp review sentiment | Sentiment · yelp | 2 | 5,000 | 82.9% | production |
| News topic | Topic · ag_news | 4 | 7,599 | 88.6% | production |
| Question type | Topic · trec | 6 | 500 | 88.2% | production |
| Wikipedia topic | Topic · dbpedia | 14 | 5,000 | 96.9% | production |
| Fine-grained emotion | Emotion · goemotions | 28 | 5,000 | 49.7% | beta |
| Multilingual intent (60) | Intent · massive | 60 | 5,000 | 40.1% | beta |
| Movie critics sentiment | Sentiment · rotten_tomatoes | 2 | 1,067 | 67.7% | beta |
| Poem sentiment | Sentiment · poem | 4 | 104 | 56.7% | beta |
| Yahoo category | Topic · yahoo | 10 | 5,000 | 59.7% | beta |
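
For context on how these accuracies are produced, here is a minimal sketch of held-out evaluation: accuracy is the fraction of test examples whose predicted label matches the gold label. The /classify endpoint, payload shape, and response field below are hypothetical stand-ins, not the documented API.

```python
# Minimal sketch of held-out test-set evaluation. The /classify endpoint,
# payload shape, and "label" response field are assumptions for illustration.
import requests

BASE_URL = "http://localhost:8000"  # assumed

def classify(task: str, text: str) -> str:
    """Return the predicted label for `text` from the given specialist."""
    resp = requests.post(
        f"{BASE_URL}/classify",
        json={"task": task, "text": text},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["label"]

def accuracy(task: str, test_set: list[tuple[str, str]]) -> float:
    """Fraction of (text, gold_label) pairs the specialist gets right."""
    correct = sum(classify(task, text) == gold for text, gold in test_set)
    return correct / len(test_set)

# Usage (with a real held-out split loaded as a list of (text, label) pairs):
# print(accuracy("banking77", test_pairs))
```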

Banking77 head-to-head

Banking77 is the canonical classification benchmark we've optimized for. Competitor numbers are from published sources; the Soto number is from our own test run.

| Model | Params | Size | Banking77 | $/1M | Latency |
| --- | --- | --- | --- | --- | --- |
| Soto V8 (ours) | 7.2M | 27 MB | 86.9% | $0.01 | 8 ms |
| DistilBERT (fine-tuned) | 66M | 265 MB | 90.1% | $0.35 | 18 ms |
| BERT-base (fine-tuned) | 110M | 440 MB | 93.1% | $0.55 | 30 ms |
| Claude Sonnet (few-shot) | cloud | — | ~92% | $225 | 1,200 ms |
| GPT-4o (few-shot) | cloud | — | ~93% | $225 | 1,500 ms |

BERT-base fine-tuned beats us by 6pp on accuracy; we beat it by 15× on disk size, 3× on latency, and 50× on cost. Different products for different deployments.

Coming soon — published benchmarks

  • MTEB (Massive Text Embedding Benchmark) — retrieval, classification, clustering, STS across 56 tasks. Target: top 20 overall after retrieval V2 training.
  • GLUE + SuperGLUE — classic NLU suites. Will report per-task scores and an honest comparison to RoBERTa-base.
  • BEIR — 18 retrieval datasets after MS MARCO + NLI + STS retrieval fine-tune.
  • BLURB (biomedical) and LexGLUE (legal) — after domain-pretrained V8-bio and V8-legal.

Methodology, reproducibility instructions, and more notes in web/docs/BENCHMARKS.md.