# Benchmarks
Every number on this page is measured, not claimed. The live specialist accuracies come directly from the backend's /tasks endpoint — test-set evaluation at startup, not training metrics.
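If you want to pull those numbers yourself, the endpoint can be queried directly. A minimal sketch, assuming the backend is reachable on localhost:8000 and that /tasks returns a JSON list whose entries carry name, accuracy, and status fields (the base URL and field names are assumptions, not the documented schema):

```python
# Hedged sketch: the base URL and the field names ("name", "accuracy",
# "status") are assumptions about the /tasks response shape.
import requests

BACKEND = "http://localhost:8000"  # wherever your backend is running

resp = requests.get(f"{BACKEND}/tasks", timeout=10)
resp.raise_for_status()

for task in resp.json():
    # Each entry should carry the live test-set accuracy measured at startup.
    print(task["name"], task["accuracy"], task["status"])
```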
## Live specialist accuracy
Each row is a Soto V8 specialist encoder (7.2M params, 27 MB fp32) fine-tuned on the task's training set and evaluated on its held-out test set. Numbers refresh on every backend restart.
| Task | Category | Dataset | Classes | Test set | Accuracy | Status |
|---|---|---|---|---|---|---|
| Assistant intent | Intent | hwu64 | 54 | 2,570 | 86.3% | production |
| Banking intent | Intent | banking77 | 77 | 1,500 | 86.9% | production |
| General intent | Intent | clinc150 | 150 | 4,500 | 88.8% | production |
| Movie review sentiment | Sentiment | imdb | 2 | 5,000 | 76.2% | production |
| Short sentiment | Sentiment | sst2 | 2 | 872 | 70.5% | production |
| Yelp review sentiment | Sentiment | yelp | 2 | 5,000 | 82.9% | production |
| News topic | Topic | ag_news | 4 | 7,599 | 88.6% | production |
| Question type | Topic | trec | 6 | 500 | 88.2% | production |
| Wikipedia topic | Topic | dbpedia | 14 | 5,000 | 96.9% | production |
| Fine-grained emotion | Emotion | goemotions | 28 | 5,000 | 49.7% | beta |
| Multilingual intent | Intent | massive | 60 | 5,000 | 40.1% | beta |
| Movie critics sentiment | Sentiment | rotten_tomatoes | 2 | 1,067 | 67.7% | beta |
| Poem sentiment | Sentiment | poem | 4 | 104 | 56.7% | beta |
| Yahoo category | Topic | yahoo | 10 | 5,000 | 59.7% | beta |
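Each accuracy figure is plain top-1 accuracy on the held-out split. A minimal sketch of that computation, assuming a hypothetical `specialist.predict(texts)` wrapper that returns one class index per text (illustrative only, not the actual Soto API) and a Hugging Face dataset with `text`/`label` columns:

```python
from datasets import load_dataset

def test_accuracy(specialist, dataset_name: str, split: str = "test") -> float:
    """Top-1 accuracy of a fine-tuned specialist on a held-out split."""
    ds = load_dataset(dataset_name, split=split)
    preds = specialist.predict(ds["text"])  # hypothetical: one class index per example
    correct = sum(int(p == y) for p, y in zip(preds, ds["label"]))
    return correct / len(ds)

# e.g. test_accuracy(banking_specialist, "banking77")  # table reports 86.9%
```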
## Banking77 head-to-head
Banking77 is the canonical classification benchmark we've optimized for. Competitor numbers come from published sources; the Soto number is from our own test run.
| Model | Params | Size | Banking77 | $/1M | Latency |
|---|---|---|---|---|---|
| Soto V8 (ours) | 7.2M | 27 MB | 86.9% | $0.01 | 8 ms |
| DistilBERT (fine-tuned) | 66M | 265 MB | 90.1% | $0.35 | 18 ms |
| BERT-base (fine-tuned) | 110M | 440 MB | 93.1% | $0.55 | 30 ms |
| Claude Sonnet (few-shot) | — | cloud | ~92% | $225 | 1,200 ms |
| GPT-4o (few-shot) | — | cloud | ~93% | $225 | 1,500 ms |
Fine-tuned BERT-base beats us by about 6 points of accuracy; we beat it by roughly 16× on disk size, nearly 4× on latency, and 55× on cost. It's a different product for different deployments.
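Those multipliers fall straight out of the table; a quick sanity check in Python:

```python
# Ratios from the BERT-base vs. Soto V8 rows of the head-to-head table.
soto = {"size_mb": 27, "latency_ms": 8, "cost_per_1m": 0.01, "accuracy": 0.869}
bert = {"size_mb": 440, "latency_ms": 30, "cost_per_1m": 0.55, "accuracy": 0.931}

print(f"accuracy gap: {100 * (bert['accuracy'] - soto['accuracy']):.1f} pp")  # 6.2 pp
print(f"disk size:    {bert['size_mb'] / soto['size_mb']:.1f}x")              # 16.3x
print(f"latency:      {bert['latency_ms'] / soto['latency_ms']:.2f}x")        # 3.75x
print(f"cost:         {bert['cost_per_1m'] / soto['cost_per_1m']:.0f}x")      # 55x
```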
## Coming soon — published benchmarks
- MTEB (Massive Text Embedding Benchmark) — retrieval, classification, clustering, and STS across 56 tasks. Target: top 20 overall after retrieval V2 training (see the reproduction sketch after this list).
- GLUE + SuperGLUE — classic NLU suites. We'll report per-task scores and an honest comparison to RoBERTa-base.
- BEIR — 18 retrieval datasets, after the MS MARCO + NLI + STS retrieval fine-tune.
- BLURB (biomedical) and LexGLUE (legal) — after domain-pretrained V8-bio and V8-legal.
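Once the MTEB run lands, it should be reproducible with the open-source `mteb` harness. A minimal sketch, assuming a `SotoEncoder` wrapper (a hypothetical name, not the real API) that exposes the `encode()` method mteb expects; the random vectors are a stand-in for the real forward pass:

```python
import numpy as np
from mteb import MTEB

class SotoEncoder:
    """Hypothetical adapter around the Soto V8 encoder -- not the real API."""
    def encode(self, sentences, **kwargs):
        # Replace with the actual forward pass; random vectors keep the sketch runnable.
        return np.random.rand(len(sentences), 256)

evaluation = MTEB(tasks=["Banking77Classification"])  # one of the MTEB tasks
evaluation.run(SotoEncoder(), output_folder="results/soto-v8")
```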
Methodology, reproducibility instructions, and further notes are in web/docs/BENCHMARKS.md.