Byte-level encoder — 7.2M params, no tokenizer
BERT-class classification at MCU scale
Soto is a tiny byte-level text encoder that produces embeddings and classifications comparable to those of much larger models, at a fraction of the cost and with no data leaving your infrastructure.
| Metric | Value | Notes |
|---|---|---|
| Model size | 27 MB | ~7 MB int8 · ~60× smaller than BERT-base |
| Banking77 accuracy | 86.3% | V8 MLP head, 77-class intent |
| Latency | 8 ms | ~190× faster than GPT-4o at 1,500 ms |
| Cost / 1M calls | ~$0.01 | ~22,500× cheaper than GPT-4o |
How Soto compares
Classification workload (Banking77, 77-class intent).
| Model | Size | Latency | Notes |
|---|---|---|---|
| Soto V8 (ours) | 27 MB | 8 ms | one frozen encoder, swap heads per task |
| BERT-base | 440 MB | 30 ms | fine-tuned per task |
| DistilBERT | 265 MB | 18 ms | fine-tuned per task |
| Claude Sonnet | cloud only | 1,200 ms | prompt-based |
| GPT-4o | cloud only | 1,500 ms | prompt-based |
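The "one frozen encoder, swap heads per task" row translates to a pattern like the sketch below. This is a minimal PyTorch illustration, not Soto's published API: the `SotoEncoder` loader in the comments is hypothetical, while the 576-dim pooled embedding and the 77 Banking77 classes come from the sections that follow.

```python
# Sketch of per-task heads over one frozen encoder. TaskHead and the
# commented-out SotoEncoder loader are illustrative assumptions; the
# 576-dim input and 77 Banking77 classes come from the docs.
import torch
import torch.nn as nn

class TaskHead(nn.Module):
    """Small MLP trained per task on frozen 576-dim embeddings."""
    def __init__(self, embed_dim: int = 576, num_classes: int = 77):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.mlp(pooled)

# encoder = SotoEncoder.load_pretrained("soto-v8")  # hypothetical loader
# for p in encoder.parameters():
#     p.requires_grad = False                       # encoder stays frozen

banking_head = TaskHead(num_classes=77)   # Banking77 intents
sentiment_head = TaskHead(num_classes=2)  # second task, same encoder
```

Only the small head is trained per task; the encoder weights are shared and never updated, which is what keeps per-task cost low.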
How it works
1. Send text
POST raw UTF-8 bytes to /v1/embed or /v1/classify. No tokenizer, no vocabulary juggling.
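For example, a minimal client sketch in Python. Only the `/v1/embed` and `/v1/classify` paths and the raw-UTF-8 body come from the step above; the base URL, content-type header, and response field names (`embedding`, `classes`) are placeholder assumptions.

```python
# Minimal client sketch for the byte-level endpoints above.
# Host, header, and JSON field names are assumptions.
import requests

BASE = "https://api.example.com"  # placeholder host

text = "I was charged twice for the same transaction"
payload = text.encode("utf-8")    # raw bytes, no tokenizer step

emb = requests.post(
    f"{BASE}/v1/embed",
    data=payload,
    headers={"Content-Type": "application/octet-stream"},
).json()
print(len(emb["embedding"]))      # expected: 576

cls = requests.post(
    f"{BASE}/v1/classify",
    data=payload,
    headers={"Content-Type": "application/octet-stream"},
).json()
print(cls["classes"][:3])         # top-k intent predictions
```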
2. Get a vector
576-dim pooled embedding (mean + max + std of chunk summaries) or top-k class predictions.
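In NumPy terms, the stated pooling would look like the sketch below. The 192-dim chunk-summary width is an inference from 576 / 3 (three concatenated statistics), and the encoder that produces the chunk summaries is out of scope here.

```python
# Sketch of the stated pooling: concatenate mean, max, and std of the
# per-chunk summary vectors. The 192-dim summary width is inferred
# from 576 / 3; random data stands in for real encoder output.
import numpy as np

def pool(chunk_summaries: np.ndarray) -> np.ndarray:
    """chunk_summaries: (num_chunks, 192) -> pooled (576,)."""
    mean = chunk_summaries.mean(axis=0)
    mx = chunk_summaries.max(axis=0)
    std = chunk_summaries.std(axis=0)
    return np.concatenate([mean, mx, std])  # (3 * 192,) = (576,)

summaries = np.random.randn(10, 192).astype(np.float32)
print(pool(summaries).shape)  # (576,)
```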
3. Deploy anywhere
Same checkpoint runs on a laptop CPU, a $5 MCU (int8), or our hosted API. Your data never trains a shared model.
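As a rough illustration of the fp32-to-int8 size ratio above (27 MB → ~7 MB, about 4×), here is generic PyTorch dynamic quantization over a stand-in model. Soto's actual MCU export path isn't described here, so treat this as a sizing sketch only, not the deployment pipeline.

```python
# Generic int8 dynamic quantization of Linear layers, illustrating the
# ~4x fp32 -> int8 shrink cited above. The model is a stand-in, not
# the real Soto encoder.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(576, 576),
    nn.ReLU(),
    nn.Linear(576, 77),
)
model_int8 = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
torch.save(model_int8.state_dict(), "soto_int8.pt")
```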