Soto
Byte-level encoder — 7.2M params, no tokenizer

BERT-class classification at MCU scale

Soto is a tiny byte-level text encoder that produces embeddings and classifications comparable to those of much larger models, at a fraction of the cost and with no data leaving your infrastructure.

Model size: 27 MB fp32; ~7 MB int8 (~60× smaller than BERT-base)
Banking77 accuracy: 86.3% (V8 MLP head, 77-class intent)
Latency: 8 ms (~150× faster than Claude Sonnet at 1,200 ms, ~190× faster than GPT-4o at 1,500 ms)
Cost / 1M calls: ~$0.01 (~22,500× cheaper than GPT-4o)

How Soto compares

Classification workload (Banking77, 77-class intent).

| Model | Size | Latency | Notes |
| --- | --- | --- | --- |
| Soto V8 (ours) | 27 MB | 8 ms | one frozen encoder, swap heads per task |
| BERT-base | 440 MB | 30 ms | fine-tuned per task |
| DistilBERT | 265 MB | 18 ms | fine-tuned per task |
| Claude Sonnet | cloud only | 1,200 ms | prompt-based |
| GPT-4o | cloud only | 1,500 ms | prompt-based |
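
"Swap heads per task" means the 7.2M-param encoder stays frozen and only a small classification head is trained per task on its 576-dim embeddings. A minimal sketch of such a head in Python: the 576-dim input and the 77 Banking77 classes come from the numbers above, while the hidden width and dropout are illustrative guesses, since this page only names a "V8 MLP head".

```python
import torch
import torch.nn as nn

class IntentHead(nn.Module):
    """Per-task MLP head trained on frozen 576-dim Soto embeddings.

    Only the 576-dim input and the 77 Banking77 classes are from the
    docs; the hidden width (256) and dropout rate are illustrative.
    """

    def __init__(self, embed_dim: int = 576, num_classes: int = 77):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # The encoder is frozen, so gradients flow only through this head.
        return self.net(embeddings)

head = IntentHead()
logits = head(torch.randn(8, 576))  # a batch of 8 pooled embeddings
print(logits.shape)                 # torch.Size([8, 77])
```

Switching tasks then means training a new head while the encoder weights, and the embeddings it produces, stay fixed.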

How it works

1. Send text

POST raw UTF-8 bytes to /v1/embed or /v1/classify. No tokenizer, no vocabulary juggling.
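
A minimal request sketch in Python, assuming a hosted endpoint with bearer-token auth and a JSON response. Only the /v1/embed and /v1/classify paths and the raw-UTF-8 body come from this page; the base URL, key, and content type are placeholders.

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder, not the real host
API_KEY = "YOUR_API_KEY"              # placeholder credential

resp = requests.post(
    f"{BASE_URL}/v1/classify",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        # Raw UTF-8 bytes in the body, no tokenizer and no JSON wrapping
        # (assumed content type; the page only says "raw UTF-8 bytes").
        "Content-Type": "application/octet-stream",
    },
    data="I want to close my savings account".encode("utf-8"),
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # assumed to contain top-k intent predictions
```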

2. Get a vector

A 576-dim pooled embedding (concatenated mean, max, and std of chunk summaries) from /v1/embed, or top-k class predictions from /v1/classify.
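
Those three statistics imply 192-dim chunk summaries (3 × 192 = 576). A sketch of the pooling step with NumPy, assuming the encoder has already produced one summary vector per chunk:

```python
import numpy as np

def pool_chunks(chunk_summaries: np.ndarray) -> np.ndarray:
    """Concatenate mean, max, and std over per-chunk summary vectors.

    chunk_summaries: (num_chunks, 192) array; the 192-dim size is
    inferred from the 576-dim output (3 stats x 192 dims).
    """
    mean = chunk_summaries.mean(axis=0)
    mx = chunk_summaries.max(axis=0)
    std = chunk_summaries.std(axis=0)
    return np.concatenate([mean, mx, std])  # shape: (576,)

chunks = np.random.randn(4, 192).astype(np.float32)  # stand-in encoder output
print(pool_chunks(chunks).shape)  # (576,)
```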

3. Deploy anywhere

Same checkpoint runs on a laptop CPU, a $5 MCU (int8), or our hosted API. Your data never trains a shared model.
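
The page doesn't name a quantization toolchain, but post-training int8 quantization is the standard route from the 27 MB fp32 checkpoint to a ~7 MB artifact. A sketch using ONNX Runtime's dynamic quantization; the file names are hypothetical, and a $5 MCU would additionally need an int8-capable runtime on the device.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Hypothetical file names; the page does not specify an export format.
quantize_dynamic(
    model_input="soto_v8.onnx",        # fp32 export of the checkpoint
    model_output="soto_v8_int8.onnx",  # int8 weights: ~27 MB -> ~7 MB
    weight_type=QuantType.QInt8,
)
```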