Byte-level encoder — 7.2M params, no tokenizer
BERT-class classification at MCU scale
Soto is a tiny byte-level text encoder that produces embeddings and classifications comparable to those of much larger models, at a fraction of the cost and with no data leaving your infrastructure.
| Metric | Value | Notes |
|---|---|---|
| Model size | 27 MB | ~7 MB int8 · ~60× smaller than BERT-base |
| Banking77 accuracy | 86.3% | V8 MLP head, 77-class intent |
| Latency | 8 ms | ~190× faster than GPT-4o at 1,500 ms |
| Cost / 1M calls | ~$0.01 | ~22,500× cheaper than GPT-4o |
How Soto compares
Classification workload (Banking77, 77-class intent).
| Model | Size | Latency | Notes |
|---|---|---|---|
| Soto V8 (ours) | 27 MB | 8 ms | one frozen encoder, swap heads per task |
| BERT-base | 440 MB | 30 ms | fine-tuned per task |
| DistilBERT | 265 MB | 18 ms | fine-tuned per task |
| Claude Sonnet | cloud only | 1,200 ms | prompt-based |
| GPT-4o | cloud only | 1,500 ms | prompt-based |
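The "one frozen encoder, swap heads per task" row translates to a pattern like the sketch below. This is a minimal PyTorch illustration, not Soto's published API: the `SotoEncoder` loader in the comments is hypothetical, while the 576-dim pooled embedding and the 77 Banking77 classes come from the sections that follow.

```python
# Sketch of per-task heads over one frozen encoder. TaskHead and the
# commented-out SotoEncoder loader are illustrative assumptions; the
# 576-dim input and 77 Banking77 classes come from the docs.
import torch
import torch.nn as nn

class TaskHead(nn.Module):
    """Small MLP trained per task on frozen 576-dim embeddings."""
    def __init__(self, embed_dim: int = 576, num_classes: int = 77):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.mlp(pooled)

# encoder = SotoEncoder.load_pretrained("soto-v8")  # hypothetical loader
# for p in encoder.parameters():
#     p.requires_grad = False                       # encoder stays frozen

banking_head = TaskHead(num_classes=77)   # Banking77 intents
sentiment_head = TaskHead(num_classes=2)  # second task, same encoder
```

Only the small head is trained per task; the encoder weights are shared and never updated, which is what keeps per-task cost low.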
How it works
1. Send text
POST raw UTF-8 bytes to /v1/embed or /v1/classify. No tokenizer, no vocabulary juggling.
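For example, a minimal client sketch in Python. Only the `/v1/embed` and `/v1/classify` paths and the raw-UTF-8 body come from the step above; the base URL, content-type header, and response field names (`embedding`, `classes`) are placeholder assumptions.

```python
# Minimal client sketch for the byte-level endpoints above.
# Host, header, and JSON field names are assumptions.
import requests

BASE = "https://api.example.com"  # placeholder host

text = "I was charged twice for the same transaction"
payload = text.encode("utf-8")    # raw bytes, no tokenizer step

emb = requests.post(
    f"{BASE}/v1/embed",
    data=payload,
    headers={"Content-Type": "application/octet-stream"},
).json()
print(len(emb["embedding"]))      # expected: 576

cls = requests.post(
    f"{BASE}/v1/classify",
    data=payload,
    headers={"Content-Type": "application/octet-stream"},
).json()
print(cls["classes"][:3])         # top-k intent predictions
```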
2. Get a vector
576-dim pooled embedding (mean + max + std of chunk summaries) or top-k class predictions.
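In NumPy terms, the stated pooling would look like the sketch below. The 192-dim chunk-summary width is an inference from 576 / 3 (three concatenated statistics), and the encoder that produces the chunk summaries is out of scope here.

```python
# Sketch of the stated pooling: concatenate mean, max, and std of the
# per-chunk summary vectors. The 192-dim summary width is inferred
# from 576 / 3; random data stands in for real encoder output.
import numpy as np

def pool(chunk_summaries: np.ndarray) -> np.ndarray:
    """chunk_summaries: (num_chunks, 192) -> pooled (576,)."""
    mean = chunk_summaries.mean(axis=0)
    mx = chunk_summaries.max(axis=0)
    std = chunk_summaries.std(axis=0)
    return np.concatenate([mean, mx, std])  # (3 * 192,) = (576,)

summaries = np.random.randn(10, 192).astype(np.float32)
print(pool(summaries).shape)  # (576,)
```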
3. Deploy anywhere
Same checkpoint runs on a laptop CPU, a $5 MCU (int8), or our hosted API. Your data never trains a shared model.
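As a rough illustration of the fp32-to-int8 size ratio above (27 MB → ~7 MB, about 4×), here is generic PyTorch dynamic quantization over a stand-in model. Soto's actual MCU export path isn't described here, so treat this as a sizing sketch only, not the deployment pipeline.

```python
# Generic int8 dynamic quantization of Linear layers, illustrating the
# ~4x fp32 -> int8 shrink cited above. The model is a stand-in, not
# the real Soto encoder.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(576, 576),
    nn.ReLU(),
    nn.Linear(576, 77),
)
model_int8 = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
torch.save(model_int8.state_dict(), "soto_int8.pt")
```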