90% classification accuracy. ~200 lines of TypeScript. Zero ML libraries. No model weights. No training infrastructure. Just math that's been proven for 270 years.
ScrollOS is an email newsletter OS — it ingests Gmail and Outlook inboxes, organizes high-volume newsletters, and surfaces what actually matters. The core feature is intelligent classification: each newsletter is categorized (Tech, Finance, Fitness, Design, etc.) so users can switch between topics the way you switch between apps.
When I started building this, every classification approach I could find pointed to one of three options: a pre-trained model (GPT or a fine-tuned BERT variant), a third-party classification API, or a rules-based keyword matcher. I didn't want any of them. The model was overkill and added latency to every email. The API added cost at a scale that didn't pencil out ($0.002/email × 1,000 emails/user/month × 10,000 users = $20,000/month). The keyword matcher reached only 60% accuracy in testing.
So I built a Naive Bayes classifier from scratch. Here's every decision, every line of code, and the math behind why it works.
Why Bayes' Theorem Works for Text Classification
The core idea: given the words in an email, what's the probability it belongs to each category? Bayes' theorem gives us exactly this:
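For reference, here is the standard Naive Bayes form of the theorem. The symbols are chosen here for illustration: $c$ is a category and $w_1, \dots, w_n$ are the n-grams extracted from the email. The product on the right is where the independence assumption comes in, and the second line is the log-space version a classifier would typically compute.

```latex
P(c \mid w_1, \dots, w_n)
  = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}
  \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)

\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]
```

The denominator is the same for every category, so it can be dropped when all that matters is which category scores highest.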
The 'naive' independence assumption is almost certainly wrong — words are not independent. 'Machine' and 'learning' are correlated. But it doesn't matter. The classifier doesn't need to be correct about the joint distribution; it just needs to pick the right category. And it does, with remarkable consistency, because even if the absolute probabilities are wrong, the relative ordering between categories is usually right.
N-grams: Why Single Words Aren't Enough
A unigram model treats each word independently. 'deep' and 'learning' each contribute their individual probabilities. A bigram model includes 'deep learning' as a single feature. Trigrams add 'machine learning model', etc.
For newsletter classification, bigrams are significantly better than unigrams. 'Portfolio' alone is ambiguous — could be Design, Finance, or Tech. 'Portfolio management' is Finance. 'Portfolio site' is Design. 'Investment portfolio' is Finance. The bigram disambiguates what the single word cannot.
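A minimal sketch of unigram-plus-bigram extraction, to make the idea concrete. The function names `tokenize` and `extractNgrams` and the punctuation-stripping rule are assumptions for illustration, not the ScrollOS source.

```typescript
// Lowercase, replace anything that is not a letter, digit, or whitespace
// with a space, then split on whitespace. (Assumed normalization rule.)
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

// Emit every n-gram for n = 1..maxN, each joined with a single space.
function extractNgrams(tokens: string[], maxN = 2): string[] {
  const features: string[] = [];
  for (let n = 1; n <= maxN; n++) {
    for (let i = 0; i + n <= tokens.length; i++) {
      features.push(tokens.slice(i, i + n).join(" "));
    }
  }
  return features;
}

const features = extractNgrams(tokenize("Portfolio management tips"));
// ["portfolio", "management", "tips", "portfolio management", "management tips"]
```

Passing `maxN = 3` would add trigrams at the cost of a larger, sparser vocabulary.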
The Classifier Implementation
The classifier needs to do two things: learn from labeled examples (fit), and predict the category of new examples (predict). I store everything in plain TypeScript Maps — no external state, no database queries during classification.
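The shape of such a classifier can be sketched as follows. This is a sketch under assumptions rather than the production code: the class name, field names, and the choice to take pre-extracted n-gram features as input are all hypothetical. It works in log space and applies additive smoothing with a configurable `alpha`.

```typescript
// Hypothetical sketch of a Map-backed multinomial Naive Bayes classifier.
class NaiveBayes {
  private docCounts = new Map<string, number>();                  // category -> training docs
  private featureCounts = new Map<string, Map<string, number>>(); // category -> feature -> count
  private totalFeatures = new Map<string, number>();              // category -> total feature occurrences
  private vocabulary = new Set<string>();                         // every feature seen in training
  private totalDocs = 0;

  constructor(private alpha = 1.0) {}

  // Learn from one labeled example (a list of n-gram features).
  fit(features: string[], category: string): void {
    this.totalDocs++;
    this.docCounts.set(category, (this.docCounts.get(category) ?? 0) + 1);
    let counts = this.featureCounts.get(category);
    if (!counts) {
      counts = new Map<string, number>();
      this.featureCounts.set(category, counts);
    }
    for (const f of features) {
      counts.set(f, (counts.get(f) ?? 0) + 1);
      this.vocabulary.add(f);
    }
    this.totalFeatures.set(category, (this.totalFeatures.get(category) ?? 0) + features.length);
  }

  // Log score per category: log P(c) + sum over features of log P(f | c).
  scores(features: string[]): Map<string, number> {
    const result = new Map<string, number>();
    const vocabSize = this.vocabulary.size;
    this.docCounts.forEach((nDocs, category) => {
      const counts = this.featureCounts.get(category)!;
      const total = this.totalFeatures.get(category) ?? 0;
      let logProb = Math.log(nDocs / this.totalDocs);
      for (const f of features) {
        const c = counts.get(f) ?? 0;
        logProb += Math.log((c + this.alpha) / (total + this.alpha * vocabSize));
      }
      result.set(category, logProb);
    });
    return result;
  }

  // Highest-scoring category, or undefined if nothing has been trained yet.
  predict(features: string[]): string | undefined {
    let best: string | undefined;
    let bestScore = -Infinity;
    this.scores(features).forEach((score, category) => {
      if (score > bestScore) {
        best = category;
        bestScore = score;
      }
    });
    return best;
  }
}

const nb = new NaiveBayes();
nb.fit(["portfolio", "management", "portfolio management"], "Finance");
nb.fit(["portfolio", "site", "portfolio site"], "Design");
nb.predict(["portfolio", "management", "portfolio management"]); // "Finance"
```

Summing logs instead of multiplying probabilities avoids floating-point underflow on long emails, where a product of hundreds of small numbers would round to zero.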
Laplace Smoothing: Why You Need It
Without smoothing, any n-gram that appears in the test email but was never seen in training for a given category produces log(0) = -Infinity. That single unknown word kills the entire classification for that category, regardless of how much evidence points to it.
Laplace smoothing adds a pseudo-count of alpha to every n-gram in every category, as if we'd seen each n-gram alpha extra times (alpha = 1 is the classic add-1 case). This prevents zero probabilities while barely affecting the probability estimates for high-frequency n-grams.
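Alpha = 1.0 works well for newsletters. Alpha controls how strongly the counts are flattened: a smaller value such as 0.1 stays closer to the raw counts and penalizes unseen n-grams harder, which sharpens predictions; a larger value such as 2.0 softens them, which helps when a few rare words swing the result too far. I tested 0.1, 0.5, 1.0, and 2.0, and 1.0 gave the best F1 score on the validation set.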
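To see the effect in numbers, here is a small sketch of the smoothed estimate, (count + alpha) / (total + alpha × vocabulary size). The helper name `smoothedLogProb` and the example counts are made up for illustration.

```typescript
// Smoothed log-probability of one n-gram in one category.
function smoothedLogProb(count: number, total: number, vocabSize: number, alpha = 1.0): number {
  return Math.log((count + alpha) / (total + alpha * vocabSize));
}

// Illustrative numbers: 50,000 n-gram occurrences in the category, 2,000 distinct n-grams overall.
const unsmoothed = Math.log(0 / 50000);              // -Infinity: one unseen n-gram vetoes the category
const unseen = smoothedLogProb(0, 50000, 2000);      // finite: log(1 / 52000)
const frequent = smoothedLogProb(5000, 50000, 2000); // log(5001 / 52000), close to the raw 5000 / 50000
```

The unseen n-gram now costs the category a large but finite penalty, so strong evidence from the rest of the email can still win.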
Training Data Strategy
This is where most implementations fall apart: where do you get labeled training data? I used three sources:
1. Newsletter domain metadata: Substack, Morning Brew, TechCrunch, etc. have predictable categories. Build a domain→category map for the 200 most popular newsletter domains. This covers ~40% of newsletters with 100% accuracy (the domain IS the category).
2. User-confirmed labels: When users move a newsletter into a category folder, that's a training signal. Each user action updates the classifier's training data for that user.
3. Bootstrap seed data: 50 hand-labeled newsletters per category, covering the 8 categories (Tech, Finance, Design, Health, Productivity, News, Sports, Other). This gets you from nothing to working.
The combination of domain heuristics, seed data, and user feedback creates a self-improving system. New users start with the global model; as they use the app, their personal classifier adapts to their specific newsletter mix.
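The domain heuristic and the text classifier fit together as a two-stage lookup, which can be sketched like this. Everything named here is an assumption for illustration: the `DOMAIN_CATEGORIES` entries (one real domain, one placeholder), the `senderDomain` helper, and the `textClassifier` callback standing in for the Bayesian model.

```typescript
type Category =
  | "Tech" | "Finance" | "Design" | "Health"
  | "Productivity" | "News" | "Sports" | "Other";

// Illustrative entries only; a real map would cover the most popular newsletter domains.
const DOMAIN_CATEGORIES = new Map<string, Category>([
  ["techcrunch.com", "Tech"],
  ["finance-weekly.example", "Finance"], // placeholder domain
]);

// Everything after the last "@", lowercased.
function senderDomain(address: string): string {
  const at = address.lastIndexOf("@");
  return (at >= 0 ? address.slice(at + 1) : address).trim().toLowerCase();
}

// Stage 1: exact domain lookup. Stage 2: fall back to the text classifier.
function classify(
  fromAddress: string,
  body: string,
  textClassifier: (body: string) => Category,
): { category: Category; source: "domain" | "classifier" } {
  const mapped = DOMAIN_CATEGORIES.get(senderDomain(fromAddress));
  if (mapped) return { category: mapped, source: "domain" };
  return { category: textClassifier(body), source: "classifier" };
}
```

This sketch only matches exact domains, so a sender on a subdomain such as `mail.techcrunch.com` would fall through to the classifier unless the lookup also strips subdomains.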
Per-User vs Global Classifier
ScrollOS runs one global classifier trained on all users' data plus one per-user classifier that's fine-tuned on their specific inbox. The per-user classifier handles idiosyncratic newsletters ('Fintech CEO digest' is Tech for most users, Finance for a VC).
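How the two classifiers are combined isn't spelled out above, so the following is only one plausible arrangement, not a description of ScrollOS: use the per-user model once it has seen enough user-confirmed labels, and fall back to the global model otherwise. The `Classifier` interface, the `trainedExamples` field, and the threshold of 20 are all hypothetical.

```typescript
// Hypothetical minimal interface shared by the global and per-user models.
interface Classifier {
  trainedExamples: number;
  predict(features: string[]): string;
}

// Assumed cutoff: below this many user-confirmed labels, trust the global model.
const MIN_USER_EXAMPLES = 20;

function pickClassifier(globalModel: Classifier, perUser?: Classifier): Classifier {
  if (perUser && perUser.trainedExamples >= MIN_USER_EXAMPLES) return perUser;
  return globalModel;
}

function classifyForUser(features: string[], globalModel: Classifier, perUser?: Classifier): string {
  return pickClassifier(globalModel, perUser).predict(features);
}
```

A softer alternative would be to add the two models' log scores with a weight that grows as the user confirms more labels.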
Performance in Production
The <2ms latency is the killer feature. A GPT-based classifier would take 200–800ms per email. With 1,000 emails per user inbox sync, that's the difference between a 2-second operation (1,000 × 2ms) and a background job of roughly 3 to 13 minutes (1,000 × 200–800ms). The Bayesian classifier runs synchronously during inbox sync with negligible impact.
Where It Fails (and Why I'm OK With It)
The classifier fails on newsletters that span multiple categories ('The Generalist' covers Tech + Finance + Business), newsletters that are contextually dependent (a newsletter about 'the Fed' is Finance, but 'Fed Up' is Politics), and very short emails with under 20 words (insufficient signal).
For the 10% failure rate: the UI shows a confidence indicator. Low-confidence classifications get a 'Confirm?' prompt that turns into training data when the user corrects it. The 10% becomes a feature — it's how the model keeps improving.
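One way to derive a confidence value from the classifier's per-category log scores is to normalize them with a softmax, as in this sketch. The function names and the 0.7 cutoff are assumptions, not values from ScrollOS.

```typescript
// Turn log scores into probabilities that sum to 1. Subtracting the max
// before exponentiating keeps very negative scores from underflowing to 0.
function toProbabilities(logScores: Map<string, number>): Map<string, number> {
  let max = -Infinity;
  logScores.forEach((s) => { if (s > max) max = s; });
  let sum = 0;
  logScores.forEach((s) => { sum += Math.exp(s - max); });
  const probs = new Map<string, number>();
  logScores.forEach((s, category) => { probs.set(category, Math.exp(s - max) / sum); });
  return probs;
}

// Hypothetical cutoff below which the UI would show the 'Confirm?' prompt.
const CONFIRM_THRESHOLD = 0.7;

function withConfidence(logScores: Map<string, number>): {
  category: string;
  confidence: number;
  needsConfirmation: boolean;
} {
  let category = "";
  let confidence = 0;
  toProbabilities(logScores).forEach((p, c) => {
    if (p > confidence) {
      category = c;
      confidence = p;
    }
  });
  return { category, confidence, needsConfirmation: confidence < CONFIRM_THRESHOLD };
}
```

Because the independence assumption tends to push Naive Bayes probabilities toward 0 or 1 on longer texts, this number is a rough ranking signal rather than a calibrated probability, and the threshold would need tuning against real corrections.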
Naive Bayes isn't the most accurate classifier in the world. It's the most accurate classifier you can build in a weekend, run in 2ms, deploy with no infrastructure, and explain to a non-ML engineer in 10 minutes. In production, those constraints matter more than academic benchmarks.
The Full Picture: What ~200 Lines Buys You
The core classifier — tokenizer, n-gram extractor, fit/predict, serializer — is 180 lines of TypeScript. Add the domain map, the two-stage classification logic, and the training update hook, and you're at about 280 lines total. That's it. No model files to ship, no GPU to provision, no third-party API to depend on, no cost that scales with usage.
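The serializer piece can be small because the model is nothing but nested counts. A sketch, assuming the counts live in a `Map` of `Map`s and are stored as JSON; the `Counts` type and the entry-array layout are illustrative choices, not the ScrollOS format.

```typescript
type Counts = Map<string, Map<string, number>>;            // category -> n-gram -> count
type SerializedCounts = [string, [string, number][]][];    // same data as nested entry arrays

// Maps don't survive JSON.stringify directly, so flatten them to entry arrays first.
function serialize(counts: Counts): string {
  const out: SerializedCounts = [];
  counts.forEach((features, category) => {
    const entries: [string, number][] = [];
    features.forEach((n, feature) => { entries.push([feature, n]); });
    out.push([category, entries]);
  });
  return JSON.stringify(out);
}

// Rebuild the nested Maps from the entry arrays.
function deserialize(json: string): Counts {
  const parsed = JSON.parse(json) as SerializedCounts;
  const counts: Counts = new Map<string, Map<string, number>>();
  for (const [category, entries] of parsed) {
    counts.set(category, new Map<string, number>(entries));
  }
  return counts;
}
```

Entry arrays are used instead of plain objects so that an n-gram such as `constructor` or `__proto__` can't collide with built-in object properties.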
Sometimes the 270-year-old algorithm is the right one.