Beyond Static Thresholds: Real-Time Anomaly Detection with Streaming ML
Static alerts are where reliability goes to die. Learn how to implement online learning models using River and Bytewax to detect infrastructure and business anomalies in sub-100ms windows.

The $40,000 Alert That Never Fired
Your P99 latency just spiked to 4 seconds, but your static Prometheus thresholds didn't fire because the throughput dropped by 80% at the same time. By the time the on-call engineer woke up to a manual Slack complaint, you'd lost $40,000 in transactions. If this sounds familiar, it's because our industry is still obsessed with 2015-era monitoring logic in a 2026 world.
Traditional anomaly detection relies on 'batch-and-check.' You train a model on last week's data, wrap it in a Flask API, and poll it. In a high-velocity environment, that model is obsolete the moment it's deployed. We need systems that learn from every single event as it passes through the pipe.
Why Streaming ML is Non-Negotiable in 2026
In the current landscape, the 'Data Warehouse' is becoming the 'Data Archive.' Real-time operations have shifted to the stream. Whether you are using Redpanda, Kafka, or WarpStream, the goal is to move the intelligence closer to the producer.
Streaming ML (or Online Learning) differs from batch ML in three fundamental ways:
- Incremental Updates: Models update their internal weights per-event. No expensive retraining jobs.
- Zero-Lag Inference: Prediction happens on the wire, not via a round-trip to a database.
- Concept Drift Handling: These models are designed to weight recent data more heavily, naturally adapting to shifting traffic patterns without manual intervention.
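As a toy illustration of the first and third points (the latency values are made up, and River's default fading factor is used), compare a plain running mean with an exponentially weighted one when the traffic pattern shifts mid-stream; nothing is ever retrained in batch:

from river import stats

plain_mean = stats.Mean()     # weights the entire history equally
recent_mean = stats.EWMean()  # exponentially discounts older observations

# Latency samples (ms); the regime shifts partway through the stream.
for latency_ms in [120, 118, 125, 122, 900, 880, 910]:
    plain_mean.update(latency_ms)
    recent_mean.update(latency_ms)

print(plain_mean.get())   # still anchored to the old regime
print(recent_mean.get())  # already tracking the new one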
The Stack: River, Bytewax, and Redpanda
To build this, I use River (the successor to creme) for the online algorithms and Bytewax 0.22 for the stream processing framework. Bytewax is built on a Rust core (Timely Dataflow), which gives us the performance of Flink but with a Pythonic developer experience that actually allows for rapid iteration.
Component 1: The Online Model
For anomaly detection, I swear by Half-Space Trees (HST). Unlike Isolation Forests, which require a full dataset to build trees, HST builds a set of random trees and updates the mass of each node incrementally.
from river import anomaly
from river import compose
from river import preprocessing
# We define a pipeline that scales features and applies HST.
# n_trees=10 and height=8 is a sweet spot for sub-ms inference.
model = compose.Pipeline(
    preprocessing.StandardScaler(),
    anomaly.HalfSpaceTrees(
        n_trees=10,
        height=8,
        window_size=250,
        seed=42,
    ),
)

# Example of online learning: score first, then learn from the same event.
def process_event(event):
    score = model.score_one(event)
    model.learn_one(event)
    return score
Component 2: The Dataflow
Bytewax allows us to map this model over a distributed stream. Here is a production-ready snippet that connects to a Redpanda topic, processes metrics, and sinks anomalies to a dedicated 'alerts' topic.
import json

import bytewax.operators as op
from bytewax.dataflow import Dataflow
from bytewax.connectors.kafka import KafkaSink, KafkaSinkMessage, KafkaSource
from river import anomaly, compose, preprocessing

flow = Dataflow("anomaly-detection")

# 1. Source: connect to the Redpanda metrics topic and deserialize the JSON
# payloads (assumed to look like {"name": "...", "value": ...}).
stream = op.input("redpanda-input", flow, KafkaSource(["localhost:9092"], ["system_metrics"]))
events = op.map("deserialize", stream, lambda msg: json.loads(msg.value))

# 2. State management: we need a model per metric type, so key by metric name.
keyed = op.key_on("key-by-metric", events, lambda event: event["name"])

def update_and_score(model_state, event):
    # Build a fresh pipeline the first time we see this metric.
    if model_state is None:
        model_state = compose.Pipeline(
            preprocessing.StandardScaler(),
            anomaly.HalfSpaceTrees(window_size=100),
        )
    # Extract numeric values for the model
    features = {"value": event["value"]}
    score = model_state.score_one(features)
    model_state.learn_one(features)
    return model_state, {"metric": event["name"], "score": score}

# op.stateful_map keeps the model persistent across the stream for each key.
scored_stream = op.stateful_map("detect", keyed, update_and_score)

# 3. Filter: only pass high-confidence anomalies (score > 0.8). Items here
# are (key, value) tuples because the stream is keyed.
alerts = op.filter("threshold", scored_stream, lambda kv: kv[1]["score"] > 0.8)

# 4. Sink: serialize and send to a dedicated alerting topic.
messages = op.map("serialize", alerts,
                  lambda kv: KafkaSinkMessage(key=kv[0].encode(), value=json.dumps(kv[1]).encode()))
op.output("kafka-out", messages, KafkaSink(["localhost:9092"], "anomaly-alerts"))
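If you save the snippet above as dataflow.py (a name chosen here for illustration) and point the brokers at a local dev Redpanda instance, Bytewax's standard entrypoint is enough to run it:

python -m bytewax.run dataflow:flow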
The Gotchas: What the Docs Don't Tell You
I’ve deployed this at scale (150k events per second), and it isn't all sunshine. Here is where you'll likely trip up:
- The Feedback Loop of Doom: If your model detects an anomaly and your auto-scaler reacts by spinning up 50 nodes, the resulting traffic shift is itself an anomaly. Your model will flag the fix as a failure. You must include 'system action' metadata in your features so the model knows a change was intentional.
- State Serialization: Bytewax checkpoints state for fault tolerance. If your River model object grows too large (e.g., too many trees), your checkpointing latency will kill your throughput. Keep your HST height and window sizes lean.
- The Cold Start Problem: For the first 1,000 events, the model is 'guessing.' I usually implement a 'burn-in' period where scores are logged but not acted upon. Use a simple counter in your stateful map to suppress alerts during the first N events (see the sketch after this list).
- Feature Normalization: Online models are hyper-sensitive to scale. If you don't use preprocessing.StandardScaler() (which also updates its mean/variance incrementally), a single outlier in the first 5 minutes will ruin your scoring for the next hour.
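Here is a minimal sketch of how that burn-in counter can live alongside the model in the stateful_map state; it also folds in the 'system action' flag from the first gotcha. The event field name (scaling_action) and the 1,000-event burn-in length are illustrative assumptions, not anything River or Bytewax prescribes.

from river import anomaly, compose, preprocessing

BURN_IN_EVENTS = 1_000  # assumed burn-in length; tune per metric

def update_and_score(state, event):
    # State holds both the online model and an event counter for this key.
    if state is None:
        state = {
            "model": compose.Pipeline(
                preprocessing.StandardScaler(),
                anomaly.HalfSpaceTrees(window_size=100),
            ),
            "seen": 0,
        }
    # Mark intentional changes (e.g. an auto-scaler action) as a feature so the
    # model doesn't flag the resulting traffic shift as a failure.
    features = {
        "value": event["value"],
        "scaling_action": float(event.get("scaling_action", 0)),
    }
    score = state["model"].score_one(features)
    state["model"].learn_one(features)
    state["seen"] += 1
    # During burn-in, log the raw score but report 0.0 so nothing alerts on it.
    reported = score if state["seen"] > BURN_IN_EVENTS else 0.0
    return state, {"metric": event["name"], "score": reported, "raw_score": score}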
Evaluation in the Wild
You can't use standard cross-validation here. Instead, use Interleaved Test-Then-Train. For every event, you predict the anomaly score before you allow the model to learn from that event. This provides a rigorous, unbiased metric of how the model is performing in real-time.
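A minimal sketch of that loop, assuming you have a small labeled replay of past events (y = 1 for a known incident) to score against; ROCAUC is just one reasonable choice of metric here:

from river import anomaly, compose, metrics, preprocessing

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    anomaly.HalfSpaceTrees(n_trees=10, height=8, window_size=250, seed=42),
)
auc = metrics.ROCAUC()

# Hypothetical labeled replay: (features, label) pairs from a past incident.
labeled_events = [
    ({"value": 120.0}, 0),
    ({"value": 118.0}, 0),
    ({"value": 123.0}, 0),
    ({"value": 950.0}, 1),  # the incident
    ({"value": 121.0}, 0),
]

for features, y in labeled_events:
    score = model.score_one(features)  # predict BEFORE the model sees the event
    auc.update(y, score)
    model.learn_one(features)          # only now is the event used for training

print(auc)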
Takeaway
Stop building batch pipelines for operational data. Today, take one high-cardinality metric (like per-customer API latency), pipe it into a local Bytewax script using the River HalfSpaceTrees model, and run it in 'shadow mode' alongside your existing alerts. You’ll find the anomalies your static thresholds have been missing for months.