Notes · Build Logs
Reinforcement learning, MLOps, and quantitative finance — grounded in a four-year solo build of a SAC pair-trading system. Every post links to runnable code and real benchmark numbers.
The follow-up: I promised to wire the tail-risk estimate into live sizing. Here's the opt-in, one-line patch that does it.
QR-DQN gave me a CVaR number. A number is not a strategy. This post wires that CVaR into the actual PortfolioSimulator path via a four-tier scaler (target / scale / floor / veto), adds a mixin so every RL signal class inherits the capability, and keeps the change fully opt-in — existing backtests stay bit-for-bit identical unless you drop a QR-DQN checkpoint into the cache.
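As a rough illustration of the four-tier idea, here is a minimal sketch of a CVaR-to-position-size mapping. The function name and threshold values are hypothetical, not the post's actual config:

```python
def cvar_position_scale(cvar, target=-0.01, floor=-0.03, veto=-0.05):
    """Map a CVaR estimate to a position multiplier via four tiers.

    Illustrative thresholds only (CVaR as a fractional return):
    - target: CVaR better than this -> full size
    - scale:  between target and floor -> linear ramp down
    - floor:  between floor and veto -> minimum size
    - veto:   worse than veto -> no trade
    """
    if cvar >= target:
        return 1.0                                   # tier 1: full size
    if cvar >= floor:
        # tier 2: ramp linearly from full size down to the floor size
        return 0.25 + 0.75 * (cvar - floor) / (target - floor)
    if cvar >= veto:
        return 0.25                                  # tier 3: floor size
    return 0.0                                       # tier 4: veto
```

The veto tier is what makes the scaler more than cosmetic: a sufficiently bad tail estimate zeroes the trade entirely rather than merely shrinking it.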
Why expected Q-values aren't enough, and what 51 quantiles get you that SAC can't.
SAC and PPO learn only the expected return — they cannot distinguish a stable +0.2 from a +0.2 mean with a fat negative tail. QR-DQN learns the full return distribution per action, so CVaR / Expected Shortfall fall out for free. I implemented it in my pair-trading stack and ran a three-way SAC vs PPO vs QR-DQN benchmark with real CVaR columns.
Prioritized Experience Replay in 210 lines of numpy. No dependencies, O(log N) sampling, full integration with twin-critic SAC.
Uniform replay is wasteful: rare but important transitions (like regime breaks) take forever to re-surface. Schaul et al.'s Prioritized Experience Replay (ICLR 2016) fixes that with an O(log N) SumTree sampler and importance-sampling weights that correct the sampling bias. This post walks through the implementation, the SumTree invariants, the twin-critic TD-error aggregation, and the one-line switch in SACTrainer.
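The core of the O(log N) claim is the SumTree. A stripped-down sketch of the two operations that matter (class and method names are illustrative, not the post's exact code):

```python
import numpy as np

class SumTree:
    """Binary tree over priorities: leaves hold p_i, each internal node
    holds the sum of its children, so the root is the total priority and
    proportional sampling is an O(log N) descent from the root."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes + leaves

    def update(self, idx, priority):
        i = idx + self.capacity - 1              # leaf position
        delta = priority - self.tree[i]
        self.tree[i] += delta
        while i != 0:                            # propagate delta to root
            i = (i - 1) // 2
            self.tree[i] += delta

    def sample(self, s):
        """Find the leaf whose cumulative-priority interval contains s."""
        i = 0
        while 2 * i + 1 < len(self.tree):        # descend until a leaf
            left = 2 * i + 1
            if s <= self.tree[left]:
                i = left
            else:
                s -= self.tree[left]
                i = left + 1
        return i - (self.capacity - 1)           # back to buffer index
```

The invariant the post refers to — every internal node equals the sum of its children — is what `update` maintains with a single root-ward pass.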
Why 'we chose SAC' is a better answer than 'we used SAC.' Building a 3-seed comparison that survives an interview.
I've been running SAC for four years. When asked why, the only honest answer was 'it worked first.' That answer fails an interview. This post documents the afternoon I built a proper PPO from scratch (Clipped Objective + GAE(λ) + KL early-stop + Tanh-squashed Gaussian), wired it into the same env SAC uses, and ran a 3-seed comparison. The result isn't just numbers — it's a defensible design choice.
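Of the four PPO ingredients listed, the clipped objective is the one that fits in a few lines. A sketch of the surrogate loss (written as a loss to minimize; numpy stands in for torch to keep it self-contained):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: -E[min(r * A, clip(r, 1-eps, 1+eps) * A)].

    ratio = pi_new(a|s) / pi_old(a|s). Clipping removes the incentive to
    push the ratio outside [1-eps, 1+eps], so one batch of advantages
    can't drag the policy arbitrarily far from the data-collecting one.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()
```

The `min` is the asymmetry worth internalizing: the objective is pessimistic in both directions, which is why PPO tolerates several epochs of reuse per batch where vanilla policy gradient cannot.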
An opt-in tracker that's a no-op without config, a Postgres-backed deployment, and a measured CPU latency win from 10 lines of actor changes.
Two upgrades, same afternoon. MLflow tracking wrapped as a context manager with a graceful no-op when MLFLOW_TRACKING_URI is unset — keeps the core trainer free of hard dependencies. torch.jit.script on the SAC actor after three small compatibility edits — a measured 2.13× speedup at batch size 1 on CPU, falling to 1.08× at batch size 128. The MLflow server runs via docker-compose with Postgres so runs persist across deploys.
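The no-op-without-config pattern is small enough to sketch in full. This is an illustrative version (the name `maybe_track` and the yielded interface are mine, not necessarily the post's):

```python
import os
from contextlib import contextmanager

@contextmanager
def maybe_track(run_name):
    """Yield a metric logger that forwards to MLflow when
    MLFLOW_TRACKING_URI is set, and silently no-ops otherwise.
    mlflow is imported lazily, so it is never a hard dependency."""
    if os.environ.get("MLFLOW_TRACKING_URI"):
        import mlflow                            # only imported when configured
        with mlflow.start_run(run_name=run_name):
            yield mlflow.log_metric
    else:
        yield lambda *args, **kwargs: None       # no-op logger
```

The training loop calls `log("loss", value)` unconditionally; whether anything happens is purely an environment-variable concern, which is what keeps the trainer import-clean.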
A tiny serving layer for RL agents that picks up new training runs without a restart. No watchers, no pub/sub, no Kubernetes.
The training loop drops a new `.pt` into cache/models/rl/stat_pair/ every few hours. The serving process should pick it up — without a restart, without a file watcher, without race conditions. I built a 270-line FastAPI router that does this with one `os.stat()` call per request. Seven end-to-end tests cover cold start, auto-swap, forced reload, and shape validation.
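The one-`os.stat()`-per-request idea reduces to a small mtime-gated cache. A minimal sketch assuming the loader is injected (class and method names are illustrative; the real router adds shape validation and a forced-reload path):

```python
import os
import threading

class ModelCache:
    """Reload-on-change: one os.stat() per request, swapping the model
    only when the file's mtime moves. The lock plus re-check prevents
    two concurrent requests from loading the same checkpoint twice."""

    def __init__(self, path, loader):
        self.path, self.loader = path, loader
        self.mtime, self.model = None, None
        self.lock = threading.Lock()

    def get(self):
        mtime = os.stat(self.path).st_mtime_ns   # the only syscall per hit
        if mtime != self.mtime:
            with self.lock:
                if mtime != self.mtime:          # double-check under the lock
                    self.model = self.loader(self.path)
                    self.mtime = mtime
        return self.model
```

On the happy path (checkpoint unchanged) this is one stat call and a comparison, which is why no watcher thread or pub/sub channel is needed.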