Notes · Build Logs
Reinforcement learning, MLOps, quantitative finance — backed by code and real numbers.
8-node StateGraph with HyDE, cross-encoder reranking, RL bandit retriever selection, Self-RAG loops, and a supervisor-routed multi-agent layer. The full architecture behind ChatBout AI.
A walkthrough of the ChatBout AI backend: an 8-node LangGraph RAG pipeline (classify, query_transform, retrieve, rerank, grade, generate, hallucination_check) plus a supervisor-pattern multi-agent system that chains RAG, Code, Analysis, and Chitchat agents. Covers HyDE query expansion, cross-encoder reranking (FAISS top-20 to ms-marco-MiniLM top-4), RL Thompson Sampling for retriever auto-selection, Self-RAG grounding loops, RAGAS evaluation, and 11 FastAPI endpoints deployed with Docker.
When more data hurts performance, the answer is not more data — it is smarter grouping, regime detection, and a bandit that learns which groups to trust.
13 rounds of QR-DQN experiments hit a wall: the best model (R6, +5.857) used only 44 envs, and scaling to 449 symbols made things worse. The fix was an ensemble that splits 450 symbols into 19 GICS sector groups, runs HMM regime detection per group, dispatches to regime-conditioned strategies (momentum / QR-DQN / defensive shorts), and uses Thompson Sampling to learn which groups produce profitable signals. 275-day out-of-sample (OOS) backtest: Sharpe 1.97, +20.80% return, 4.93% maximum drawdown (MDD).
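The "learn which groups to trust" step can be illustrated with a Beta-Bernoulli Thompson Sampling bandit. This is a minimal sketch under stated assumptions, not the ensemble's actual code: the `GroupBandit` class, the group names, and the binary "was the signal profitable" reward are all hypothetical.

```python
import random


class GroupBandit:
    """Beta-Bernoulli Thompson Sampling over named groups (illustrative only)."""

    def __init__(self, groups, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior: no opinion about any group before seeing outcomes.
        self.alpha = {g: 1.0 for g in groups}
        self.beta = {g: 1.0 for g in groups}

    def select(self):
        # Draw one plausible win rate per group and trust the highest draw.
        draws = {
            g: self.rng.betavariate(self.alpha[g], self.beta[g])
            for g in self.alpha
        }
        return max(draws, key=draws.get)

    def update(self, group, profitable):
        # A profitable signal counts as a success, anything else as a failure.
        if profitable:
            self.alpha[group] += 1.0
        else:
            self.beta[group] += 1.0

    def mean(self, group):
        # Posterior mean win rate for the group.
        return self.alpha[group] / (self.alpha[group] + self.beta[group])


# Hypothetical group names, used only to show the call pattern.
bandit = GroupBandit(["tech", "energy", "utilities"])
chosen = bandit.select()
bandit.update(chosen, profitable=True)
```

Because each selection samples from the posterior rather than taking the posterior mean, groups with little evidence still get tried occasionally, while groups with a long record of losses are picked less and less often.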
Cover extraction, 6-product API automation, and why I chose Gumroad over LemonSqueezy.
A step-by-step account of building two technical ebooks, automating 6 Gumroad products (two books × three languages) with Node.js scripts, extracting cover images from PDF first pages with pdftoppm, hosting samples on GitHub Pages, and why LemonSqueezy's API limitations pushed me to Gumroad.
The follow-up: I promised to wire the tail-risk estimate into the sizing layer. Here's the opt-in, one-line patch that does it.
QR-DQN gave me a CVaR number. A number is not a strategy. This post wires that CVaR into the actual PortfolioSimulator path via a four-tier scaler (target / scale / floor / veto), adds a mixin so every RL signal class inherits the capability, and keeps the change fully opt-in — existing backtests stay bit-for-bit identical unless you drop a QR-DQN checkpoint into the cache.
Why expected Q-values aren't enough, and what 51 quantiles get you that SAC can't.
SAC and PPO learn only the expected return, so they cannot distinguish a stable +0.2 from a +0.2 mean with a fat negative tail. QR-DQN learns the full return distribution per action, so CVaR (also called Expected Shortfall) falls out for free. I implemented it in my pair-trading stack and ran a 3-way SAC vs PPO vs QR-DQN benchmark with real CVaR columns.
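To make the "stable +0.2 versus fat-tailed +0.2" point concrete, here is a small sketch of how a lower-tail CVaR can be read off a set of quantile estimates. It assumes the quantiles are equally weighted (as with the quantile midpoints QR-DQN typically uses); the function name `cvar_from_quantiles` and the example numbers are made up for illustration and are not taken from the benchmark.

```python
import math


def cvar_from_quantiles(quantiles, alpha=0.05):
    """Approximate lower-tail CVaR as the mean of the worst alpha fraction
    of equally weighted quantile estimates."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(quantiles)
    # Always keep at least one quantile in the tail.
    k = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:k]
    return sum(tail) / k


# Two hypothetical 51-quantile return distributions with the same mean (0.2).
stable = [0.2] * 51
fat_tail = [0.35] * 45 + [-0.925] * 6

mean_stable = sum(stable) / len(stable)
mean_fat = sum(fat_tail) / len(fat_tail)

# An expectation-only critic sees the two means as equal; the tail does not.
cvar_stable = cvar_from_quantiles(stable, alpha=0.1)   # close to 0.2
cvar_fat = cvar_from_quantiles(fat_tail, alpha=0.1)    # close to -0.925
```

The two distributions are indistinguishable by their mean, but the 10% CVaR separates them immediately, which is the extra information a distributional critic exposes.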
Prioritized Experience Replay in 210 lines of numpy. No dependencies, O(log N) sampling, full integration with twin-critic SAC.
Uniform replay is wasteful: rare but important transitions (like regime breaks) take forever to resurface. Prioritized Experience Replay (Schaul et al., 2016) fixes that with an O(log N) SumTree sampler and importance-sampling weights that correct the bias introduced by non-uniform sampling. This post walks through the 170-line implementation, the SumTree invariants, the twin-critic TD-error aggregation, and the one-line switch in SACTrainer.
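For readers who have not seen a SumTree before, the sketch below shows the core idea in plain Python: every internal node stores the sum of its children, so both a priority update and a proportional lookup touch only O(log N) nodes. This is not the post's numpy implementation; the class layout, the `sample` helper, and the default `beta` are assumptions, and the sampler assumes every slot already holds a positive priority.

```python
import math
import random


class SumTree:
    """Array-backed binary tree; node i has children 2i and 2i + 1."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Index 0 is unused, the root is 1, leaves live at [capacity, 2*capacity).
        self.tree = [0.0] * (2 * capacity)

    def update(self, index, priority):
        pos = index + self.capacity
        self.tree[pos] = priority
        pos //= 2
        while pos >= 1:
            # Invariant: each internal node equals the sum of its two children.
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def total(self):
        return self.tree[1]

    def priority(self, index):
        return self.tree[index + self.capacity]

    def find(self, mass):
        """Return the leaf index whose cumulative priority range contains mass."""
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if mass < self.tree[left]:
                pos = left
            else:
                mass -= self.tree[left]
                pos = left + 1
        return pos - self.capacity


def sample(tree, batch_size, beta=0.4, rng=random):
    """Stratified proportional sampling with normalised importance weights."""
    total = tree.total()
    segment = total / batch_size
    indices, weights = [], []
    for i in range(batch_size):
        mass = rng.uniform(segment * i, segment * (i + 1))
        # Keep mass strictly below the total so the descent stays in range.
        mass = min(mass, math.nextafter(total, 0.0))
        idx = tree.find(mass)
        prob = tree.priority(idx) / total
        indices.append(idx)
        weights.append((tree.capacity * prob) ** (-beta))
    max_w = max(weights)
    return indices, [w / max_w for w in weights]


tree = SumTree(4)
for i, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(i, p)
batch_indices, is_weights = sample(tree, batch_size=2)
```

Dividing the weights by the largest one in the batch keeps them at or below 1, so the correction only ever scales updates down.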
Why 'we chose SAC' is a better answer than 'we used SAC.' Building a 3-seed comparison that survives an interview.
I've been running SAC for four years. When asked why, the only honest answer was 'it worked first.' That answer fails an interview. This post documents the afternoon I built a proper PPO from scratch (Clipped Objective + GAE(λ) + KL early-stop + Tanh-squashed Gaussian), wired it into the same env SAC uses, and ran a 3-seed comparison. The result isn't just numbers — it's a defensible design choice.
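Two of the PPO pieces named above, GAE(λ) and the clipped objective, are compact enough to sketch in plain Python. This is an illustrative version under simplifying assumptions (one unbroken trajectory segment with no episode-termination flags, scalar inputs instead of tensors); the function names and default hyperparameters are not taken from the post.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one trajectory segment.

    values[t] is the critic's estimate for the state at step t and
    last_value is the estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped objective for one sample (to be maximised)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to push the ratio far
    # outside the [1 - eps, 1 + eps] band in the favourable direction.
    return min(ratio * advantage, clipped_ratio * advantage)


# With gamma = lam = 1 and a zero critic, advantages reduce to reward-to-go.
adv = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0)
```

Note that the clip is one-sided in effect: when a large ratio makes the objective worse (for example a ratio of 1.5 with a negative advantage), the unclipped term is the minimum and the penalty is kept in full.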
An opt-in tracker that's a no-op without config, a Postgres-backed deployment, and a measured CPU latency win from 10 lines of actor changes.
Two upgrades, same afternoon. First, MLflow tracking wrapped as a context manager that degrades to a no-op when MLFLOW_TRACKING_URI is unset, which keeps the core trainer free of hard dependencies. Second, torch.jit.script on the SAC actor after three small compatibility edits: a measured ×2.13 speedup at batch size 1 on CPU, falling to ×1.08 at batch 128. The MLflow server runs under docker-compose with Postgres, so runs persist across deploys.
A tiny serving layer for RL agents that picks up new training runs without a restart. No watchers, no pub/sub, no Kubernetes.
The training loop drops a new `.pt` into cache/models/rl/stat_pair/ every few hours. The serving process should pick it up — without a restart, without a file watcher, without race conditions. I built a 270-line FastAPI router that does this with one `os.stat()` call per request. Seven end-to-end tests cover cold start, auto-swap, forced reload, and shape validation.
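The stat-per-request idea can be sketched without any web framework. The version below assumes the trainer publishes to a single known file path and replaces it atomically; `HotReloader`, `publish_atomically`, and the `loader` callable are hypothetical names for illustration, and the real router may track the directory differently.

```python
import os
import threading


def publish_atomically(path, data):
    """Writer side: write to a temp file, then swap it into place.

    os.replace is atomic on the same filesystem, so a reader sees either
    the old file or the new one, never a half-written checkpoint.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as fh:
        fh.write(data)
    os.replace(tmp_path, path)


class HotReloader:
    """Reload a model only when the file's stat signature changes."""

    def __init__(self, path, loader):
        self._path = path
        self._loader = loader      # callable: path -> loaded model
        self._lock = threading.Lock()
        self._stamp = None
        self._model = None

    def get(self):
        with self._lock:
            st = os.stat(self._path)   # the one stat call per request
            stamp = (st.st_ino, st.st_mtime_ns, st.st_size)
            if stamp != self._stamp:
                # New or replaced file: load it and remember its signature.
                self._model = self._loader(self._path)
                self._stamp = stamp
            return self._model
```

A request handler would call `get()` and run inference on whatever it returns. Because the stat and the swap happen under one lock, two concurrent requests cannot both reload the same new file or leave a stale signature behind.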