Notes · Build Logs
Reinforcement learning, MLOps, and quantitative finance — grounded in a four-year solo build of a SAC pair-trading system. Every post links to runnable code and real benchmark numbers.
The follow-up: I promised to wire the tail-risk estimate into live sizing. Here's the opt-in, one-line patch that does it.
QR-DQN gave me a CVaR number. A number is not a strategy. This post wires that CVaR into the actual PortfolioSimulator path via a four-tier scaler (target / scale / floor / veto), adds a mixin so every RL signal class inherits the capability, and keeps the change fully opt-in — existing backtests stay bit-for-bit identical unless you drop a QR-DQN checkpoint into the cache.
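As a rough illustration of the four-tier idea, here is a minimal sketch of a CVaR-to-position-size mapping. The function name and threshold values are hypothetical, not the post's actual config:

```python
def cvar_position_scale(cvar, target=-0.01, floor=-0.03, veto=-0.05):
    """Map a CVaR estimate to a position multiplier via four tiers.

    Illustrative thresholds only (CVaR as a fractional return):
    - target: CVaR better than this -> full size
    - scale:  between target and floor -> linear ramp down
    - floor:  between floor and veto -> minimum size
    - veto:   worse than veto -> no trade
    """
    if cvar >= target:
        return 1.0                                   # tier 1: full size
    if cvar >= floor:
        # tier 2: ramp linearly from full size down to the floor size
        return 0.25 + 0.75 * (cvar - floor) / (target - floor)
    if cvar >= veto:
        return 0.25                                  # tier 3: floor size
    return 0.0                                       # tier 4: veto
```

The veto tier is what makes the scaler more than cosmetic: a sufficiently bad tail estimate zeroes the trade entirely rather than merely shrinking it.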
Why expected Q-values aren't enough, and what 51 quantiles get you that SAC can't.
SAC and PPO learn only the expected return — they cannot distinguish a stable +0.2 from a +0.2 mean with a fat negative tail. QR-DQN learns the full return distribution per action, so CVaR / Expected Shortfall fall out for free. I implemented it in my pair-trading stack and ran a three-way SAC vs PPO vs QR-DQN benchmark with real CVaR columns.
Prioritized Experience Replay in 210 lines of numpy. No dependencies, O(log N) sampling, full integration with twin-critic SAC.
Uniform replay is wasteful: rare but important transitions (like regime breaks) take forever to re-surface. Schaul et al.'s Prioritized Experience Replay (ICLR 2016) fixes that with an O(log N) SumTree sampler and importance-sampling weights that correct the sampling bias. This post walks through the implementation, the SumTree invariants, the twin-critic TD-error aggregation, and the one-line switch in SACTrainer.
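The core of the O(log N) claim is the SumTree. A stripped-down sketch of the two operations that matter (class and method names are illustrative, not the post's exact code):

```python
import numpy as np

class SumTree:
    """Binary tree over priorities: leaves hold p_i, each internal node
    holds the sum of its children, so the root is the total priority and
    proportional sampling is an O(log N) descent from the root."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes + leaves

    def update(self, idx, priority):
        i = idx + self.capacity - 1              # leaf position
        delta = priority - self.tree[i]
        self.tree[i] += delta
        while i != 0:                            # propagate delta to root
            i = (i - 1) // 2
            self.tree[i] += delta

    def sample(self, s):
        """Find the leaf whose cumulative-priority interval contains s."""
        i = 0
        while 2 * i + 1 < len(self.tree):        # descend until a leaf
            left = 2 * i + 1
            if s <= self.tree[left]:
                i = left
            else:
                s -= self.tree[left]
                i = left + 1
        return i - (self.capacity - 1)           # back to buffer index
```

The invariant the post refers to — every internal node equals the sum of its children — is what `update` maintains with a single root-ward pass.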
Why 'we chose SAC' is a better answer than 'we used SAC.' Building a 3-seed comparison that survives an interview.
I've been running SAC for four years. When asked why, the only honest answer was 'it worked first.' That answer fails an interview. This post documents the afternoon I built a proper PPO from scratch (Clipped Objective + GAE(λ) + KL early-stop + Tanh-squashed Gaussian), wired it into the same env SAC uses, and ran a 3-seed comparison. The result isn't just numbers — it's a defensible design choice.
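Of the four PPO ingredients listed, the clipped objective is the one that fits in a few lines. A sketch of the surrogate loss (written as a loss to minimize; numpy stands in for torch to keep it self-contained):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: -E[min(r * A, clip(r, 1-eps, 1+eps) * A)].

    ratio = pi_new(a|s) / pi_old(a|s). Clipping removes the incentive to
    push the ratio outside [1-eps, 1+eps], so one batch of advantages
    can't drag the policy arbitrarily far from the data-collecting one.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()
```

The `min` is the asymmetry worth internalizing: the objective is pessimistic in both directions, which is why PPO tolerates several epochs of reuse per batch where vanilla policy gradient cannot.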
An opt-in tracker that's a no-op without config, a Postgres-backed deployment, and a measured CPU latency win from 10 lines of actor changes.
Two upgrades, same afternoon. MLflow tracking wrapped as a context manager with a graceful no-op when MLFLOW_TRACKING_URI is unset — keeps the core trainer free of hard dependencies. torch.jit.script on the SAC actor after three small compatibility edits — a measured 2.13× speedup at batch size 1 on CPU, falling to 1.08× at batch size 128. The MLflow server runs via docker-compose with Postgres so runs persist across deploys.
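The no-op-without-config pattern is small enough to sketch in full. This is an illustrative version (the name `maybe_track` and the yielded interface are mine, not necessarily the post's):

```python
import os
from contextlib import contextmanager

@contextmanager
def maybe_track(run_name):
    """Yield a metric logger that forwards to MLflow when
    MLFLOW_TRACKING_URI is set, and silently no-ops otherwise.
    mlflow is imported lazily, so it is never a hard dependency."""
    if os.environ.get("MLFLOW_TRACKING_URI"):
        import mlflow                            # only imported when configured
        with mlflow.start_run(run_name=run_name):
            yield mlflow.log_metric
    else:
        yield lambda *args, **kwargs: None       # no-op logger
```

The training loop calls `log("loss", value)` unconditionally; whether anything happens is purely an environment-variable concern, which is what keeps the trainer import-clean.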
A tiny serving layer for RL agents that picks up new training runs without a restart. No watchers, no pub/sub, no Kubernetes.
The training loop drops a new `.pt` into cache/models/rl/stat_pair/ every few hours. The serving process should pick it up — without a restart, without a file watcher, without race conditions. I built a 270-line FastAPI router that does this with one `os.stat()` call per request. Seven end-to-end tests cover cold start, auto-swap, forced reload, and shape validation.
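The one-`os.stat()`-per-request idea reduces to a small mtime-gated cache. A minimal sketch assuming the loader is injected (class and method names are illustrative; the real router adds shape validation and a forced-reload path):

```python
import os
import threading

class ModelCache:
    """Reload-on-change: one os.stat() per request, swapping the model
    only when the file's mtime moves. The lock plus re-check prevents
    two concurrent requests from loading the same checkpoint twice."""

    def __init__(self, path, loader):
        self.path, self.loader = path, loader
        self.mtime, self.model = None, None
        self.lock = threading.Lock()

    def get(self):
        mtime = os.stat(self.path).st_mtime_ns   # the only syscall per hit
        if mtime != self.mtime:
            with self.lock:
                if mtime != self.mtime:          # double-check under the lock
                    self.model = self.loader(self.path)
                    self.mtime = mtime
        return self.model
```

On the happy path (checkpoint unchanged) this is one stat call and a comparison, which is why no watcher thread or pub/sub channel is needed.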