TL;DR. Scaling QR-DQN to more symbols made it worse. The ensemble architecture (GICS sector groups, HMM regime detection, regime-conditioned strategies, and a Thompson Sampling bandit for group selection) produced a Sharpe ratio of 1.97 and a +20.80% return over 275 out-of-sample (OOS) days, while SPY returned 0%.
0. The problem: QR-DQN hit a plateau
I ran 13 rounds of QR-DQN experiments (R0 through R15, some skipped). The summary table lists 12 of them:
Round  Symbols  Envs  best_eval  Notes
─────  ───────  ────  ─────────  ─────────────────────────────
R0          20    10     +1.203  baseline, tiny universe
R1          50    25     +2.441  more pairs helped
R2         100    48     +3.912  still scaling
R3         150    72     +4.150  diminishing returns start
R4         200    96     +4.380  marginal improvement
R5         100    44     +5.102  smaller, curated set
R6          92    44     +5.857  *** production model ***
R7         200    96     +4.201  back to large, worse
R8         300   140     +3.890  more data = worse
R10        449   213     +2.744  full universe, worst
R12        449   213     +2.901  retry with tuning, still bad
R15        100    48     +5.340  back to small, recovers
The pattern is clear: small, high-quality groups beat large noisy datasets.
R6 was the sweet spot — 92 symbols, 44 training environments, best eval reward +5.857. When I pushed to 449 symbols (213 envs), performance dropped to +2.744. The model was drowning in regime-mismatched pairs.
The key insight: pairs in the Tech sector behave nothing like pairs in Utilities. A single model trying to learn one policy across all sectors is forced to average over fundamentally different dynamics. The solution is not a bigger model. The solution is an ensemble.
1. Ensemble architecture overview
                         450 symbols
                              |
                      GICS sector split
                              |
              ┌───────────────┼───────────────┐
              v               v               v
         Tech_0 (28)    Health_0 (22)   Energy_0 (18)   ... 19 groups
              |               |               |
         HMM 3-state     HMM 3-state     HMM 3-state
        regime detect   regime detect   regime detect
              |               |               |
       ┌──────┼──────┐   ┌────┼──────┐   ┌────┼──────┐
       v      v      v   v    v      v   v    v      v
      Bull   Side   Bear     ...
       |      |      |
   Momentum QR-DQN Defensive
    (top 3) pairs  shorts
       |      |      |
       └──────┼──────┘
              v
  Thompson Sampling Bandit
 (group-level arm selection)
              |
              v
     Portfolio Allocator
The pipeline has four layers:
- GICS sector grouping — 450 symbols into 19 groups of 15-30 each
- HMM regime detection — 3-state (Bull / Sideways / Bear) per group
- Regime-conditioned strategies — different strategy per regime
- Thompson Sampling bandit — learns which groups to trust
2. GICS sector grouping
Why GICS? Because pairs within the same sector share fundamental drivers. Tech stocks co-move on semiconductor demand, rate expectations, and AI capex. Energy stocks co-move on oil prices, OPEC decisions, and rig counts. Cross-sector pairs introduce noise that QR-DQN cannot model away.
SECTOR_GROUPS = {
    "Tech_0":       ["AAPL", "MSFT", "NVDA", "AMD", "INTC", ...],   # 28
    "Tech_1":       ["CRM", "ADBE", "NOW", "SNOW", "PLTR", ...],    # 24
    "Health_0":     ["JNJ", "UNH", "PFE", "MRK", "ABT", ...],       # 22
    "Health_1":     ["ISRG", "DXCM", "VEEV", "HIMS", ...],          # 18
    "Energy_0":     ["XOM", "CVX", "COP", "SLB", "EOG", ...],       # 18
    "Finance_0":    ["JPM", "BAC", "GS", "MS", "WFC", ...],         # 26
    "Finance_1":    ["AXP", "SCHW", "BLK", "ICE", ...],             # 20
    "Consumer_0":   ["AMZN", "TSLA", "HD", "NKE", "SBUX", ...],     # 25
    "Consumer_1":   ["PG", "KO", "PEP", "CL", "COST", ...],         # 22
    "Industrial_0": ["CAT", "DE", "HON", "GE", "RTX", ...],         # 24
    "Industrial_1": ["UPS", "FDX", "LMT", "NOC", ...],              # 19
    "Utility_0":    ["NEE", "DUK", "SO", "D", "AEP", ...],          # 16
    "Material_0":   ["LIN", "APD", "SHW", "ECL", "NEM", ...],       # 17
    "RealEstate_0": ["AMT", "PLD", "CCI", "EQIX", ...],             # 15
    "Comm_0":       ["GOOG", "META", "NFLX", "DIS", ...],           # 20
    "Comm_1":       ["T", "VZ", "TMUS", "CHTR", ...],               # 16
    "Biotech_0":    ["AMGN", "GILD", "REGN", "VRTX", ...],          # 18
    "Semicon_0":    ["TSM", "AVGO", "QCOM", "MU", "LRCX", ...],     # 22
    "Software_0":   ["ORCL", "INTU", "PANW", "FTNT", ...],          # 19
}
# Total: 19 groups, ~450 symbols
Each group has 15-30 symbols, which keeps every sub-problem at or below the scale where R6 performed best (92 symbols, 44 envs) and far from the 449-symbol universe where performance degraded. The idea is: if a small, curated set was the sweet spot, then no single model should ever see more than that.
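The post does not say how a large sector was cut into sub-groups such as Tech_0 and Tech_1. As a minimal sketch, assuming a simple size-based split (the author's split may well be thematic instead), the following hypothetical helpers divide one sector into near-equal chunks and flag any group outside the 15-30 range. The names `split_sector` and `size_violations` are illustrative and not part of the original code.

```python
import math


def split_sector(sector: str, symbols: list[str], max_size: int = 30) -> dict[str, list[str]]:
    """Split one sector into near-equal sub-groups named '<sector>_<i>'."""
    n_groups = max(1, math.ceil(len(symbols) / max_size))
    base, extra = divmod(len(symbols), n_groups)
    groups, start = {}, 0
    for i in range(n_groups):
        # The first `extra` groups take one additional symbol
        size = base + (1 if i < extra else 0)
        groups[f"{sector}_{i}"] = symbols[start:start + size]
        start += size
    return groups


def size_violations(groups: dict[str, list[str]], lo: int = 15, hi: int = 30) -> dict[str, int]:
    """Return {group_name: size} for every group outside [lo, hi]."""
    return {name: len(syms) for name, syms in groups.items() if not lo <= len(syms) <= hi}


# Usage with placeholder tickers (hypothetical, for illustration only)
tech = split_sector("Tech", [f"SYM{i}" for i in range(52)])
print({name: len(syms) for name, syms in tech.items()})
print(size_violations(tech))
```

A size-based split ignores how similar the symbols inside a chunk are, so it is only a starting point before any manual curation.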
3. HMM regime detection
Each sector group gets its own 3-state Hidden Markov Model trained on rolling features derived from the group's equal-weighted daily returns (the feature list follows the code).
import numpy as np
from hmmlearn.hmm import GaussianHMM

class SectorRegimeDetector:
    def __init__(self, n_states=3, lookback=252):
        self.hmm = GaussianHMM(
            n_components=n_states,
            covariance_type="diag",
            n_iter=100,
            random_state=42,
        )
        self.lookback = lookback
        self.state_labels = {}  # mapped after fit

    def fit(self, group_returns: np.ndarray):
        """group_returns: (T, n_features) where n_features = [mean_ret, corr, vol]"""
        self.hmm.fit(group_returns[-self.lookback:])
        self._label_states()

    def _label_states(self):
        """Label states by mean return: highest=Bull, lowest=Bear, middle=Sideways"""
        means = self.hmm.means_[:, 0]  # column 0 = mean return feature
        order = np.argsort(means)
        self.state_labels = {
            int(order[0]): "Bear",
            int(order[1]): "Sideways",
            int(order[2]): "Bull",
        }

    def predict_regime(self, recent_features: np.ndarray) -> str:
        # Input must use the same (T, n_features) layout as fit()
        state = self.hmm.predict(recent_features[-20:])[-1]
        return self.state_labels[int(state)]
The HMM features per group:
Feature 0: equal-weighted mean daily return (20d rolling)
Feature 1: mean pairwise correlation (20d rolling)
Feature 2: group volatility (20d rolling stdev of returns)
The correlation feature is important. In bear markets, correlations spike (everyone sells everything). In bull markets, correlations drop (stock picking matters). This feature alone improves regime classification accuracy by ~12% over using returns only.
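To make the three features concrete, here is a minimal standard-library sketch of how one row of the HMM input could be computed from per-symbol daily returns. The function names, the use of population standard deviation, and the choice to measure volatility on the equal-weighted group return are assumptions; the post does not show its feature code.

```python
from itertools import combinations
from statistics import fmean, pstdev


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def group_features(returns: dict[str, list[float]], window: int = 20) -> list[float]:
    """returns: symbol -> daily returns (equal length, oldest first).

    Output: [mean_ret, mean_pairwise_corr, vol] over the last `window` days,
    in the same column order as the feature list above.
    """
    recent = {sym: r[-window:] for sym, r in returns.items()}
    n_days = len(next(iter(recent.values())))
    # Equal-weighted group return for each day in the window
    ew = [fmean([r[t] for r in recent.values()]) for t in range(n_days)]
    # Average correlation over all unordered symbol pairs
    corrs = [pearson(recent[a], recent[b]) for a, b in combinations(recent, 2)]
    return [fmean(ew), fmean(corrs) if corrs else 0.0, pstdev(ew)]
```

Calling this once per day on a sliding window and stacking the rows would give the (T, 3) array that the detector's fit method expects.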
Regime transition matrix (fitted on Tech_0)
              To:
From:    Bull    Side    Bear
Bull  [  0.92    0.06    0.02  ]
Side  [  0.08    0.84    0.08  ]
Bear  [  0.03    0.12    0.85  ]
Mean durations: Bull=12.5d, Sideways=6.3d, Bear=6.7d
Bull regimes are sticky (0.92 self-transition). Bear regimes are shorter than bull regimes but sharp. Sideways has the lowest self-transition (0.84), so it is the least persistent state. It is also where pair trading works best, because mean reversion is strongest when the market is range-bound.
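The mean durations follow from the diagonal of the transition matrix: for a Markov chain, the expected number of consecutive days spent in state i is 1 / (1 - p_ii). A short sketch that reproduces the figures from the Tech_0 matrix (the dictionary layout and function name are illustrative):

```python
# Transition matrix for Tech_0, copied from the table above
P = {
    "Bull": {"Bull": 0.92, "Side": 0.06, "Bear": 0.02},
    "Side": {"Bull": 0.08, "Side": 0.84, "Bear": 0.08},
    "Bear": {"Bull": 0.03, "Side": 0.12, "Bear": 0.85},
}


def expected_durations(matrix: dict[str, dict[str, float]]) -> dict[str, float]:
    """Expected run length per state: 1 / (1 - self-transition probability)."""
    return {state: 1.0 / (1.0 - row[state]) for state, row in matrix.items()}


# Each row of a transition matrix must sum to 1
for state, row in P.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, state

for state, days in expected_durations(P).items():
    print(f"{state}: {days:.2f} days")
```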
4. Regime-conditioned strategies
Each regime dispatches to a different strategy:
Bull regime: Momentum (top 3 by 20d return)
def bull_strategy(group_symbols, price_data):
    """In bull markets, ride momentum; pair trading underperforms."""
    returns_20d = {}
    for sym in group_symbols:
        # [-21] is 20 trading days before the latest close at [-1]
        ret = (price_data[sym][-1] / price_data[sym][-21] - 1)
        returns_20d[sym] = ret
    top_3 = sorted(returns_20d, key=returns_20d.get, reverse=True)[:3]
    return [Signal(sym=s, direction="long", weight=1/3) for s in top_3]
Why not pair trade in bull markets? Because in a strong uptrend, mean-reversion signals get crushed by momentum. The spread widens and keeps widening. R6 backtest confirmed this: win rate dropped from 58% to 41% during bull regimes.
Sideways regime: QR-DQN pair trading (R6 production model)
This is where the R6 model shines. Sideways markets have mean-reverting spreads, moderate volatility, and stable correlations.
def sideways_strategy(group_symbols, price_data, qrdqn_agent):
    """Sideways -> QR-DQN pair trading, the sweet spot."""
    pairs = select_cointegrated_pairs(group_symbols, price_data, top_k=5)
    signals = []
    for sym_a, sym_b in pairs:
        state_28d = build_state_vector(sym_a, sym_b, price_data)
        # state: [spread, z_score, half_life, vol_ratio, corr_60d,
        #         rsi_a, rsi_b, macd_a, macd_b, bb_pos_a, bb_pos_b,
        #         volume_ratio_a, volume_ratio_b, atr_a, atr_b,
        #         sector_momentum, vix, rate_spread, ...] -> 28 dims
        action = qrdqn_agent.act(state_28d)  # 0=hold, 1=long, 2=short
        cvar = qrdqn_agent.cvar(state_28d, action, alpha=0.05)
        unc = qrdqn_agent.uncertainty(state_28d, action)
        # Confidence scaling: high uncertainty -> smaller position
        confidence = max(0.2, 1.0 - unc / 2.0)
        if action == 1:  # long spread
            signals.append(Signal(
                sym_long=sym_a, sym_short=sym_b,
                confidence=confidence,
                cvar=cvar,
            ))
        elif action == 2:  # short spread
            signals.append(Signal(
                sym_long=sym_b, sym_short=sym_a,
                confidence=confidence,
                cvar=cvar,
            ))
    return signals
Bear regime: Defensive shorts (bottom 2 by momentum)
def bear_strategy(group_symbols, price_data):
    """In bear markets, short the weakest names in the group."""
    returns_20d = {}
    for sym in group_symbols:
        # [-21] is 20 trading days before the latest close at [-1]
        ret = (price_data[sym][-1] / price_data[sym][-21] - 1)
        returns_20d[sym] = ret
    bottom_2 = sorted(returns_20d, key=returns_20d.get)[:2]
    return [Signal(sym=s, direction="short", weight=1/2) for s in bottom_2]
Fallback: z-score threshold
When the QR-DQN model is unavailable (missing checkpoint, dimension mismatch, corrupt file), the sideways strategy falls back to a classic z-score threshold:
def zscore_fallback(sym_a, sym_b, price_data, entry=2.0, exit=0.5):
    """Classic stat-arb fallback when QR-DQN is unavailable."""
    # Note: `exit` is not used below; this function only emits entry signals.
    spread = compute_spread(sym_a, sym_b, price_data)
    z = (spread[-1] - spread.mean()) / (spread.std() + 1e-8)  # guard against zero std
    if z > entry:  # spread is rich: short A, long B
        return Signal(sym_long=sym_b, sym_short=sym_a, confidence=0.5)
    elif z < -entry:  # spread is cheap: long A, short B
        return Signal(sym_long=sym_a, sym_short=sym_b, confidence=0.5)
    return None
5. QR-DQN integration details
The R6 production model specifics:
Architecture: MLP 28 -> 128 -> 128 -> 3*51
State dim: 28
Actions: 3 (hold / long_spread / short_spread)
Quantiles: 51 (tau_i = (2i-1) / (2*51), i=1..51)
best_eval: +5.857
Training envs: 44
Training steps: 500K
Optimizer: Adam, lr=6.25e-5
Batch size: 32
Replay buffer: 100K, PER alpha=0.6, beta annealed 0.4->1.0
n-step: 3
Gamma: 0.99
Target update: every 8000 steps
Building the 28-dim state vector
import numpy as np
from scipy.stats import kurtosis, skew

# calc_*, volume_ratio and sector_momentum are project helpers not shown here.
def build_state_vector(sym_a, sym_b, data, lookback=60):
    """Build the 28-dimensional state vector from raw price data."""
    pa = np.asarray(data[sym_a][-lookback:], dtype=float)
    pb = np.asarray(data[sym_b][-lookback:], dtype=float)
    spread = pa / pb
    z_score = (spread[-1] - spread.mean()) / (spread.std() + 1e-8)
    half_life = calc_half_life(spread)
    state = np.array([
        spread[-1],                            # 0: raw spread
        z_score,                               # 1: z-score
        half_life,                             # 2: half-life of mean reversion
        pa.std() / (pb.std() + 1e-8),          # 3: volatility ratio
        np.corrcoef(pa, pb)[0, 1],             # 4: correlation (60d)
        calc_rsi(pa, 14),                      # 5: RSI sym_a
        calc_rsi(pb, 14),                      # 6: RSI sym_b
        calc_macd(pa),                         # 7: MACD sym_a
        calc_macd(pb),                         # 8: MACD sym_b
        calc_bb_position(pa),                  # 9: Bollinger band position sym_a
        calc_bb_position(pb),                  # 10: BB position sym_b
        volume_ratio(data, sym_a),             # 11: volume ratio sym_a (today/20d avg)
        volume_ratio(data, sym_b),             # 12: volume ratio sym_b
        calc_atr(data, sym_a, 14),             # 13: ATR sym_a
        calc_atr(data, sym_b, 14),             # 14: ATR sym_b
        sector_momentum(data, sym_a),          # 15: sector momentum
        data["VIX"][-1],                       # 16: VIX
        data["RATE_SPREAD"][-1],               # 17: 10y-2y rate spread
        np.corrcoef(pa[-20:], pb[-20:])[0, 1], # 18: short-term corr (20d)
        spread[-5:].mean() - spread.mean(),    # 19: spread momentum (5d)
        calc_hurst(spread),                    # 20: Hurst exponent
        skew(np.diff(np.log(pa))),             # 21: return skewness sym_a
        skew(np.diff(np.log(pb))),             # 22: return skewness sym_b
        kurtosis(np.diff(np.log(pa))),         # 23: return kurtosis sym_a
        kurtosis(np.diff(np.log(pb))),         # 24: return kurtosis sym_b
        calc_adx(data, sym_a, 14),             # 25: ADX sym_a
        calc_adx(data, sym_b, 14),             # 26: ADX sym_b
        data["SPY_RET_20D"][-1],               # 27: market regime proxy (latest value)
    ])
    return state
Using CVaR and uncertainty for sizing
# After getting action from agent
cvar_5pct = qrdqn_agent.cvar(state, action, alpha=0.05)
uncertainty = qrdqn_agent.uncertainty(state, action)
# CVaR-based sizing (from previous post)
skip_trade = False
if cvar_5pct < -0.15:  # veto threshold
    skip_trade = True
    size_mult = 0.0  # defined so the discount below cannot raise NameError
elif cvar_5pct < -0.05:  # target threshold: scale down linearly, floor at 0.2
    size_mult = max(0.2, 1.0 - (-0.05 - cvar_5pct) / 0.10)
else:
    size_mult = 1.0
# Uncertainty discount
size_mult *= max(0.2, 1.0 - uncertainty / 2.0)
6. Thompson Sampling bandit
The final layer: which sector groups should we actually trade? Not all 19 groups produce profitable signals at any given time.
The idea
Each group is a bandit arm. We model each arm's success probability with a Beta distribution. After each trade, we update the posterior:
Trade PnL > 0   ->  reward = 1  ->  alpha += 1
Trade PnL <= 0  ->  reward = 0  ->  beta += 1
At each timestep, we sample from each arm's Beta(alpha, beta) and trade the top-K groups with the highest samples.
import numpy as np

class ThompsonSamplingBandit:
    def __init__(self, arms: list[str], top_k: int = 5):
        self.arms = arms
        self.top_k = top_k
        # Prior: Beta(1, 1) = uniform
        self.alpha = {arm: 1.0 for arm in arms}
        self.beta = {arm: 1.0 for arm in arms}

    def select_arms(self) -> list[str]:
        """Sample from each arm's posterior and pick top_k."""
        samples = {}
        for arm in self.arms:
            samples[arm] = np.random.beta(self.alpha[arm], self.beta[arm])
        ranked = sorted(samples, key=samples.get, reverse=True)
        return ranked[:self.top_k]

    def update(self, arm: str, reward: float):
        """Update posterior with trade outcome (reward is the trade PnL)."""
        if reward > 0:
            self.alpha[arm] += 1.0
        else:
            self.beta[arm] += 1.0

    def stats(self) -> dict:
        """Return posterior mean, std and trade count for each arm."""
        result = {}
        for arm in self.arms:
            a, b = self.alpha[arm], self.beta[arm]
            result[arm] = {
                "mean": a / (a + b),
                "std": np.sqrt(a * b / ((a+b)**2 * (a+b+1))),
                "trades": int(a + b - 2),  # subtract the Beta(1, 1) prior counts
            }
        return result
Why Thompson Sampling over epsilon-greedy or UCB?
- Epsilon-greedy explores uniformly — wastes trades on arms that are clearly bad.
- UCB requires tuning the exploration constant and is deterministic — in a non-stationary environment (financial markets), you want stochastic exploration.
- Thompson Sampling naturally balances explore/exploit through posterior sampling. Arms with few observations have wide posteriors (high variance samples), so they get explored. Arms with many observations converge to their true win rate.
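The explore/exploit behaviour described in the last bullet can be seen in a small simulation. This is a sketch under stated assumptions: three synthetic arms with made-up win rates, one arm traded per round (top-1 instead of the post's top-5), and the standard library's `random.betavariate` in place of NumPy so the snippet has no dependencies.

```python
import random


def run_thompson(true_win_rates: dict[str, float], rounds: int = 3000, seed: int = 0) -> dict[str, int]:
    """Simulate Beta-Bernoulli Thompson Sampling and return pulls per arm."""
    rng = random.Random(seed)
    alpha = {arm: 1.0 for arm in true_win_rates}  # Beta(1, 1) prior
    beta = {arm: 1.0 for arm in true_win_rates}
    pulls = {arm: 0 for arm in true_win_rates}
    for _ in range(rounds):
        # Draw one sample per arm from its posterior, trade the highest draw
        samples = {arm: rng.betavariate(alpha[arm], beta[arm]) for arm in true_win_rates}
        chosen = max(samples, key=samples.get)
        pulls[chosen] += 1
        # Simulated trade outcome: win with the arm's (hidden) true win rate
        if rng.random() < true_win_rates[chosen]:
            alpha[chosen] += 1.0
        else:
            beta[chosen] += 1.0
    return pulls


# Synthetic win rates, chosen only to make the effect easy to see
pulls = run_thompson({"good": 0.8, "mid": 0.5, "bad": 0.2})
print(pulls)
```

Early rounds are spread across all arms because every posterior is wide; as evidence accumulates, most pulls go to the arm with the best win rate while weaker arms are sampled only occasionally.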
Learned arm quality (after 275 OOS days)
Group         alpha  beta  mean   std    trades
────────────  ─────  ────  ─────  ─────  ──────
Tech_0           68    54  0.557  0.045     120
Semicon_0        59    48  0.551  0.048     105
Software_0       52    44  0.542  0.051      94
Comm_0           48    42  0.533  0.053      88
Finance_0        55    50  0.524  0.049     103
Consumer_0       47    44  0.516  0.052      89
Industrial_0     43    42  0.506  0.053      83
Tech_1           39    39  0.500  0.057      76
Consumer_1       36    37  0.493  0.058      71
Material_0       31    33  0.484  0.063      62
RealEstate_0     28    31  0.475  0.065      57
Finance_1        30    34  0.469  0.063      62
Utility_0        25    30  0.455  0.067      53
Industrial_1     24    30  0.444  0.068      52
Energy_0         27    35  0.435  0.065      60
Comm_1           22    30  0.423  0.069      50
Biotech_0        20    30  0.400  0.070      48
Health_1         18    29  0.383  0.071      45
Health_0         18    30  0.375  0.070      46
Tech_0 has the highest posterior mean win rate (0.557), which suggests tech pairs are the most reliable group for this system. Health_0 is the worst (0.375). The bandit automatically reduces allocation to Health and increases allocation to Tech/Semicon over time.
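Posterior means alone do not say how well separated the arms are. As a rough check, assuming the alpha and beta values in the table and independent posteriors, a Monte Carlo estimate gives the probability that an arm's true win rate exceeds 0.5 and the probability that one arm beats another. The function names are illustrative.

```python
import random


def prob_above(alpha: float, beta: float, threshold: float = 0.5,
               n: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(win rate > threshold) under Beta(alpha, beta)."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(alpha, beta) > threshold for _ in range(n))
    return hits / n


def prob_first_beats_second(arm1: tuple[float, float], arm2: tuple[float, float],
                            n: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(p1 > p2) for two independent Beta posteriors."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(*arm1) > rng.betavariate(*arm2) for _ in range(n))
    return hits / n


tech_0 = (68, 54)    # (alpha, beta) from the table
health_0 = (18, 30)
p_tech_edge = prob_above(*tech_0)
p_tech_over_health = prob_first_beats_second(tech_0, health_0)
print(f"P(Tech_0 win rate > 0.5)  ~ {p_tech_edge:.2f}")
print(f"P(Tech_0 beats Health_0)  ~ {p_tech_over_health:.2f}")
```

With roughly 50 to 120 trades per arm, neighbouring rows in the table overlap heavily, so the ranking is firmer between the top and bottom of the table than between adjacent groups.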
7. Putting it all together
The daily loop:
def daily_ensemble_step(date, price_data, sector_groups, regime_detectors,
                        qrdqn_agent, bandit, portfolio):
    # 1. Select top-K groups via Thompson Sampling
    active_groups = bandit.select_arms()  # top 5
    all_signals = []
    for group_name in active_groups:
        symbols = sector_groups[group_name]
        detector = regime_detectors[group_name]
        # 2. Detect current regime
        group_returns = compute_group_returns(symbols, price_data)
        regime = detector.predict_regime(group_returns)
        # 3. Dispatch to regime-conditioned strategy
        if regime == "Bull":
            signals = bull_strategy(symbols, price_data)
        elif regime == "Sideways":
            try:
                signals = sideways_strategy(symbols, price_data, qrdqn_agent)
            except ModelError:
                signals = zscore_fallback_batch(symbols, price_data)
        elif regime == "Bear":
            signals = bear_strategy(symbols, price_data)
        for s in signals:
            s.group = group_name
            s.regime = regime
        all_signals.extend(signals)
    # 4. Execute signals and record PnL
    for signal in all_signals:
        pnl = portfolio.execute(signal)
        bandit.update(signal.group, pnl)
8. Backtest results
Setup
Period: 2025-03-15 to 2025-12-15 (275 calendar days, roughly 190 trading days)
Universe: 450 symbols, 19 GICS groups
Initial capital: 1,000,000 USD
Max positions: 20 concurrent
Position sizing: CVaR-scaled, max 5% per position
Slippage: 5 bps per side
Commission: 1 bp per side
Equity curve (ASCII)
Portfolio NAV (normalized to 100)
122 |                                                 *****
120 |                                             ****
118 |                                        *****
116 |                                    ****
114 |                                ****
112 |                            ****
110 |                       *****
108 |                   ****
106 |              *****
104 |          ****
102 |     *****
100 |*****          SPY (flat, ~100 whole period)
 98 |----+----+----+----+----+----+----+----+----+----+----+
     Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
     2025
Summary statistics
Metric                Ensemble  SPY B&H  QR-DQN only
────────────────────  ────────  ───────  ───────────
Total Return           +20.80%     0.0%       +8.12%
Annualized Return      +28.54%     0.0%      +11.14%
Sharpe Ratio              1.97     0.00         1.12
Sortino Ratio             2.84     0.00         1.48
Max Drawdown             4.93%        -        7.21%
Win Rate                 56.1%        -        52.3%
Avg Win / Avg Loss        1.42        -         1.28
Profit Factor             1.82        -         1.40
Total Trades             1,847        -          612
Avg Holding Period    3.2 days        -     4.1 days
Monthly breakdown
Month     Return  Trades  Win%   Sharpe  Regime Mix (B/S/Bear)
────────  ──────  ──────  ─────  ──────  ─────────────────────
2025-03*  +1.12%      89  54.0%    1.45  30% / 55% / 15%
2025-04   +2.34%     198  57.1%    2.21  25% / 60% / 15%
2025-05   +2.58%     215  58.6%    2.44  20% / 65% / 15%
2025-06   +1.89%     192  55.2%    1.78  35% / 45% / 20%
2025-07   +2.71%     224  59.4%    2.62  15% / 70% / 15%
2025-08   +1.45%     178  52.8%    1.32  40% / 35% / 25%
2025-09   +2.12%     201  56.7%    2.05  25% / 55% / 20%
2025-10   +2.88%     218  58.3%    2.71  20% / 65% / 15%
2025-11   +1.92%     186  54.8%    1.83  30% / 50% / 20%
2025-12*  +1.79%     146  55.5%    1.89  25% / 60% / 15%
* partial month
Key observations
Returns accumulate roughly linearly with time. The monthly returns are consistent (1.1% - 2.9%), not front-loaded. This is supporting evidence against overfitting to one stretch of the test period, though not proof.
Win rate is modest (56.1%) but the avg win / avg loss ratio (1.42) does the work: 0.561 × 1.42 / 0.439 ≈ 1.82, which matches the reported profit factor. This is a classic pattern: RL systems don't need to be right often, they need to be right big.
The ensemble beats QR-DQN alone by about 2.5x on return (+20.80% vs +8.12%). The regime detection layer prevents pair trading in regimes where it fails (bull momentum, bear crashes).
MDD of 4.93% is about a third lower than QR-DQN alone (7.21%). The bear strategy and CVaR sizing work as intended.
Sideways is the largest regime in nine of the ten months (35-70% of the mix, and at least 50% in eight of them). Most of the market is range-bound most of the time. The ensemble profits from this structural reality.
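The post does not state how the headline figures were derived, so the following sketch checks which conventions reproduce them. Assumptions: total return as the arithmetic sum of the monthly returns, annualization over the 275 calendar days between the two backtest dates using a 365-day year, and profit factor computed from the win rate and the avg win / avg loss ratio.

```python
import math

# Monthly returns in percent, copied from the monthly breakdown table
monthly_pct = [1.12, 2.34, 2.58, 1.89, 2.71, 1.45, 2.12, 2.88, 1.92, 1.79]

# Arithmetic sum of monthly returns (matches the reported +20.80%)
total_simple_pct = sum(monthly_pct)

# Compounded return from the same monthly figures, for comparison
compounded = math.prod(1 + m / 100 for m in monthly_pct) - 1

# Annualize the reported total return over 275 calendar days (assumed convention)
total_return = 0.2080
annualized = (1 + total_return) ** (365 / 275) - 1

# Profit factor implied by win rate and payoff ratio
win_rate, payoff_ratio = 0.561, 1.42
profit_factor = win_rate * payoff_ratio / (1 - win_rate)

print(f"sum of monthly returns: {total_simple_pct:.2f}%")
print(f"compounded monthly:     {compounded:.2%}")
print(f"annualized (365/275):   {annualized:.2%}")
print(f"implied profit factor:  {profit_factor:.2f}")
```

Under these assumptions the ensemble's annualized return and profit factor land close to the reported values, and the compounded figure comes out slightly above the arithmetic sum, as expected when every month is positive.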
9. Lessons learned
1. Small high-quality datasets beat large noisy ones. R6 at 44 envs beat R10 at 213 envs by roughly 2x (+5.857 vs +2.744). Sector grouping turns the 450-symbol universe into 19 problems that each sit at or below R6's scale.
2. Regime detection is table stakes. Without HMM, the QR-DQN model is deployed in bull and bear regimes where pair trading structurally fails. The HMM acts as a gatekeeper.
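3. Thompson Sampling is a good fit when sectors rotate in and out of profitability. Posterior sampling keeps giving lower-ranked groups occasional trials while concentrating trades on groups that are currently working. One caveat: with the plain Beta update from section 6, posteriors only narrow as trades accumulate, so reacting quickly to rotation needs some form of forgetting (for example, decaying alpha and beta over time).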
4. Fallbacks are not optional. The z-score fallback fires ~8% of the time (model loading failures, dimension mismatches after feature changes, NaN states). Without it, those would be missed trades in the best regime (sideways).
5. The ensemble makes the RL component more valuable, not less. By deploying QR-DQN only where it works (sideways regimes in high-quality sector groups), its effective win rate goes from 52.3% (standalone) to 58.6% (ensemble win rate in a sideways-heavy month, May 2025).
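Lesson 3 is about non-stationarity. With the Beta(alpha, beta) update from section 6 the counts only grow, so an arm's posterior keeps narrowing even if the sector has stopped working. One common way to let it widen again is to decay the counts toward the prior. The class below is a hypothetical sketch of that idea and not part of the backtested system; `gamma` is an assumed decay factor that would need tuning.

```python
import random


class DiscountedBetaArm:
    """Beta posterior whose evidence decays toward the Beta(1, 1) prior."""

    def __init__(self, gamma: float = 0.99):
        self.gamma = gamma  # hypothetical decay factor, not from the post
        self.alpha = 1.0
        self.beta = 1.0

    def decay(self) -> None:
        # Shrink accumulated evidence toward the prior counts of 1.0
        self.alpha = 1.0 + self.gamma * (self.alpha - 1.0)
        self.beta = 1.0 + self.gamma * (self.beta - 1.0)

    def update(self, won: bool) -> None:
        # Forget a little, then add the new observation
        self.decay()
        if won:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def std(self) -> float:
        a, b = self.alpha, self.beta
        return (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5

    def sample(self, rng: random.Random) -> float:
        return rng.betavariate(self.alpha, self.beta)


arm = DiscountedBetaArm(gamma=0.9)
for _ in range(200):
    arm.update(won=True)
std_after_wins = arm.std()

# No trades for 50 days: evidence fades and the posterior widens again
for _ in range(50):
    arm.decay()
std_after_idle = arm.std()
print(arm.alpha, std_after_wins, std_after_idle)
```

With decay, the effective sample size is capped near 1 / (1 - gamma), so old wins cannot dominate forever and an idle or deteriorating group regains exploration.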
10. What's next
- Online HMM updates — currently refitting weekly, want daily incremental updates
- Per-group QR-DQN — train separate models per sector group instead of one shared model
- Multi-armed contextual bandit — replace Thompson Sampling with LinUCB that conditions on macro features (VIX, rate curve)
- Live paper trading — deploy on IBKR paper account for 90 days before going live
The code is in app/trading/rl/ensemble/ and the backtest runner is scripts/run_ensemble_backtest.py.
References
- Dabney et al., "Distributional Reinforcement Learning with Quantile Regression," AAAI 2018
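- Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE 1989
- Thompson, "On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples," Biometrika 1933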
- Chapelle & Li, "An Empirical Evaluation of Thompson Sampling," NeurIPS 2011