Ensemble RL Pair Trading — From QR-DQN Plateau to Sharpe 1.97

TL;DR. Scaling QR-DQN to more symbols made it worse. The ensemble architecture — GICS sector groups, HMM regime detection, regime-conditioned strategies, and a Thompson Sampling bandit for group selection — produced Sharpe 1.97 and +20.80% return over 275 OOS days while SPY returned 0%.

0. The problem: QR-DQN hit a plateau

I ran 13 rounds of QR-DQN experiments (R0 through R15, some skipped). The table below summarizes 12 of them:

Round   Symbols   Envs   best_eval   Notes
─────   ───────   ────   ─────────   ─────────────────────────────
R0      20        10     +1.203      baseline, tiny universe
R1      50        25     +2.441      more pairs helped
R2      100       48     +3.912      still scaling
R3      150       72     +4.150      diminishing returns start
R4      200       96     +4.380      marginal improvement
R5      100       44     +5.102      smaller, curated set
R6      92        44     +5.857      *** production model ***
R7      200       96     +4.201      back to large, worse
R8      300       140    +3.890      more data = worse
R10     449       213    +2.744      full universe, worst
R12     449       213    +2.901      retry with tuning, still bad
R15     100       48     +5.340      back to small, recovers

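The pattern is clear: small, curated universes beat large noisy ones. The three best scores (R5, R6, R15) all come from 92-100 symbols, and every run at 200 or more symbols scored below +4.4.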

R6 was the sweet spot — 92 symbols, 44 training environments, best eval reward +5.857. When I pushed to 449 symbols (213 envs), performance dropped to +2.744. The model was drowning in regime-mismatched pairs.

The key insight: pairs in the Tech sector behave nothing like pairs in Utilities. A single model trying to learn one policy across all sectors is forced to average over fundamentally different dynamics. The solution is not a bigger model. The solution is an ensemble.

1. Ensemble architecture overview

                        450 symbols
                            |
                    GICS sector split
                            |
            ┌───────────────┼───────────────┐
            v               v               v
      Tech_0 (28)    Health_0 (22)   Energy_0 (18)   ... 19 groups
            |               |               |
      HMM 3-state     HMM 3-state     HMM 3-state
      regime detect    regime detect    regime detect
            |               |               |
     ┌──────┼──────┐  ┌────┼──────┐  ┌────┼──────┐
     v      v      v  v    v      v  v    v      v
   Bull  Side   Bear  ...                          
     |      |      |
  Momentum QR-DQN Defensive
  (top 3)  pairs  shorts
     |      |      |
     └──────┼──────┘
            v
   Thompson Sampling Bandit
   (group-level arm selection)
            |
            v
     Portfolio Allocator

The pipeline has four layers:

GICS sector grouping — 450 symbols into 19 groups of 15-30 each
HMM regime detection — 3-state (Bull / Sideways / Bear) per group
Regime-conditioned strategies — different strategy per regime
Thompson Sampling bandit — learns which groups to trust

2. GICS sector grouping

Why GICS? Because pairs within the same sector share fundamental drivers. Tech stocks co-move on semiconductor demand, rate expectations, and AI capex. Energy stocks co-move on oil prices, OPEC decisions, and rig counts. Cross-sector pairs introduce noise that QR-DQN cannot model away.

SECTOR_GROUPS = {
    "Tech_0":       ["AAPL", "MSFT", "NVDA", "AMD", "INTC", ...],  # 28
    "Tech_1":       ["CRM", "ADBE", "NOW", "SNOW", "PLTR", ...],   # 24
    "Health_0":     ["JNJ", "UNH", "PFE", "MRK", "ABT", ...],     # 22
    "Health_1":     ["ISRG", "DXCM", "VEEV", "HIMS", ...],        # 18
    "Energy_0":     ["XOM", "CVX", "COP", "SLB", "EOG", ...],     # 18
    "Finance_0":    ["JPM", "BAC", "GS", "MS", "WFC", ...],       # 26
    "Finance_1":    ["AXP", "SCHW", "BLK", "ICE", ...],           # 20
    "Consumer_0":   ["AMZN", "TSLA", "HD", "NKE", "SBUX", ...],   # 25
    "Consumer_1":   ["PG", "KO", "PEP", "CL", "COST", ...],       # 22
    "Industrial_0": ["CAT", "DE", "HON", "GE", "RTX", ...],       # 24
    "Industrial_1": ["UPS", "FDX", "LMT", "NOC", ...],            # 19
    "Utility_0":    ["NEE", "DUK", "SO", "D", "AEP", ...],        # 16
    "Material_0":   ["LIN", "APD", "SHW", "ECL", "NEM", ...],     # 17
    "RealEstate_0": ["AMT", "PLD", "CCI", "EQIX", ...],           # 15
    "Comm_0":       ["GOOG", "META", "NFLX", "DIS", ...],         # 20
    "Comm_1":       ["T", "VZ", "TMUS", "CHTR", ...],             # 16
    "Biotech_0":    ["AMGN", "GILD", "REGN", "VRTX", ...],        # 18
    "Semicon_0":    ["TSM", "AVGO", "QCOM", "MU", "LRCX", ...],   # 22
    "Software_0":   ["ORCL", "INTU", "PANW", "FTNT", ...],        # 19
}
# Total: 19 groups, ~450 symbols (the per-group counts shown above sum to 389)

Each group has 15-30 symbols, which is well below R6's 92. The reasoning: if a curated set of 92 symbols (44 envs) beat the full universe, then each group should stay small and homogeneous instead of growing toward the size range where performance degraded.
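As a quick sanity check on group sizing, the sketch below validates that every group stays inside the 15-30 range and counts the candidate pairs each group produces. The helper name `check_group_sizes` and the toy groups are illustrative assumptions, not part of the project code.

```python
from math import comb

def check_group_sizes(groups, lo=15, hi=30):
    """Return {group: candidate_pair_count}; raise if a group is out of range."""
    pair_counts = {}
    for name, symbols in groups.items():
        n = len(set(symbols))           # de-duplicate tickers defensively
        if not lo <= n <= hi:
            raise ValueError(f"{name} has {n} symbols, expected {lo}-{hi}")
        pair_counts[name] = comb(n, 2)  # unordered pairs: n * (n - 1) / 2
    return pair_counts

# Toy groups with placeholder tickers (hypothetical, for illustration only)
toy_groups = {
    "Small_0": [f"S{i}" for i in range(15)],
    "Large_0": [f"L{i}" for i in range(30)],
}
pair_counts = check_group_sizes(toy_groups)
print(pair_counts)
```

A 15-symbol group yields 105 candidate pairs and a 30-symbol group yields 435, so even a small group needs a pair-selection step before anything reaches the agent.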

3. HMM regime detection

Each sector group gets its own 3-state Hidden Markov Model, trained on three rolling features derived from the group's daily returns (listed after the code).

import numpy as np
from hmmlearn.hmm import GaussianHMM

class SectorRegimeDetector:
    def __init__(self, n_states=3, lookback=252):
        self.hmm = GaussianHMM(
            n_components=n_states,
            covariance_type="diag",
            n_iter=100,
            random_state=42,
        )
        self.lookback = lookback
        self.state_labels = {}  # HMM state index -> regime name, set in fit()

    def fit(self, group_features: np.ndarray):
        """group_features: (T, 3) array of [mean_ret, corr, vol] per day."""
        self.hmm.fit(group_features[-self.lookback:])
        self._label_states()

    def _label_states(self):
        """Label states by mean return: highest=Bull, lowest=Bear, middle=Sideways"""
        means = self.hmm.means_[:, 0]  # feature 0 = mean return
        order = np.argsort(means)
        self.state_labels = {
            int(order[0]): "Bear",
            int(order[1]): "Sideways",
            int(order[2]): "Bull",
        }

    def predict_regime(self, recent_features: np.ndarray) -> str:
        # Decode the last 20 rows with all 3 feature columns intact;
        # reshaping to (-1, 1) would not match the fitted feature count.
        state = self.hmm.predict(recent_features[-20:])[-1]
        return self.state_labels[int(state)]

The HMM features per group:

Feature 0: equal-weighted mean daily return (20d rolling)
Feature 1: mean pairwise correlation (20d rolling)
Feature 2: group volatility (20d rolling stdev of returns)

The correlation feature is important. In bear markets, correlations spike (everyone sells everything). In bull markets, correlations drop (stock picking matters). This feature alone improves regime classification accuracy by ~12% over using returns only.
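The three features can be built from a (T, N) matrix of daily returns roughly as follows. This is a minimal sketch, assuming plain NumPy arrays and a 20-day window; the function name `build_hmm_features` and the synthetic data are hypothetical and not taken from the project code.

```python
import numpy as np

def build_hmm_features(returns: np.ndarray, window: int = 20) -> np.ndarray:
    """returns: (T, N) daily returns for N symbols in one group.

    Output: (T - window + 1, 3) array of
    [rolling mean of equal-weighted return, mean pairwise corr, rolling vol].
    """
    T, n = returns.shape
    ew = returns.mean(axis=1)                  # equal-weighted group return per day
    upper = np.triu_indices(n, k=1)            # index each symbol pair exactly once
    rows = []
    for end in range(window, T + 1):
        win = returns[end - window:end]        # last `window` days, all symbols
        corr = np.corrcoef(win, rowvar=False)  # (N, N) correlation matrix
        rows.append([
            ew[end - window:end].mean(),       # feature 0: mean return
            np.nanmean(corr[upper]),           # feature 1: mean pairwise correlation
            ew[end - window:end].std(ddof=1),  # feature 2: group volatility
        ])
    return np.array(rows)

rng = np.random.default_rng(0)
common = rng.normal(0.0, 0.01, size=(60, 1))   # shared sector factor
noise = rng.normal(0.0, 0.01, size=(60, 5))    # idiosyncratic noise
features = build_hmm_features(common + noise)  # synthetic returns, 5 symbols
print(features.shape)
```

The column order matches the feature list above, so column 0 is the one the detector uses to label states.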

Regime transition matrix (fitted on Tech_0)

             To:
From:     Bull    Side    Bear
Bull    [ 0.92    0.06    0.02 ]
Side    [ 0.08    0.84    0.08 ]
Bear    [ 0.03    0.12    0.85 ]

Mean durations: Bull=12.5d, Sideways=6.3d, Bear=6.7d

Bull regimes are sticky (0.92 self-transition). Bear regimes are shorter but sharp. Sideways has the lowest self-transition probability (0.84) and is where pair trading works best, since mean reversion is strongest when the market is range-bound.
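The mean durations follow directly from the diagonal of the transition matrix: the time spent in a state before leaving it is geometrically distributed, so its expected length is 1 / (1 - p_ii). A small sketch, where the helper name `expected_durations` is illustrative:

```python
def expected_durations(transition, labels):
    """Expected run length of each state: 1 / (1 - p_ii) for a Markov chain."""
    return {
        label: 1.0 / (1.0 - transition[i][i])
        for i, label in enumerate(labels)
    }

# Transition matrix fitted on Tech_0 (rows: from, columns: to)
P = [
    [0.92, 0.06, 0.02],  # Bull
    [0.08, 0.84, 0.08],  # Sideways
    [0.03, 0.12, 0.85],  # Bear
]
durations = expected_durations(P, ["Bull", "Sideways", "Bear"])
print({k: round(v, 2) for k, v in durations.items()})
```

Rounded to one decimal, these reproduce the 12.5, 6.3 and 6.7 day figures quoted above.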

4. Regime-conditioned strategies

Each regime dispatches to a different strategy:

Bull regime: Momentum (top 3 by 20d return)

def bull_strategy(group_symbols, price_data):
    """In bull markets, ride momentum — pair trading underperforms."""
    returns_20d = {}
    for sym in group_symbols:
        ret = (price_data[sym][-1] / price_data[sym][-21] - 1)  # 20 trading days back
        returns_20d[sym] = ret

    top_3 = sorted(returns_20d, key=returns_20d.get, reverse=True)[:3]
    return [Signal(sym=s, direction="long", weight=1/3) for s in top_3]

Why not pair trade in bull markets? Because in a strong uptrend, mean-reversion signals get crushed by momentum. The spread widens and keeps widening. R6 backtest confirmed this: win rate dropped from 58% to 41% during bull regimes.

Sideways regime: QR-DQN pair trading (R6 production model)

This is where the R6 model shines. Sideways markets have mean-reverting spreads, moderate volatility, and stable correlations.

def sideways_strategy(group_symbols, price_data, qrdqn_agent):
    """Sideways -> QR-DQN pair trading, the sweet spot."""
    pairs = select_cointegrated_pairs(group_symbols, price_data, top_k=5)
    signals = []

    for sym_a, sym_b in pairs:
        state_28d = build_state_vector(sym_a, sym_b, price_data)
        # state: [spread, z_score, half_life, vol_ratio, corr_60d,
        #         rsi_a, rsi_b, macd_a, macd_b, bb_pos_a, bb_pos_b,
        #         volume_ratio_a, volume_ratio_b, atr_a, atr_b,
        #         sector_momentum, vix, rate_spread, ...]  -> 28 dims

        action = qrdqn_agent.act(state_28d)    # 0=hold, 1=long, 2=short
        cvar   = qrdqn_agent.cvar(state_28d, action, alpha=0.05)
        unc    = qrdqn_agent.uncertainty(state_28d, action)

        # Confidence scaling: high uncertainty -> smaller position
        confidence = max(0.2, 1.0 - unc / 2.0)

        if action == 1:  # long spread
            signals.append(Signal(
                sym_long=sym_a, sym_short=sym_b,
                confidence=confidence,
                cvar=cvar,
            ))
        elif action == 2:  # short spread
            signals.append(Signal(
                sym_long=sym_b, sym_short=sym_a,
                confidence=confidence,
                cvar=cvar,
            ))

    return signals

Bear regime: Defensive shorts (bottom 2 by momentum)

def bear_strategy(group_symbols, price_data):
    """In bear markets, short the weakest names in the group."""
    returns_20d = {}
    for sym in group_symbols:
        ret = (price_data[sym][-1] / price_data[sym][-21] - 1)  # 20 trading days back
        returns_20d[sym] = ret

    bottom_2 = sorted(returns_20d, key=returns_20d.get)[:2]
    return [Signal(sym=s, direction="short", weight=1/2) for s in bottom_2]

Fallback: z-score threshold

When the QR-DQN model is unavailable (missing checkpoint, dimension mismatch, corrupt file), the sideways strategy falls back to a classic z-score threshold:

def zscore_fallback(sym_a, sym_b, price_data, entry=2.0, exit=0.5):
    """Classic stat-arb fallback when QR-DQN is unavailable.

    Only the entry rule is implemented here; `exit` is accepted but unused.
    """
    spread = compute_spread(sym_a, sym_b, price_data)
    z = (spread[-1] - spread.mean()) / (spread.std() + 1e-8)  # guard against zero std

    if z > entry:     # spread is rich: short A, long B
        return Signal(sym_long=sym_b, sym_short=sym_a, confidence=0.5)
    elif z < -entry:  # spread is cheap: long A, short B
        return Signal(sym_long=sym_a, sym_short=sym_b, confidence=0.5)
    return None

5. QR-DQN integration details

The R6 production model specifics:

Architecture:    MLP 28 -> 128 -> 128 -> 3*51
State dim:       28
Actions:         3 (hold / long_spread / short_spread)
Quantiles:       51 (tau_i = (2i-1) / (2*51), i=1..51)
best_eval:       +5.857
Training envs:   44
Training steps:  500K
Optimizer:       Adam, lr=6.25e-5
Batch size:      32
Replay buffer:   100K, PER alpha=0.6, beta annealed 0.4->1.0
n-step:          3
Gamma:           0.99
Target update:   every 8000 steps
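The quantile fractions and the size of the output layer follow from the numbers above. The sketch below only restates that arithmetic; the function name `quantile_midpoints` is illustrative.

```python
N_QUANTILES = 51
N_ACTIONS = 3

def quantile_midpoints(n):
    """tau_i = (2i - 1) / (2n) for i = 1..n: midpoints of n equal-probability bins."""
    return [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]

taus = quantile_midpoints(N_QUANTILES)
output_units = N_ACTIONS * N_QUANTILES  # one quantile vector per action
print(taus[0], taus[25], taus[-1], output_units)
```

With an odd number of quantiles the middle fraction is exactly 0.5, so the median of each action's return distribution is one of the network outputs.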

Building the 28-dim state vector

import numpy as np
from scipy.stats import kurtosis, skew  # used for features 21-24

def build_state_vector(sym_a, sym_b, data, lookback=60):
    """Build the 28-dimensional state vector from raw price data."""
    pa, pb = data[sym_a][-lookback:], data[sym_b][-lookback:]

    spread = pa / pb
    z_score = (spread[-1] - spread.mean()) / (spread.std() + 1e-8)
    half_life = calc_half_life(spread)

    state = np.array([
        spread[-1],                          # 0: raw spread
        z_score,                             # 1: z-score
        half_life,                           # 2: half-life of mean reversion
        pa.std() / (pb.std() + 1e-8),        # 3: volatility ratio
        np.corrcoef(pa, pb)[0, 1],           # 4: correlation (60d)
        calc_rsi(pa, 14),                    # 5: RSI sym_a
        calc_rsi(pb, 14),                    # 6: RSI sym_b
        calc_macd(pa),                       # 7: MACD sym_a
        calc_macd(pb),                       # 8: MACD sym_b
        calc_bb_position(pa),                # 9: Bollinger band position sym_a
        calc_bb_position(pb),                # 10: BB position sym_b
        volume_ratio(data, sym_a),           # 11: volume ratio sym_a (today/20d avg)
        volume_ratio(data, sym_b),           # 12: volume ratio sym_b
        calc_atr(data, sym_a, 14),           # 13: ATR sym_a
        calc_atr(data, sym_b, 14),           # 14: ATR sym_b
        sector_momentum(data, sym_a),        # 15: sector momentum
        data["VIX"][-1],                     # 16: VIX
        data["RATE_SPREAD"][-1],             # 17: 10y-2y rate spread
        np.corrcoef(pa[-20:], pb[-20:])[0,1],# 18: short-term corr (20d)
        spread[-5:].mean() - spread.mean(),  # 19: spread momentum (5d)
        calc_hurst(spread),                  # 20: Hurst exponent
        skew(np.diff(np.log(pa))),           # 21: return skewness sym_a
        skew(np.diff(np.log(pb))),           # 22: return skewness sym_b
        kurtosis(np.diff(np.log(pa))),       # 23: return kurtosis sym_a
        kurtosis(np.diff(np.log(pb))),       # 24: return kurtosis sym_b
        calc_adx(data, sym_a, 14),           # 25: ADX sym_a
        calc_adx(data, sym_b, 14),           # 26: ADX sym_b
        data["SPY_RET_20D"][-1],             # 27: market regime proxy
    ])
    return state
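The helpers called above are not shown in this post. As an illustration of one of them, here is a minimal sketch of a half-life estimate, assuming the common AR(1) definition (regress the daily change in the spread on its lagged level). Whether the project's `calc_half_life` works this way is an assumption.

```python
import numpy as np

def calc_half_life(spread) -> float:
    """Half-life of mean reversion from an AR(1) fit:
    spread[t] - spread[t-1] = a + b * spread[t-1] + noise.
    """
    s = np.asarray(spread, dtype=float)
    lagged, delta = s[:-1], np.diff(s)
    b, _a = np.polyfit(lagged, delta, 1)     # slope b, intercept a
    if b >= 0 or b <= -1:                    # no mean reversion, or oscillating fit
        return float("inf")
    return float(-np.log(2) / np.log1p(b))   # phi = 1 + b, half-life = ln(0.5) / ln(phi)

# Noise-free AR(1) decay toward a mean of 1.0 with phi = 0.9
phi, mean = 0.9, 1.0
series = [mean + 0.5 * phi ** t for t in range(60)]
half_life = calc_half_life(series)
print(round(half_life, 2))
```

A non-reverting spread returns infinity here, so a production version would need to cap or transform the value before it goes into the state vector.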

Using CVaR and uncertainty for sizing

# After getting action from agent
cvar_5pct = qrdqn_agent.cvar(state, action, alpha=0.05)
uncertainty = qrdqn_agent.uncertainty(state, action)

# CVaR-based sizing (from previous post)
skip_trade = False
if cvar_5pct < -0.15:  # veto threshold
    skip_trade = True
    size_mult = 0.0    # defined so the discount below cannot raise NameError
elif cvar_5pct < -0.05:  # target threshold: scale linearly from 1.0 down to the 0.2 floor
    size_mult = max(0.2, 1.0 - (-0.05 - cvar_5pct) / 0.10)
else:
    size_mult = 1.0

# Uncertainty discount (a vetoed trade stays at 0.0)
size_mult *= max(0.2, 1.0 - uncertainty / 2.0)
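For readers who have not seen the previous post, the sketch below shows one way a CVaR figure can be read off a vector of return quantiles, and wraps the sizing rule in a testable function. How `qrdqn_agent.cvar` is actually implemented is not shown here, so `cvar_from_quantiles`, `size_multiplier` and the sample quantile values are assumptions for illustration.

```python
import math

def cvar_from_quantiles(quantiles, alpha=0.05):
    """Mean of the worst alpha-fraction of equally weighted return quantiles."""
    ordered = sorted(quantiles)
    k = max(1, math.ceil(alpha * len(ordered)))  # use at least one quantile
    return sum(ordered[:k]) / k

def size_multiplier(cvar, uncertainty, veto=-0.15, target=-0.05, floor=0.2):
    """Same thresholds as the snippet above; returns 0.0 when the trade is vetoed."""
    if cvar < veto:
        return 0.0
    if cvar < target:
        cvar_mult = max(floor, 1.0 - (target - cvar) / (target - veto))
    else:
        cvar_mult = 1.0
    return cvar_mult * max(floor, 1.0 - uncertainty / 2.0)

# 51 hypothetical quantile values, evenly spaced from -0.12 to +0.13
quantiles = [-0.12 + 0.005 * i for i in range(51)]
cvar = cvar_from_quantiles(quantiles)
print(round(cvar, 4), round(size_multiplier(cvar, uncertainty=0.4), 3))
```

With 51 quantiles and alpha = 0.05, the estimate averages the three lowest quantiles, so it is a coarse tail measure and sensitive to how well the lowest quantiles are learned.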

6. Thompson Sampling bandit

The final layer: which sector groups should we actually trade? Not all 19 groups produce profitable signals at any given time.

The idea

Each group is a bandit arm. We model each arm's success probability with a Beta distribution. After each trade, we update the posterior:

Trade PnL > 0  ->  reward = 1  ->  alpha += 1
Trade PnL <= 0 ->  reward = 0  ->  beta  += 1

At each timestep, we sample from each arm's Beta(alpha, beta) and trade the top-K groups with the highest samples.

import numpy as np

class ThompsonSamplingBandit:
    def __init__(self, arms: list[str], top_k: int = 5):
        self.arms = arms
        self.top_k = top_k
        # Prior: Beta(1, 1) = uniform
        self.alpha = {arm: 1.0 for arm in arms}
        self.beta  = {arm: 1.0 for arm in arms}

    def select_arms(self) -> list[str]:
        """Sample from each arm's posterior and pick top_k."""
        samples = {}
        for arm in self.arms:
            samples[arm] = np.random.beta(self.alpha[arm], self.beta[arm])
        ranked = sorted(samples, key=samples.get, reverse=True)
        return ranked[:self.top_k]

    def update(self, arm: str, reward: float):
        """Update posterior with trade outcome."""
        if reward > 0:
            self.alpha[arm] += 1.0
        else:
            self.beta[arm] += 1.0

    def stats(self) -> dict:
        """Return posterior mean and confidence for each arm."""
        result = {}
        for arm in self.arms:
            a, b = self.alpha[arm], self.beta[arm]
            result[arm] = {
                "mean": a / (a + b),
                "std": np.sqrt(a * b / ((a+b)**2 * (a+b+1))),
                "trades": int(a + b - 2),
            }
        return result

Why Thompson Sampling over epsilon-greedy or UCB?

Epsilon-greedy explores uniformly — wastes trades on arms that are clearly bad.
UCB requires tuning the exploration constant and is deterministic — in a non-stationary environment (financial markets), you want stochastic exploration.
Thompson Sampling naturally balances explore/exploit through posterior sampling. Arms with few observations have wide posteriors (high variance samples), so they get explored. Arms with many observations converge to their true win rate.
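One caveat for non-stationary markets: with cumulative counts, a Beta posterior only narrows as trades accumulate, so an arm that worked for months is slow to lose its standing. A common workaround, which is not what the system described here does, is to discount old evidence. The sketch below is a hypothetical variant; the class name `DiscountedBetaArm` and gamma = 0.95 are arbitrary illustrative choices.

```python
import random

class DiscountedBetaArm:
    """Beta posterior whose evidence decays by `gamma` on every update."""

    def __init__(self, gamma=0.95):
        self.gamma = gamma
        self.wins = 0.0    # discounted count of winning trades
        self.losses = 0.0  # discounted count of losing trades

    def update(self, won: bool):
        self.wins = self.gamma * self.wins + (1.0 if won else 0.0)
        self.losses = self.gamma * self.losses + (0.0 if won else 1.0)

    def sample(self, rng=random):
        # Beta(1, 1) prior plus the discounted evidence
        return rng.betavariate(1.0 + self.wins, 1.0 + self.losses)

    def effective_trades(self):
        return self.wins + self.losses

arm = DiscountedBetaArm(gamma=0.95)
for _ in range(500):
    arm.update(won=True)   # long winning streak
for _ in range(30):
    arm.update(won=False)  # the group then stops working
posterior_mean = (1.0 + arm.wins) / (2.0 + arm.effective_trades())
print(round(arm.effective_trades(), 2), round(posterior_mean, 3))
```

The effective sample size is capped near 1 / (1 - gamma), so the posterior stays wide enough to react: after 30 straight losses the mean is already below 0.5, where cumulative counts would still sit above 0.9.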

Learned arm quality (after 275 OOS days)

Group           alpha    beta   mean    std     trades
────────────    ─────    ────   ─────   ─────   ──────
Tech_0           68       54    0.557   0.045     120
Semicon_0        59       48    0.551   0.048     105
Software_0       52       44    0.542   0.051      94
Comm_0           48       42    0.533   0.053      88
Finance_0        55       50    0.524   0.049     103
Consumer_0       47       44    0.516   0.052      89
Industrial_0     43       42    0.506   0.053      83
Tech_1           39       39    0.500   0.057      76
Consumer_1       36       37    0.493   0.058      71
Material_0       31       33    0.484   0.063      62
RealEstate_0     28       31    0.475   0.065      57
Finance_1        30       34    0.469   0.063      62
Utility_0        25       30    0.455   0.067      53
Industrial_1     24       30    0.444   0.068      52
Energy_0         27       35    0.435   0.065      60
Comm_1           22       30    0.423   0.069      50
Biotech_0        20       30    0.400   0.070      48
Health_1         18       29    0.383   0.071      45
Health_0         18       30    0.375   0.070      46

Tech_0 has the highest posterior mean (0.557) and Health_0 the lowest (0.375). Because the reward is binary, this ranks groups by win rate across whichever regime strategy was active, not by PnL size. The bandit automatically reduces allocation to Health and increases allocation to Tech/Semicon over time.

7. Putting it all together

The daily loop:

def daily_ensemble_step(date, price_data, sector_groups, regime_detectors,
                        qrdqn_agent, bandit, portfolio):
    # 1. Select top-K groups via Thompson Sampling
    active_groups = bandit.select_arms()  # top 5

    all_signals = []
    for group_name in active_groups:
        symbols = sector_groups[group_name]
        detector = regime_detectors[group_name]

        # 2. Detect current regime
        group_returns = compute_group_returns(symbols, price_data)
        regime = detector.predict_regime(group_returns)

        # 3. Dispatch to regime-conditioned strategy
        if regime == "Bull":
            signals = bull_strategy(symbols, price_data)
        elif regime == "Sideways":
            try:
                signals = sideways_strategy(symbols, price_data, qrdqn_agent)
            except ModelError:
                signals = zscore_fallback_batch(symbols, price_data)
        elif regime == "Bear":
            signals = bear_strategy(symbols, price_data)

        for s in signals:
            s.group = group_name
            s.regime = regime
        all_signals.extend(signals)

    # 4. Execute signals and record PnL
    for signal in all_signals:
        pnl = portfolio.execute(signal)
        bandit.update(signal.group, pnl)

8. Backtest results

Setup

Period:          2025-03-15 to 2025-12-15 (275 calendar days)
Universe:        450 symbols, 19 GICS groups
Initial capital: 1,000,000 USD
Max positions:   20 concurrent
Position sizing: CVaR-scaled, max 5% per position
Slippage:        5 bps per side
Commission:      1 bp per side
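To put the cost assumptions in dollar terms, the sketch below computes the round-trip cost of a trade at the position cap. It assumes the 5% cap applies to each leg of a pair, which the setup does not state; the function name `round_trip_cost` is illustrative.

```python
def round_trip_cost(notional_per_leg, legs=2, slippage_bps=5.0, commission_bps=1.0):
    """Total entry + exit cost in USD for a trade with `legs` equal-sized legs."""
    per_side_bps = slippage_bps + commission_bps  # 6 bps each time a leg trades
    round_trip_bps = 2 * per_side_bps             # enter and exit
    return legs * notional_per_leg * round_trip_bps / 10_000

max_position = 0.05 * 1_000_000                      # 5% cap on 1,000,000 USD
pair_cost = round_trip_cost(max_position)            # long leg + short leg
single_cost = round_trip_cost(max_position, legs=1)  # momentum or short-only trade
print(pair_cost, single_cost)
```

Under these assumptions a full-size pair trade costs about 120 USD per round trip (12 bps of each leg), which is the hurdle every mean-reversion signal has to clear.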

Equity curve (ASCII)

Portfolio NAV (normalized to 100)
 122 |                                                        *****
 120 |                                                   ****
 118 |                                              *****
 116 |                                         ****
 114 |                                    ****
 112 |                               ****
 110 |                          *****
 108 |                     ****
 106 |                *****
 104 |           ****
 102 |      *****
 100 |*****          SPY (flat, ~100 whole period)
  98 |----+----+----+----+----+----+----+----+----+----+----+
     Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
     2025

Summary statistics

Metric                    Ensemble     SPY B&H     QR-DQN only
────────────────────      ────────     ───────     ───────────
Total Return              +20.80%        0.0%        +8.12%
Annualized Return         +28.54%        0.0%       +11.14%
Sharpe Ratio                1.97         0.00          1.12
Sortino Ratio               2.84         0.00          1.48
Max Drawdown               4.93%          -           7.21%
Win Rate                   56.1%          -           52.3%
Avg Win / Avg Loss          1.42          -            1.28
Profit Factor               1.82          -            1.40
Total Trades               1,847          -             612
Avg Holding Period         3.2 days       -           4.1 days
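Two of the rows can be cross-checked from the others. The sketch below assumes geometric annualization on a 365-day year and the standard identity between win rate, payoff ratio and profit factor; the function names are illustrative.

```python
def annualize(total_return, days, days_per_year=365):
    """Geometric annualization of a total return earned over `days`."""
    return (1.0 + total_return) ** (days_per_year / days) - 1.0

def profit_factor(win_rate, avg_win_over_avg_loss):
    """Gross profit / gross loss implied by win rate and payoff ratio."""
    return win_rate * avg_win_over_avg_loss / (1.0 - win_rate)

ensemble_annual = annualize(0.2080, 275)
ensemble_pf = profit_factor(0.561, 1.42)
qrdqn_pf = profit_factor(0.523, 1.28)
print(f"{ensemble_annual:.4f} {ensemble_pf:.2f} {qrdqn_pf:.2f}")
```

The ensemble's annualized return comes out within rounding of the table only when the 275 days are treated as calendar days, and both profit factors agree with the table to within the rounding of the inputs.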

Monthly breakdown

Month       Return    Trades    Win%    Sharpe    Regime Mix (B/S/Bear)
────────    ──────    ──────    ────    ──────    ─────────────────────
2025-03*    +1.12%       89    54.0%     1.45    30% / 55% / 15%
2025-04     +2.34%      198    57.1%     2.21    25% / 60% / 15%
2025-05     +2.58%      215    58.6%     2.44    20% / 65% / 15%
2025-06     +1.89%      192    55.2%     1.78    35% / 45% / 20%
2025-07     +2.71%      224    59.4%     2.62    15% / 70% / 15%
2025-08     +1.45%      178    52.8%     1.32    40% / 35% / 25%
2025-09     +2.12%      201    56.7%     2.05    25% / 55% / 20%
2025-10     +2.88%      218    58.3%     2.71    20% / 65% / 15%
2025-11     +1.92%      186    54.8%     1.83    30% / 50% / 20%
2025-12*    +1.79%      146    55.5%     1.89    25% / 60% / 15%

* partial month

Key observations

Returns accrue steadily over time. Monthly returns stay between +1.1% and +2.9% and are not front-loaded, which is consistent with (though not proof of) a strategy that is not overfit to one sub-period.
Win rate is modest (56.1%), but the avg win/loss ratio (1.42) does the work. This is a classic pattern: RL systems don't need to be right often, they need to be right big.
The ensemble beats QR-DQN alone by about 2.6x on return (+20.80% vs +8.12%). The regime detection layer prevents pair trading in regimes where it fails (bull momentum, bear crashes).
MDD of 4.93% is about a third lower than QR-DQN alone (7.21%). The bear strategy and CVaR sizing work as intended.
Sideways is the most common regime in 9 of 10 months, with a share between 35% and 70% (August, at 35%, is the exception). Most of the market is range-bound most of the time, and the ensemble profits from this structural reality.

9. Lessons learned

1. Small high-quality datasets beat large noisy ones. R6 at 44 envs beat R10 at 213 envs by about 2.1x (+5.857 vs +2.744). Sector grouping turns the 450-symbol universe back into 19 small, homogeneous problems.

2. Regime detection is table stakes. Without HMM, the QR-DQN model is deployed in bull and bear regimes where pair trading structurally fails. The HMM acts as a gatekeeper.

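3. Thompson Sampling is a good fit for group selection. Sectors rotate in and out of profitability, and posterior sampling keeps giving lightly traded groups (wide posteriors) a chance while concentrating on groups that are currently working. One caveat: with the cumulative Beta counts used here, a posterior only narrows as trades accumulate, so the bandit adapts slowly to rotation unless old observations are discounted.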

4. Fallbacks are not optional. The z-score fallback fires ~8% of the time (model loading failures, dimension mismatches after feature changes, NaN states). Without it, those would be missed trades in the best regime (sideways).

5. The ensemble makes the RL component more valuable, not less. QR-DQN is deployed only where it works (sideways regimes in high-quality sector groups). Standalone, its win rate is 52.3%; in the ensemble, a sideways-heavy month such as May 2025 reaches 58.6% (all strategies combined, so not a like-for-like figure).

10. What's next

Online HMM updates — currently refitting weekly, want daily incremental updates
Per-group QR-DQN — train separate models per sector group instead of one shared model
Multi-armed contextual bandit — replace Thompson Sampling with LinUCB that conditions on macro features (VIX, rate curve)
Live paper trading — deploy on IBKR paper account for 90 days before going live

The code is in app/trading/rl/ensemble/ and the backtest runner is scripts/run_ensemble_backtest.py.

References

Dabney et al., "Distributional Reinforcement Learning with Quantile Regression," AAAI 2018
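Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE 1989
Thompson, "On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples," Biometrika 1933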
Chapelle & Li, "An Empirical Evaluation of Thompson Sampling," NeurIPS 2011