Gradient-Boosted Scalping on ES Futures
A multi-model approach to automated trade signal generation, combining tick-level LSTMs, gradient-boosted trees, and bar-level time-series models with adaptive risk management and systematic ablation-driven development.
Abstract
We present a hybrid machine learning system for intraday trading of E-mini S&P 500 futures (ES), deployed in live production on CME via Rithmic execution. The system combines three independently-trained models -- an LSTM for tick-level pattern recognition (5,537 parameters, 64.5% accuracy), a LightGBM classifier on 54 engineered features (69.5% win rate solo, 2x ensemble vote weight), and a bar-level ONNX LSTM for 1-minute time-series prediction (53K parameters, highest solo P&L at +$4,494) -- into a weighted ensemble that requires minimum agreement before any signal reaches execution.
The system processes 54 tick-level features and 11 bar-level technical indicators through a 10-stage decision pipeline with 30+ sequential entry gates, achieving a 92.6% signal rejection rate. Only signals surviving all gates reach the exchange.
Development follows an ablation-first methodology: every model and rule-based strategy is validated through systematic ablation studies on historical data before deployment. Three rule-based strategies (Momentum, Mean Revert, STAT_ARB) were added and subsequently removed after ablation proved they degraded ML ensemble performance. The resulting ML-only configuration achieves 67.6% win rate, 1.80 profit factor, and 16.25x return-to-max-drawdown ratio across 447 simulated trades.
Key claim: In this domain, ML models consistently outperform when trading alone. Every rule-based heuristic tested -- momentum, mean-reversion, and statistical arbitrage -- degraded ensemble performance when measured through ablation. The edge comes from letting the models decide, not from layering human intuition on top.
Data & Feature Engineering
The system ingests tick-level market data from CME E-mini S&P 500 futures (ES) via Rithmic, captured by a Sierra Chart DLL (T29-LocalBridge_v39.cpp) that writes two CSV streams in real time. The Python bridge reads these files in a 20ms polling loop.
The training dataset consists of ~200K raw ticks collected from live ES H6 futures trading, supplemented by a 614K tick backfill (ticks_RGW.csv). All three models were retrained on February 26, 2026 using the 200K tick dataset with temporal train/validation splits (80/20) and no future-peeking.
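A temporal 80/20 split of this kind can be sketched in a few lines. This is a minimal illustration, not the project's training code; the function name `temporal_split` and the stand-in data are assumptions.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_split(rows: Sequence[T], train_frac: float = 0.8) -> Tuple[Sequence[T], Sequence[T]]:
    """Split time-ordered rows into train/validation without shuffling.

    Every validation row is later in time than every training row,
    so the model never sees future information during training.
    """
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must be in (0, 1)")
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]


# Stand-in for ~200K time-ordered ticks (hypothetical data: just indices).
ticks = list(range(200_000))
train, valid = temporal_split(ticks)
```

The key property is that the cut is a single index on time-ordered data; any shuffling before the cut would leak future ticks into training.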
The feature engine (features.py) computes 54 tick-level features on every incoming tick, organized into the following categories:
Price Dynamics
12 features: Multi-scale returns (5t, 10t, 20t, 50t, 100t), realized volatility at multiple horizons, momentum indicators, price acceleration, and velocity of price change.
Microstructure
15 features: Bid-ask spread, trade flow imbalance, order flow toxicity (VPIN proxy), DOM imbalance across 10 levels, volume surge detection, and trade intensity metrics.
Session Context
8 features: Position within session range, distance from VWAP, distance from session high/low, range expansion rate, and session-relative volume metrics.
Time Features
19 features: Cyclical encoding of hour/minute (sin/cos), session flags (RTH/ETH), time-to-session-close, day-of-week encoding, and calendar event proximity indicators.
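The sin/cos encoding in the Time Features group can be illustrated as below. This is a generic sketch of cyclical encoding, assuming minutes-since-midnight as the input; the function names are hypothetical and not taken from features.py.

```python
import math
from typing import Tuple


def cyclical_encode(value: float, period: float) -> Tuple[float, float]:
    """Map a periodic quantity onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def encode_time_of_day(hour: int, minute: int) -> Tuple[float, float]:
    """Encode a clock time so that 23:59 and 00:00 end up close together."""
    minutes_since_midnight = hour * 60 + minute
    return cyclical_encode(minutes_since_midnight, 24 * 60)
```

The point of the encoding is continuity across the day boundary: a raw hour value jumps from 23 to 0, while the (sin, cos) pair moves only a small step.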
All tick features are normalized to a [-10, 10] range using rolling statistics. Additionally, an 11-feature bar-level pipeline aggregates 1-minute OHLCV bars and computes standard technical indicators used exclusively by the ONNX model.
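One common way to normalize with rolling statistics is a rolling z-score clipped to the target range. The sketch below assumes that approach and a 500-tick window. Neither is confirmed by the source, and `RollingNormalizer` is a hypothetical name.

```python
import math
from collections import deque


class RollingNormalizer:
    """Rolling z-score clipped to [-clip, +clip] (assumed method and window)."""

    def __init__(self, window: int = 500, clip: float = 10.0) -> None:
        self.values = deque(maxlen=window)
        self.clip = clip

    def update(self, x: float) -> float:
        """Add a new observation and return its clipped z-score."""
        self.values.append(x)
        n = len(self.values)
        mean = sum(self.values) / n
        variance = sum((v - mean) ** 2 for v in self.values) / n
        std = math.sqrt(variance)
        if std == 0.0:
            return 0.0  # flat window: no information, return neutral value
        z = (x - mean) / std
        return max(-self.clip, min(self.clip, z))
```

This version recomputes the mean and variance on every tick, which costs O(window) per update; a production feature engine would more likely keep running sums.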
The PSO (Oscillator Power) indicator is computed from a 500-tick stochastic oscillator smoothed by EMA-25, producing a value in [-1.0, +1.0]. PSO drives the position sizing tier system and provides momentum alignment context to the ensemble.
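The exact PSO formula is not given here, so the following is only one plausible reading of the description: a stochastic %K over the last 500 ticks, rescaled from [0, 1] to [-1, +1], then smoothed with an EMA of span 25. The class name, the rescaling, and the EMA seeding are all assumptions.

```python
from collections import deque


class PsoSketch:
    """Stochastic oscillator over a tick window, EMA-smoothed, in [-1, +1]."""

    def __init__(self, lookback: int = 500, ema_span: int = 25) -> None:
        self.prices = deque(maxlen=lookback)
        self.alpha = 2.0 / (ema_span + 1.0)  # standard EMA smoothing factor
        self.value = 0.0
        self._seeded = False

    def update(self, price: float) -> float:
        """Feed one tick price and return the smoothed oscillator value."""
        self.prices.append(price)
        low, high = min(self.prices), max(self.prices)
        # Position of price in the window range, rescaled to [-1, +1].
        raw = 0.0 if high == low else 2.0 * (price - low) / (high - low) - 1.0
        if not self._seeded:
            self.value = raw  # seed the EMA with the first observation
            self._seeded = True
        else:
            self.value += self.alpha * (raw - self.value)
        return self.value
```

Because each raw value lies in [-1, +1] and the EMA is a convex combination of raw values, the smoothed output stays in the same range.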
Model Architectures
The ensemble comprises three independently-trained models, each designed to capture different temporal scales and feature representations. All models produce directional signals (BUY/SELL) with confidence scores. No single model can trigger a trade alone.
The Founder
The original model that proved an edge exists in this data. A compact LSTM operating on raw tick-aggregated OHLCV bars with bracket-based labeling.
Input: 7-bar lookback of OHLCV data aggregated from 20-tick bars. Labeling: symmetric bracket (+/-5.0 pts, 2.5 pt threshold). Output: tanh activation producing BUY/SELL signal.
1x vote
The Sniper
Fires rarely, but when it speaks, the ensemble listens. 200-tree gradient-boosted classifier across all 54 tick features with 2x vote weight.
Input: full 54-feature vector computed at each tick. Labeling: trend-based with 100-tick lookahead. Output: predict_proba producing STRONG_BUY/BUY/SELL signals. Solo performance: 131 trades, PF 1.87.
2x vote (highest quality)
The Workhorse
A larger LSTM operating on 1-minute bar aggregations with technical analysis features. The volume earner that sees the bigger picture.
Input: 60-bar lookback of 11 TA features on 1-minute OHLCV. Architecture: 2-layer LSTM(64) with dropout. Output: sigmoid activation, 30-minute prediction horizon. Solo: 278 trades, +$4,494 net.
1x vote
Training methodology: All models use temporal train/validation splits (80/20) with strict temporal ordering -- no shuffling, no future information leakage. F1 and ONNX use class-balanced sampling to handle the inherent imbalance between directional labels. F2 LightGBM uses scale_pos_weight for class balance. All three models were retrained on February 26, 2026 on 200K ticks, producing measurable accuracy improvements: F1 64.5% (was 55.5%), F2 54.6% (was 52.9%), ONNX 54.3% (was 53.8%).
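The stated size of the bar-level model (53K parameters) can be sanity-checked against its described shape: 11 input features, two stacked LSTM layers of 64 units, and a sigmoid output. The sketch below assumes a single-unit dense output head and counts parameters under both common bias conventions (one bias vector per gate, as in Keras, or two, as in PyTorch). Dropout adds no parameters. The helper names are hypothetical.

```python
def lstm_layer_params(input_size: int, hidden_size: int, bias_vectors: int = 1) -> int:
    """Parameters in one LSTM layer: 4 gates, each with input, recurrent, and bias weights."""
    per_gate = hidden_size * (input_size + hidden_size) + bias_vectors * hidden_size
    return 4 * per_gate


def dense_params(inputs: int, outputs: int) -> int:
    """Parameters in a fully connected layer (weights plus biases)."""
    return inputs * outputs + outputs


def bar_model_params(bias_vectors: int) -> int:
    """Total for LSTM(64) -> LSTM(64) -> Dense(1) on 11 features (assumed head)."""
    return (
        lstm_layer_params(11, 64, bias_vectors)
        + lstm_layer_params(64, 64, bias_vectors)
        + dense_params(64, 1)
    )


single_bias_total = bar_model_params(bias_vectors=1)
double_bias_total = bar_model_params(bias_vectors=2)
```

Both conventions give a total that rounds to 53K, so the reported figure is consistent with the described architecture under these assumptions.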
Ensemble & Decision Pipeline
The three models vote independently on every tick. Votes are weighted: F1 contributes 1 vote, F2 contributes 2 votes (reflecting its superior per-trade quality), and ONNX contributes 1 vote, for a total of 4 votes. The ensemble computes:
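A weighted vote with these weights might be tallied as sketched below. The agreement formula (net weighted vote divided by total weight) and the rule that at least two models must back the winning side are assumptions, chosen to be consistent with the statement that no single model can trigger a trade alone. They are not the production formula, and `ensemble_vote` is a hypothetical name.

```python
from typing import Dict, Optional

# Vote weights as stated in the text: 4 votes in total.
WEIGHTS: Dict[str, int] = {"F1": 1, "F2": 2, "ONNX": 1}


def ensemble_vote(votes: Dict[str, int], threshold: float, min_models: int = 2) -> Optional[str]:
    """Combine per-model votes (+1 BUY, -1 SELL, 0 abstain) into one signal.

    Returns "BUY", "SELL", or None when agreement is insufficient.
    """
    total_weight = sum(WEIGHTS.values())
    net = sum(WEIGHTS[model] * vote for model, vote in votes.items())
    if net == 0:
        return None  # models cancel out or all abstain
    side = 1 if net > 0 else -1
    supporters = sum(1 for vote in votes.values() if vote == side)
    agreement = abs(net) / total_weight
    if supporters < min_models or agreement < threshold:
        return None
    return "BUY" if side > 0 else "SELL"
```

For example, with F1 and F2 both voting BUY and ONNX abstaining, the net vote is 3 of 4, which clears either a 25% or a 40% threshold. F2 voting alone is rejected by the two-model rule despite its 2x weight.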
AI direction override: In earlier iterations, a mean-revert pipeline determined trade direction and models voted on confidence. The current system lets ML models directly determine trade side through ensemble consensus. This eliminated directional conflicts between rule-based signals and model predictions.
The full decision pipeline consists of 10 stages. A signal must survive all stages to become an order on the exchange:
| Stage | Name | Description | Failure Mode |
|---|---|---|---|
| 1 | Tick Input | CME ES tick stream via Rithmic/Sierra DLL | No new tick: skip |
| 2 | Feature Extraction | 54 tick features + PSO + 11 bar features | delta_pts == 0: skip |
| 3 | Model Inference | F1(1x) + F2(2x) + ONNX(1x) ensemble vote | Agreement below threshold |
| 4 | AI Evaluation | Regime detection, quality score, Kelly sizing | Confidence < 0.20 |
| 5 | Gate Stack | 30+ sequential entry filters | Any gate fails: blocked |
| 6 | PSO + Sizing | Oscillator alignment, final position size | CONTRA reduces to 0.5x |
| 7 | Order Execution | New entry or pyramid add via CSV/DLL | Position conflict |
| 8 | Bracket Structure | OCO stop + target, R-multiple trailing | Exchange-managed |
| 9 | Exit Management | Tick-by-tick evaluation while in position | Various exit triggers |
| 10 | Post-Trade | Record result, update statistics, cooldown | Cooldown enforced |
Risk Management
The risk framework operates at three levels: per-trade brackets, per-session budgets, and per-day global limits. The gate stack at Stage 5 is deliberately aggressive -- it is cheaper to miss a good trade than to take a bad one. Each gate was added to fix a specific failure mode observed in live or simulated trading.
30+ Sequential Entry Gates (92.6% rejection rate on backtest, Feb 24):
Session Budgets:
| Session | Stop | Target | Ensemble | Cooldown | Spread | Loss Limit |
|---|---|---|---|---|---|---|
| RTH | 10t | 25t | 25% | 30s | 1t | $3,000 |
| ETH Pre | 10t | 25t | 25% | 40s | 2t | $350 |
| ETH Post | 10t | 16t | 40% | 60s | 2t | $350 |
| ETH Asia | 12t | 16t | 40% | 60s | 4t | $350 |
| ETH Europe | 10t | 16t | 40% | 50s | 3t | $350 |
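The session table maps naturally onto a small configuration structure, and a sequential gate check can then return the first gate that fails. The sketch below uses the table values as given. The field names, the interpretation of Spread as a maximum allowed spread, and the three example gates are assumptions; the real stack has 30+ gates that are not listed here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SessionBudget:
    stop_ticks: int
    target_ticks: int
    ensemble_threshold: float  # minimum ensemble agreement
    cooldown_s: int            # seconds to wait after a trade
    max_spread_ticks: int      # assumed meaning of the Spread column
    loss_limit_usd: float      # session loss limit


SESSIONS: Dict[str, SessionBudget] = {
    "RTH": SessionBudget(10, 25, 0.25, 30, 1, 3000.0),
    "ETH Pre": SessionBudget(10, 25, 0.25, 40, 2, 350.0),
    "ETH Post": SessionBudget(10, 16, 0.40, 60, 2, 350.0),
    "ETH Asia": SessionBudget(12, 16, 0.40, 60, 4, 350.0),
    "ETH Europe": SessionBudget(10, 16, 0.40, 50, 3, 350.0),
}


def first_failed_gate(
    session: str,
    spread_ticks: int,
    secs_since_last_trade: float,
    session_pnl_usd: float,
) -> Optional[str]:
    """Evaluate gates in order; return the name of the first failure, or None."""
    budget = SESSIONS[session]
    if spread_ticks > budget.max_spread_ticks:
        return "spread"
    if secs_since_last_trade < budget.cooldown_s:
        return "cooldown"
    if session_pnl_usd <= -budget.loss_limit_usd:
        return "loss_limit"
    return None
```

Returning the name of the failing gate, rather than a bare boolean, makes it straightforward to log which filters account for most rejections.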
PSO-driven Kelly sizing determines position size through a 9-layer soft multiplier stack. The PSO alignment value determines the sizing tier: TURBO (pso >= 0.5, 2x multiplier), BOOST (pso >= 0.25, 1.5x), NORMAL (1x), CONTRA (pso <= -0.5, 0.5x). After 3+ consecutive losses, sizing is forced to 1 lot regardless of Kelly or PSO. Maximum position: 10 lots.
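The tier thresholds and overrides above can be expressed directly in code. The sketch covers only the PSO tier layer, the loss-streak override, and the lot cap. The other layers of the multiplier stack are not described, and the round-down behaviour and function names are assumptions.

```python
from typing import Tuple


def pso_tier(pso_alignment: float) -> Tuple[str, float]:
    """Map a PSO alignment value to its sizing tier and multiplier."""
    if pso_alignment >= 0.5:
        return "TURBO", 2.0
    if pso_alignment >= 0.25:
        return "BOOST", 1.5
    if pso_alignment <= -0.5:
        return "CONTRA", 0.5
    return "NORMAL", 1.0


def position_size(
    base_lots: float,
    pso_alignment: float,
    consecutive_losses: int,
    max_lots: int = 10,
) -> int:
    """Apply the PSO tier to a base (e.g. Kelly-derived) lot count."""
    if consecutive_losses >= 3:
        return 1  # loss-streak override: always 1 lot
    _, multiplier = pso_tier(pso_alignment)
    lots = int(base_lots * multiplier)  # assumption: round down to whole lots
    return max(1, min(max_lots, lots))
```

Under these assumptions a 4-lot base in the TURBO tier becomes 8 lots, an 8-lot base is capped at 10, and any size collapses to 1 lot after three consecutive losses.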
Bracket management: Every entry receives an exchange-managed OCO bracket (hard stop + profit target). R-multiple trailing activates at R=0.6 (breakeven), R=1.0 (5t trail RTH / 4t ETH), R=1.5 (tight 3t/2t trail), and R=4.0 (take profit). Minimum 30-second hold time before any stop evaluation prevents the instant-stop toxicity identified in deep analysis.
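The R-multiple ladder can be sketched as a function of the best profit reached so far. The sketch assumes R is the peak favorable excursion divided by the initial stop distance, both in ticks, and that trailing distances are measured back from that peak. These definitions and the function name are assumptions.

```python
from typing import Optional, Tuple


def trailing_stop(
    peak_profit_ticks: float,
    initial_stop_ticks: float,
    is_rth: bool,
) -> Tuple[str, Optional[float]]:
    """Return (stage, stop offset in ticks relative to entry).

    A negative offset is on the losing side of entry, zero is breakeven,
    and a positive offset locks in profit. None means exit immediately.
    """
    r = peak_profit_ticks / initial_stop_ticks
    if r >= 4.0:
        return "TAKE_PROFIT", None
    if r >= 1.5:
        trail = 3 if is_rth else 2  # tight trail
        return "TIGHT_TRAIL", peak_profit_ticks - trail
    if r >= 1.0:
        trail = 5 if is_rth else 4
        return "TRAIL", peak_profit_ticks - trail
    if r >= 0.6:
        return "BREAKEVEN", 0.0
    return "INITIAL", -initial_stop_ticks
```

With a 10-tick RTH stop, the stop moves to breakeven once the trade has been 6 ticks in profit, to +5 ticks at a 10-tick peak, and to +12 ticks at a 15-tick peak. Since the peak never decreases, the stop only ever tightens.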
Ablation Studies
All strategic decisions are validated through systematic ablation. The methodology: monkey-patch the ensemble configuration, replay the backtester on the same historical dataset, and compare 9+ configurations on identical data. This eliminates market condition variance and isolates the contribution of each component.
Central finding: Every rule-based strategy tested degraded ML ensemble performance. Three strategies were added and subsequently removed after ablation:
Momentum
Was blocking profitable ML trades. Ensemble P&L with Momentum: +$5,205. Without: +$10,663. A 105% improvement from removing it.
+105% P&L without
Mean Revert
ML trio alone: +$13,353, 65.7% WR, PF 1.49. With Mean Revert: +$8,442, PF 1.19. A 58% improvement from removal.
+58% P&L without
STAT_ARB
Tested twice, removed twice. Only profitable in RTH (+$2,001); catastrophic in ETH Europe (-$1,211, 41.5% WR).
ML-only PF 1.80 vs combined 1.58
ML vs. Rule-Based Ablation (644 trades):
| Configuration | Trades | Win Rate | PF | Net P&L | MaxDD | R/MDD |
|---|---|---|---|---|---|---|
| ML-only (trio) | 447 | 67.6% | 1.80 | +$12,633 | $777 | 16.25x |
| SA-only | 197 | 53.8% | 1.15 | +$616 | $1,851 | 0.33x |
| Combined (ML+SA) | 644 | 63.4% | 1.58 | +$13,249 | $1,744 | 7.60x |
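The table columns follow standard definitions, which can be computed from a per-trade P&L series as sketched below. The function names and the toy P&L list are illustrative assumptions; the drawdown is measured on the cumulative equity curve starting from zero.

```python
import math
from typing import Sequence


def win_rate(pnls: Sequence[float]) -> float:
    """Fraction of trades with positive P&L."""
    return sum(1 for p in pnls if p > 0) / len(pnls)


def profit_factor(pnls: Sequence[float]) -> float:
    """Gross profit divided by gross loss."""
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    return math.inf if gross_loss == 0 else gross_profit / gross_loss


def max_drawdown(pnls: Sequence[float]) -> float:
    """Largest peak-to-trough decline of cumulative P&L."""
    equity, peak, worst = 0.0, 0.0, 0.0
    for p in pnls:
        equity += p
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


def return_over_max_dd(net_pnl: float, max_dd: float) -> float:
    """Net P&L divided by maximum drawdown (the R/MDD column)."""
    return math.inf if max_dd == 0 else net_pnl / max_dd


# Toy example (hypothetical trades, in dollars).
example = [100.0, -50.0, 200.0, -100.0, -25.0, 150.0]
example_pf = profit_factor(example)
example_mdd = max_drawdown(example)
```

Applied to the ML-only row, 12,633 / 777 is about 16.26, which agrees with the reported 16.25x up to rounding.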
Model Solo & Pair Performance:
| Configuration | Trades | Win Rate | PF | Net P&L |
|---|---|---|---|---|
| ML Trio (no MR) | 447 | 65.7% | 1.49 | +$13,353 |
| ONNX Solo | 278 | 62.9% | 1.36 | +$4,494 |
| F2 Solo | 131 | 69.5% | 1.87 | +$4,029 |
| F1+F2 Pair | -- | 67.8% | 2.01 | -- |
| ML + MR (combined) | -- | -- | 1.19 | +$8,442 |
Pattern: All three rule-based strategies (Momentum, Mean Revert, STAT_ARB) were killed by the same ablation process. The ML ensemble consistently performs better without rule-based overlays. This is the central empirical result of this work: let the models decide.
Statistical Results
Performance metrics from the ablation study (ML-only configuration) and deep analysis conducted on 148K ticks of historical data. All results are from simulated trading on historical data; past results do not guarantee future performance.
Headline metrics (ML-only, 447 trades):
| Metric | Value |
|---|---|
| Win Rate | 67.6% |
| Profit Factor | 1.80 |
| Net P&L | +$12,633 |
| Max Drawdown | $777 |
| Return / MaxDD | 16.25x |
| Signal Rejection Rate | 92.6% |
| Total Signals Generated | ~6,000 |
| Trades Executed | 447 |
Deep analysis findings (148K ticks, February 21, 2026):
| Finding | Detail | Action Taken |
|---|---|---|
| Best Hour | 11:00 ET = +$3,266 net | No time blocks applied (models trade all hours) |
| Kill Zones | 10:00, 12:00, 19:00 ET (negative or coin flip) | Observed but not hard-blocked; models adapt |
| Instant Stops | 37 trades under 10s = -$3,022 at 27% WR | Fixed: 30s min hold time before stop checks |
| Hot Hand Fallacy | Win streaks do not predict next trade outcome | No streak-based sizing (confirmed independent events) |
| Walk-Forward | Edge not perfectly consistent every week | Week 2 lost $628; cumulative result positive over the full period |
| F2 Fire Rate | 1 per 682 ticks, 92% in RTH | F2 given 2x vote weight due to quality |
Model comparison (solo performance):
| Model | Accuracy | Solo WR | Solo PF | Solo P&L | Trades | Vote Weight |
|---|---|---|---|---|---|---|
| F1 LSTM | 64.5% | -- | -- | -- | -- | 1x |
| F2 LightGBM | 54.6% | 69.5% | 1.87 | +$4,029 | 131 | 2x |
| ONNX 1m | 54.3% | 62.9% | 1.36 | +$4,494 | 278 | 1x |
| Ensemble (trio) | -- | 67.6% | 1.80 | +$12,633 | 447 | -- |
Note on confidence intervals: With 447 trades at 67.6% WR, the 95% confidence interval for the true win rate is approximately 63.1% -- 71.8% (Wilson score interval). The profit factor of 1.80 and R/MDD of 16.25x provide additional evidence of a statistically meaningful edge, though the dataset covers a limited time period and market conditions may change.
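The Wilson score interval is straightforward to reproduce. The sketch below implements the standard formula; the win count of 302 is an assumption derived from rounding 67.6% of 447 trades.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = wins / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


# 67.6% of 447 trades is about 302 wins (assumed exact count).
low, high = wilson_interval(302, 447)
```

The resulting bounds are roughly 63% and 72%. The lower bound sits well above 50%, though the interval says nothing about whether the trades are independent or whether the sampled period is representative.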
Architecture
The system is designed around a broker-agnostic abstraction layer. Two abstract base classes -- DataService (tick/DOM data) and TradeService (orders/fills) -- define the interface. Broker-specific implementations are swappable providers. All strategy logic, risk management, and ML inference live in a shared core that never changes when swapping providers.
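The provider pattern described above might look like the following. Only the class names DataService and TradeService come from the text. The method names, the `Tick` type, the replay and paper providers, and the toy rule inside `run_core` are hypothetical and stand in for the real interfaces and strategy.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Tick:
    timestamp: float
    price: float


class DataService(ABC):
    """Source of market data (method name is an assumption)."""

    @abstractmethod
    def ticks(self) -> Iterator[Tick]: ...


class TradeService(ABC):
    """Order routing (method name is an assumption)."""

    @abstractmethod
    def submit_market_order(self, side: str, lots: int) -> str: ...


class ReplayDataService(DataService):
    """Provider that replays a fixed list of ticks, e.g. for backtests."""

    def __init__(self, history: List[Tick]) -> None:
        self._history = history

    def ticks(self) -> Iterator[Tick]:
        return iter(self._history)


class PaperTradeService(TradeService):
    """Provider that records orders instead of sending them."""

    def __init__(self) -> None:
        self.orders: List[str] = []

    def submit_market_order(self, side: str, lots: int) -> str:
        order_id = f"paper-{len(self.orders) + 1}"
        self.orders.append(f"{order_id}:{side}:{lots}")
        return order_id


def run_core(data: DataService, trade: TradeService) -> List[str]:
    """Shared core: depends only on the abstract interfaces.

    The rule below (buy on any uptick) is a placeholder, not the real strategy.
    """
    order_ids: List[str] = []
    previous = None
    for tick in data.ticks():
        if previous is not None and tick.price > previous.price:
            order_ids.append(trade.submit_market_order("BUY", 1))
        previous = tick
    return order_ids
```

Because `run_core` never references a concrete provider, swapping the replay and paper classes for broker-specific ones requires no change to the core.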
Current deployment: Parallels Windows 11 VM (ARM64) running Sierra Chart with a C++ DLL (T29-LocalBridge_v39.cpp) that bridges tick data and order flow to a Python bridge via CSV files. The bridge runs a 20ms polling loop, processing ~50 ticks/second during active markets.
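A file-based polling loop of this kind has to cope with the writer being mid-line when the reader wakes up. The sketch below shows one way to do that by tracking a byte offset and consuming only complete lines. The function names are hypothetical and the CSV layout is left unspecified, since it is not described here.

```python
import time
from typing import Callable, List, Optional, Tuple


def read_new_lines(path: str, offset: int) -> Tuple[List[str], int]:
    """Return complete lines appended since `offset`, plus the new offset.

    A trailing partial line (no newline yet) is left for the next poll.
    """
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    end = chunk.rfind(b"\n")
    if end == -1:
        return [], offset
    complete = chunk[: end + 1]
    return complete.decode("utf-8").splitlines(), offset + len(complete)


def poll_file(
    path: str,
    handle_line: Callable[[str], None],
    interval_s: float = 0.020,
    max_iterations: Optional[int] = None,
) -> None:
    """Poll `path` every `interval_s` seconds and pass new lines to a callback."""
    offset = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        lines, offset = read_new_lines(path, offset)
        for line in lines:
            handle_line(line)
        time.sleep(interval_s)
        iterations += 1
```

Tracking a byte offset rather than re-reading the file keeps each poll cheap as the CSV grows, and `splitlines` tolerates both LF and CRLF line endings.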
State persistence & crash recovery: The system persists state to disk on every trade event. On restart, it reads the last known position from Sierra's fill log and reconciles with the DLL's heartbeat file. A 10-second startup lockout prevents trading on stale state.
Scaling path:
| Phase | Status | Description |
|---|---|---|
| Current | Live | Parallels VM, Sierra Chart, 1-lot base, $50K eval (account 024) |
| Phase 2C | Next | Extract 200+ strategy globals, build broker-agnostic orchestrator |
| Azure VM | Planned | Dedicated cloud VM for Sierra; eliminates Parallels instability |
| Multi-Broker | Planned | IB providers complete; paper trade, then deploy alongside Sierra |
| Multi-Lot | Future | Scale from 1-lot to 4-6 lots as track record builds |
Current Version
| Version | Date | Description |
|---|---|---|
| v81e | Feb 2026 | All 3 models retrained on 200K ticks. ML ensemble determines trade direction. Session-tunable agreement thresholds. 30s minimum hold time. PSO-driven Kelly sizing. 30+ sequential entry gates. |
The system follows a strict ablation-validated release process. Every model and strategy component is tested against historical data before deployment. Components that degrade performance are removed immediately -- three rule-based strategies have been killed by this process to date.
Past results are from simulated trading on historical data and do not guarantee future performance. The system is deployed in live production ($50K evaluation). All metrics should be interpreted in the context of a limited dataset and evolving market conditions.