Gradient-Boosted Scalping on ES Futures

A multi-model approach to automated trade signal generation, combining tick-level LSTMs, gradient-boosted trees, and bar-level time-series models with adaptive risk management and systematic ablation-driven development.

BHF Capital / Keystone Strategy | believe.v125 | March 2026 | R. G. Williams

1. Abstract

We present a hybrid machine learning system for intraday trading of E-mini S&P 500 futures (ES), deployed in live production on CME via Rithmic execution. The system combines three independently trained models -- an LSTM for tick-level pattern recognition (5,537 parameters, 64.5% accuracy), a LightGBM classifier on 54 engineered features (69.5% solo win rate, 2x ensemble vote weight), and a bar-level ONNX LSTM for 1-minute time-series prediction (53K parameters, highest solo P&L at +$4,494) -- into a weighted ensemble that requires a minimum level of agreement before any signal reaches execution.

The system processes 54 tick-level features and 11 bar-level technical indicators through a 10-stage decision pipeline with 30+ sequential entry gates, achieving a 92.6% signal rejection rate. Only signals surviving all gates reach the exchange.

Development follows an ablation-first methodology: every model and rule-based strategy is validated through systematic ablation studies on historical data before deployment. Three rule-based strategies (Momentum, Mean Revert, STAT_ARB) were added and subsequently removed after ablation showed they degraded ML ensemble performance. The resulting ML-only configuration achieves a 67.6% win rate, 1.80 profit factor, and 16.25x return-to-max-drawdown ratio across 447 simulated trades.

Key claim: In this domain, the ML ensemble performed best when trading without rule-based overlays. Every rule-based heuristic tested -- momentum, mean-reversion, and statistical arbitrage -- degraded ensemble performance when measured through ablation. The edge comes from letting the models decide, not from layering human intuition on top.

2. Data & Feature Engineering

The system ingests tick-level market data from CME E-mini S&P 500 futures (ES) via Rithmic, captured by a Sierra Chart DLL (T29-LocalBridge_v39.cpp) that writes two CSV streams in real time. The Python bridge reads these files in a 20ms polling loop.

        // Tick CSV format (6 fields, v36+)
        seq, ms, price, bid_vol, ask_vol, total_vol

        // DOM CSV format (10 bid + 10 ask levels)
        seq, ms, last_px, bid_vol, ask_vol, N, b0_px, b0_sz, ..., a9_px, a9_sz
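The 6-field tick layout above can be read with a few lines of Python. This is a minimal sketch, not the bridge's actual reader: the `Tick` class, the `parse_tick_line` helper, the integer types chosen for the volume fields, and the sample row are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tick:
    """One row of the tick CSV (field names taken from the format comment)."""
    seq: int
    ms: int
    price: float
    bid_vol: int
    ask_vol: int
    total_vol: int


def parse_tick_line(line: str) -> Tick:
    """Parse 'seq, ms, price, bid_vol, ask_vol, total_vol' into a Tick.

    Field types are assumed; the source only documents the field order.
    """
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    seq, ms, price, bid_vol, ask_vol, total_vol = fields
    return Tick(int(seq), int(ms), float(price),
                int(bid_vol), int(ask_vol), int(total_vol))


# Hypothetical sample row, not taken from the real data files.
tick = parse_tick_line("101, 1700000000123, 5012.25, 3, 0, 3")
```

Rejecting rows with the wrong field count is a deliberate choice here: a file that is polled while it is still being written can end in a partial line, and a reader would normally skip that row and retry on the next poll.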
      

The training dataset consists of ~200K raw ticks collected from live ES H6 futures trading, supplemented by a 614K tick backfill (ticks_RGW.csv). All three models were retrained on February 26, 2026 using the 200K tick dataset with temporal train/validation splits (80/20) and no future-peeking.

The feature engine (features.py) computes 54 tick-level features on every incoming tick, organized into the following categories:

Price Dynamics

12 features

Multi-scale returns (5t, 10t, 20t, 50t, 100t), realized volatility at multiple horizons, momentum indicators, price acceleration, and velocity of price change.

Microstructure

15 features

Bid-ask spread, trade flow imbalance, order flow toxicity (VPIN proxy), DOM imbalance across 10 levels, volume surge detection, and trade intensity metrics.

Session Context

8 features

Position within session range, distance from VWAP, distance from session high/low, range expansion rate, and session-relative volume metrics.

Time Features

19 features

Cyclical encoding of hour/minute (sin/cos), session flags (RTH/ETH), time-to-session-close, day-of-week encoding, and calendar event proximity indicators.
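Cyclical encoding maps a clock value onto the unit circle so that 23:59 and 00:00 end up close together in feature space. The sketch below shows one common way to do this; the function name, the returned keys, and the choice of periods (24 for hours, 60 for minutes) are assumptions, since the source does not list the individual time features.

```python
import math


def cyclical_time_features(hour: int, minute: int) -> dict:
    """Encode hour-of-day and minute-of-hour as sin/cos pairs."""
    hour_angle = 2.0 * math.pi * hour / 24.0
    minute_angle = 2.0 * math.pi * minute / 60.0
    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "minute_sin": math.sin(minute_angle),
        "minute_cos": math.cos(minute_angle),
    }


features = cyclical_time_features(hour=9, minute=30)
```

Each sin/cos pair lies in [-1, 1], so these features already sit well inside the [-10, 10] range used for the tick features.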

All tick features are normalized to a [-10, 10] range using rolling statistics. Additionally, an 11-feature bar-level pipeline aggregates 1-minute OHLCV bars and computes standard technical indicators used exclusively by the ONNX model:

        # 11 Bar Features (1-min OHLCV aggregation)
        RSI(14), ATR(14), MACD signal, MACD histogram,
        Bollinger %B, Bollinger bandwidth, Stochastic %K,
        Stochastic %D, EMA ratio (8/21), Volume ratio,
        Close-to-open return
      

The PSO (Oscillator Power) indicator is computed from a 500-tick stochastic oscillator smoothed by EMA-25, producing a value in [-1.0, +1.0]. PSO drives the position sizing tier system and provides momentum alignment context to the ensemble.
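One plausible reading of that description is sketched below. It is not the formula in features.py: the class name `PsoSketch`, the rescaling of a stochastic %K from [0, 1] to [-1, 1], the EMA smoothing factor 2 / (span + 1), and the neutral value used for a flat window are all assumptions.

```python
from collections import deque
from typing import Optional


class PsoSketch:
    """Stochastic oscillator over a tick window, smoothed by an EMA."""

    def __init__(self, window: int = 500, ema_span: int = 25) -> None:
        self.prices = deque(maxlen=window)
        self.alpha = 2.0 / (ema_span + 1)  # standard EMA smoothing factor
        self.value: Optional[float] = None

    def update(self, price: float) -> Optional[float]:
        """Feed one tick price; returns None until the window is full."""
        self.prices.append(price)
        if len(self.prices) < self.prices.maxlen:
            return None  # still warming up
        lo, hi = min(self.prices), max(self.prices)
        # %K in [0, 1]; a flat window is treated as neutral (assumption).
        k = 0.5 if hi == lo else (price - lo) / (hi - lo)
        raw = 2.0 * k - 1.0  # rescale to [-1, +1]
        if self.value is None:
            self.value = raw
        else:
            self.value = self.alpha * raw + (1.0 - self.alpha) * self.value
        return self.value
```

Because every raw reading lies in [-1, +1] and the EMA is a convex combination of raw readings, the smoothed value stays in the same range. Returning `None` during warm-up mirrors the 500-tick requirement stated for the indicator.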

3. Model Architectures

The ensemble comprises three independently trained models, each designed to capture a different temporal scale and feature representation. All models produce directional signals (BUY/SELL) with confidence scores. No single model can trigger a trade alone.

F1 LSTM

The Founder

Tick-level pattern recognition

The original model that proved an edge exists in this data. A compact LSTM operating on raw tick-aggregated OHLCV bars with bracket-based labeling.

5,537 params

64.5% accuracy

(1,7,5) input shape

20t/bar aggregation

Input: 7-bar lookback of OHLCV data aggregated from 20-tick bars. Labeling: symmetric bracket (+/-5.0 pts, 2.5 pt threshold). Output: tanh activation producing BUY/SELL signal.

1x vote

F2 LightGBM

The Sniper

Highest quality per trade

Fires rarely, but when it speaks, the ensemble listens. 200-tree gradient-boosted classifier across all 54 tick features with 2x vote weight.

200 trees

54.6% accuracy

54 features

69.5% solo WR

Input: full 54-feature vector computed at each tick. Labeling: trend-based with 100-tick lookahead. Output: predict_proba producing STRONG_BUY/BUY/SELL signals. Solo performance: 131 trades, PF 1.87.

2x vote (highest quality)

ONNX 1-Min

The Workhorse

Highest solo P&L

A larger LSTM operating on 1-minute bar aggregations with technical analysis features. The volume earner that sees the bigger picture.

53K params

54.3% accuracy

(1,60,11) input

30-min horizon

Input: 60-bar lookback of 11 TA features on 1-minute OHLCV. Architecture: 2-layer LSTM(64) with dropout. Output: sigmoid activation, 30-minute prediction horizon. Solo: 278 trades, +$4,494 net.

1x vote

Training methodology: All models use temporal train/validation splits (80/20) with strict temporal ordering -- no shuffling, no future information leakage. F1 and ONNX use class-balanced sampling to handle the inherent imbalance between directional labels. F2 LightGBM uses scale_pos_weight for class balance. All three models were retrained on February 26, 2026 on 200K ticks, producing measurable accuracy improvements: F1 64.5% (was 55.5%), F2 54.6% (was 52.9%), ONNX 54.3% (was 53.8%).
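A temporal split of this kind reduces to cutting the time-ordered samples at a single index. The sketch below illustrates the idea; `temporal_split` is a hypothetical helper, not a function from the training code.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_split(rows: Sequence[T],
                   train_frac: float = 0.8) -> Tuple[Sequence[T], Sequence[T]]:
    """Split time-ordered rows into (train, validation) without shuffling.

    Every validation row is later in time than every training row, so the
    model is never fit on information from the validation period.
    """
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must be between 0 and 1")
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]


train, val = temporal_split(list(range(10)))
```

With windowed inputs (for example the 60-bar lookback), samples near the cut can still share raw ticks across the boundary, and with lookahead labels the last training labels can depend on prices inside the validation period. Dropping a gap of samples at the cut is a common precaution, though the source does not say whether one is used.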

4. Ensemble & Decision Pipeline

The three models vote independently on every tick. Votes are weighted: F1 contributes 1 vote, F2 contributes 2 votes (reflecting its superior per-trade quality), and ONNX contributes 1 vote, for a total of 4 votes. The ensemble computes:

        consensus_score = sum(signal_values)  // BUY=+1, SELL=-1, per vote
        agreement_ratio = max(buy_votes, sell_votes) / total_votes

        // Session-tunable thresholds
        RTH, ETH_PRE:  agreement >= 25%
        ETH_POST, ASIA, EUROPE: agreement >= 40%
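The weighted vote above can be sketched in runnable form. Several details are assumptions rather than confirmed behavior of ensemble.py: a model may abstain (represented as `None`), the denominator is the full weight of 4 even when a model abstains, and a tied consensus produces no trade.

```python
from typing import Dict, Optional, Tuple

WEIGHTS = {"F1": 1, "F2": 2, "ONNX": 1}  # vote weights from the text
MIN_AGREEMENT = {  # session thresholds from the listing above
    "RTH": 0.25, "ETH_PRE": 0.25,
    "ETH_POST": 0.40, "ASIA": 0.40, "EUROPE": 0.40,
}


def ensemble_vote(signals: Dict[str, Optional[str]],
                  session: str) -> Tuple[Optional[str], int, float]:
    """Return (side or None, consensus_score, agreement_ratio)."""
    buy = sum(WEIGHTS[m] for m, s in signals.items() if s == "BUY")
    sell = sum(WEIGHTS[m] for m, s in signals.items() if s == "SELL")
    total = sum(WEIGHTS.values())  # assumed fixed at 4, abstentions included
    consensus = buy - sell  # BUY = +1, SELL = -1 per vote
    agreement = max(buy, sell) / total
    if consensus == 0 or agreement < MIN_AGREEMENT[session]:
        return None, consensus, agreement
    return ("BUY" if consensus > 0 else "SELL"), consensus, agreement


side, score, ratio = ensemble_vote(
    {"F1": "BUY", "F2": "BUY", "ONNX": "SELL"}, "RTH")
```

Under these assumptions the example yields a BUY with a consensus score of 2 and an agreement ratio of 0.75. If the denominator counted only the votes actually cast, the ratio could never fall below 50%, which is why the fixed denominator is assumed here.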
      

AI direction override: In earlier iterations, a mean-revert pipeline determined trade direction and models voted on confidence. The current system lets ML models directly determine trade side through ensemble consensus. This eliminated directional conflicts between rule-based signals and model predictions.

The full decision pipeline consists of 10 stages. A signal must survive all stages to become an order on the exchange:

Stage	Name	Description	Failure Mode
1	Tick Input	CME ES tick stream via Rithmic/Sierra DLL	No new tick: skip
2	Feature Extraction	54 tick features + PSO + 11 bar features	delta_pts == 0: skip
3	Model Inference	F1(1x) + F2(2x) + ONNX(1x) ensemble vote	Agreement below threshold
4	AI Evaluation	Regime detection, quality score, Kelly sizing	Confidence < 0.20
5	Gate Stack	30+ sequential entry filters	Any gate fails: blocked
6	PSO + Sizing	Oscillator alignment, final position size	CONTRA reduces to 0.5x
7	Order Execution	New entry or pyramid add via CSV/DLL	Position conflict
8	Bracket Structure	OCO stop + target, R-multiple trailing	Exchange-managed
9	Exit Management	Tick-by-tick evaluation while in position	Various exit triggers
10	Post-Trade	Record result, update statistics, cooldown	Cooldown enforced

5. Risk Management

The risk framework operates at three levels: per-trade brackets, per-session budgets, and per-day global limits. The gate stack at Stage 5 is deliberately aggressive -- it is cheaper to miss a good trade than to take a bad one. Each gate was added to fix a specific failure mode observed in live or simulated trading.

30+ Sequential Entry Gates (92.6% rejection rate on backtest, Feb 24):

Gate	Name	Rule
01	Startup Block	10s hard lock after bridge start
02	Sync Lockout	Position sync in progress
03	Rate Limit	5s between orders, 6/min max
04	Post-Exit Cooldown	30-60s per session after exit
05	Anti-Hammer	Same price + direction + 600s block
06	Chop Cooldown	3+ losses triggers 600s pause
07	Edge Detector	Bayesian/Kelly/CUSUM analysis
08	Calendar Blackout	FOMC, NFP, CPI event blocks
09	Session Budget	Per-session loss limit enforcement
10	Risk Shutdown	Global daily loss hard kill
11	Smart Entry Filters	Range position, VWAP proximity
12	Spread Filter	RTH 1t, Pre/Post 2t, EUR 3t, Asia 4t
13	Position Check	Flat for entry, same-dir for pyramid
14	Momentum Opposition	Blocks against strong momentum
15	Min Hold Time	30s before stop evaluation
16	PSO Warmup	Requires 500 ticks of data
17	Volatility/Trend	Regime-specific filters
18	Breakout Override	|accel| >= 8 bypasses range filter
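A sequential gate stack can be modeled as an ordered list of named predicates in which the first failure blocks the signal. The sketch below implements three of the gates listed above using their documented thresholds; the context dictionary, its keys, and the helper `first_blocking_gate` are hypothetical and do not reflect the real gate code.

```python
from typing import Callable, Dict, List, Optional, Tuple

Gate = Tuple[str, Callable[[Dict[str, float]], bool]]

# Each predicate returns True when the signal may pass.
GATES: List[Gate] = [
    ("Startup Block", lambda c: c["now"] - c["bridge_start"] >= 10.0),
    ("Rate Limit", lambda c: c["now"] - c["last_order"] >= 5.0
                             and c["orders_last_min"] < 6),
    ("Spread Filter", lambda c: c["spread_ticks"] <= c["max_spread_ticks"]),
]


def first_blocking_gate(gates: List[Gate],
                        ctx: Dict[str, float]) -> Optional[str]:
    """Evaluate gates in order; return the first that fails, else None."""
    for name, passes in gates:
        if not passes(ctx):
            return name
    return None


ctx = {"now": 100.0, "bridge_start": 0.0, "last_order": 97.0,
       "orders_last_min": 1, "spread_ticks": 1, "max_spread_ticks": 1}
blocked_by = first_blocking_gate(GATES, ctx)
```

In this example the signal is blocked by the rate limit because only 3 seconds have passed since the last order. Returning the name of the blocking gate, rather than a bare boolean, makes it straightforward to tally which gates account for the rejections.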

Session Budgets:

Session	Stop	Target	Ensemble	Cooldown	Spread	Loss Limit
RTH	10t	25t	25%	30s	1t	$3,000
ETH Pre	10t	25t	25%	40s	2t	$350
ETH Post	10t	16t	40%	60s	2t	$350
ETH Asia	12t	16t	40%	60s	4t	$350
ETH Europe	10t	16t	40%	50s	3t	$350

PSO-driven Kelly sizing determines position size through a 9-layer soft multiplier stack. The PSO alignment value determines the sizing tier: TURBO (pso >= 0.5, 2x multiplier), BOOST (pso >= 0.25, 1.5x), NORMAL (1x), CONTRA (pso <= -0.5, 0.5x). After 3+ consecutive losses, sizing is forced to 1 lot regardless of Kelly or PSO. Maximum position: 10 lots.
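The tier thresholds and overrides in the paragraph above can be expressed as a small function. In this sketch `base_lots` is a stand-in for the output of the Kelly calculation and the other multiplier layers, which the source does not detail; rounding down and the 1-lot floor are also assumptions.

```python
from typing import Tuple


def pso_tier(pso_aligned: float) -> Tuple[str, float]:
    """Map the PSO alignment value to (tier name, size multiplier)."""
    if pso_aligned >= 0.5:
        return "TURBO", 2.0
    if pso_aligned >= 0.25:
        return "BOOST", 1.5
    if pso_aligned <= -0.5:
        return "CONTRA", 0.5
    return "NORMAL", 1.0


def position_size(base_lots: float, pso_aligned: float,
                  consecutive_losses: int, max_lots: int = 10) -> int:
    """Apply the PSO tier, the loss-streak override, and the lot cap."""
    if consecutive_losses >= 3:
        return 1  # forced to 1 lot regardless of Kelly or PSO
    _, multiplier = pso_tier(pso_aligned)
    lots = int(base_lots * multiplier)  # round down (assumption)
    return max(1, min(max_lots, lots))


lots = position_size(base_lots=3, pso_aligned=0.3, consecutive_losses=0)
```

Here a base of 3 lots in the BOOST tier gives 4.5, which rounds down to 4 lots. The loss-streak check runs first so that no multiplier can override it.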

Bracket management: Every entry receives an exchange-managed OCO bracket (hard stop + profit target). R-multiple trailing activates at R=0.6 (breakeven), R=1.0 (5t trail RTH / 4t ETH), R=1.5 (tight 3t/2t trail), and R=4.0 (take profit). A minimum 30-second hold time before any stop evaluation prevents the instant-stop losses identified in the deep analysis (37 trades under 10s at 27% WR).
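The R-multiple ladder can be illustrated as follows. R is taken here to be open profit divided by the initial stop distance, and the ladder is keyed on the best R reached so far; both are assumptions about exit_manager.py, as are the function names and action labels. The ES tick size of 0.25 index points is the contract's standard minimum increment.

```python
ES_TICK = 0.25  # ES minimum price increment, in index points


def r_multiple(entry: float, price: float, side: int, stop_ticks: int) -> float:
    """Open profit in units of initial risk; side is +1 long, -1 short."""
    return side * (price - entry) / (stop_ticks * ES_TICK)


def trailing_action(best_r: float, is_rth: bool) -> str:
    """Map the best R reached so far to a stop-management action."""
    if best_r >= 4.0:
        return "TAKE_PROFIT"
    if best_r >= 1.5:
        return f"TRAIL_{3 if is_rth else 2}T"  # tight trail
    if best_r >= 1.0:
        return f"TRAIL_{5 if is_rth else 4}T"
    if best_r >= 0.6:
        return "BREAKEVEN"
    return "HOLD_INITIAL_STOP"


r = r_multiple(entry=5000.0, price=5002.5, side=1, stop_ticks=10)
action = trailing_action(r, is_rth=True)
```

With a 10-tick stop (2.5 points), a 2.5-point favorable move is exactly R=1.0, which selects the 5-tick RTH trail. Checking the thresholds from highest to lowest ensures that the most advanced stage reached always wins.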

6. Ablation Studies

All strategic decisions are validated through systematic ablation. The methodology: monkey-patch the ensemble configuration, replay the backtester, and compare 9+ configurations on the same historical dataset. Holding the data fixed removes market-condition variance and isolates the contribution of each component.

Central finding: Every rule-based strategy tested degraded ML ensemble performance. Three strategies were added and subsequently removed after ablation:

Removed

Momentum

Was blocking profitable ML trades. Ensemble P&L with Momentum: +$5,205. Without: +$10,663. A 105% improvement from removing it.

+105% P&L without

Removed

Mean Revert

ML trio alone: +$13,353, 65.7% WR, PF 1.49. With Mean Revert: +$8,442, PF 1.19. A 58% improvement from removal.

+58% P&L without

Removed

STAT_ARB

Tested twice, removed twice. Only profitable in RTH (+$2,001); catastrophic in ETH Europe (-$1,211, 41.5% WR).

ML-only PF 1.80 vs combined 1.58

ML vs. Rule-Based Ablation (644 trades; SA = STAT_ARB):

Configuration	Trades	Win Rate	PF	Net P&L	MaxDD	R/MDD
ML-only (trio)	447	67.6%	1.80	+$12,633	$777	16.25x
SA-only	197	53.8%	1.15	+$616	$1,851	0.33x
Combined (ML+SA)	644	63.4%	1.58	+$13,249	$1,744	7.60x
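The columns in the table above follow standard definitions, and the R/MDD column is consistent with Net P&L divided by MaxDD (for example, 616 / 1,851 is about 0.33). The sketch below computes the same metrics from a list of per-trade P&L values; `summarize` is a hypothetical helper, and the treatment of breakeven trades (counted as neither win nor loss) is an assumption.

```python
import math
from typing import Dict, List


def summarize(pnls: List[float]) -> Dict[str, float]:
    """Win rate, profit factor, net P&L, max drawdown, and return/max-DD."""
    if not pnls:
        raise ValueError("no trades")
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    equity = peak = max_dd = 0.0
    for p in pnls:  # walk the equity curve in trade order
        equity += p
        peak = max(peak, equity)
        max_dd = max(max_dd, peak - equity)
    net = sum(pnls)
    return {
        "trades": len(pnls),
        "win_rate": sum(1 for p in pnls if p > 0) / len(pnls),
        "profit_factor": gross_profit / gross_loss if gross_loss else math.inf,
        "net": net,
        "max_dd": max_dd,
        "r_mdd": net / max_dd if max_dd else math.inf,
    }


stats = summarize([100.0, -50.0, 200.0, -100.0, 50.0])
```

Max drawdown depends on the order of the trades, so each ablation configuration has to be summarized over its own trade sequence rather than from aggregate totals.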

Model Solo & Pair Performance:

Configuration	Trades	Win Rate	PF	Net P&L
ML Trio (no MR)	447	65.7%	1.49	+$13,353
ONNX Solo	278	62.9%	1.36	+$4,494
F2 Solo	131	69.5%	1.87	+$4,029
F1+F2 Pair	--	67.8%	2.01	--
ML + MR (combined)	--	--	1.19	+$8,442

Pattern: All three rule-based strategies (Momentum, Mean Revert, STAT_ARB) were killed by the same ablation process. The ML ensemble consistently performs better without rule-based overlays. This is the central empirical result of this work: let the models decide.

7. Statistical Results

Performance metrics from the ablation study (ML-only configuration) and deep analysis conducted on 148K ticks of historical data. All results are from simulated trading on historical data; past results do not guarantee future performance.

Headline metrics (ML-only, 447 trades):

Metric	Value
Win Rate	67.6%
Profit Factor	1.80
Net P&L	+$12,633
Max Drawdown	$777
Return / MaxDD	16.25x
Signal Rejection Rate	92.6%
Total Signals Generated	~6,000
Trades Executed	447

Deep analysis findings (148K ticks, February 21, 2026):

Finding	Detail	Action Taken
Best Hour	11:00 ET = +$3,266 net	No time blocks applied (models trade all hours)
Kill Zones	10:00, 12:00, 19:00 ET (negative or coin flip)	Observed but not hard-blocked; models adapt
Instant Stops	37 trades under 10s = -$3,022 at 27% WR	Fixed: 30s min hold time before stop checks
Hot Hand Fallacy	Win streaks do not predict next trade outcome	No streak-based sizing (confirmed independent events)
Walk-Forward	Edge not perfectly consistent every week	Week 2 lost $628; overall P&L remained positive
F2 Fire Rate	1 per 682 ticks, 92% in RTH	F2 given 2x vote weight due to quality

Model comparison (solo performance):

Model	Accuracy	Solo WR	Solo PF	Solo P&L	Trades	Vote Weight
F1 LSTM	64.5%	--	--	--	--	1x
F2 LightGBM	54.6%	69.5%	1.87	+$4,029	131	2x
ONNX 1m	54.3%	62.9%	1.36	+$4,494	278	1x
Ensemble (trio)	--	67.6%	1.80	+$12,633	447	--

Note on confidence intervals: With 447 trades at 67.6% WR, the 95% confidence interval for the true win rate is approximately 63.1% -- 71.9% (Wilson score interval). The profit factor of 1.80 and R/MDD of 16.25x provide additional evidence of a statistically meaningful edge, though the dataset covers a limited time period and market conditions may change.
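The Wilson score interval can be reproduced in a few lines. The win count of 302 is an assumption: it is the integer closest to 67.6% of 447, since the source reports only the rate.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = wins / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(wins=302, n=447)
```

For these inputs the interval runs from roughly 63% to 72%, in line with the range quoted above. The interval treats trades as independent draws, so it describes sampling error only and says nothing about changes in market regime.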

8. Architecture

The system is designed around a broker-agnostic abstraction layer. Two abstract base classes -- DataService (tick/DOM data) and TradeService (orders/fills) -- define the interface. Broker-specific implementations are swappable providers. All strategy logic, risk management, and ML inference live in a shared core that never changes when swapping providers.

        # Abstract interfaces (core/)
        DataService(ABC)   -> get_tick(), get_dom(), is_connected()
        TradeService(ABC)  -> send_order(), get_fills(), get_position()

        # Provider implementations
        providers/sierra/   -> SierraDataService (CSV poll), SierraTradeService (CSV orders)
        providers/ib/       -> IBDataService (TWS API), IBTradeService (TWS orders)

        # Shared strategy (never changes per broker)
        features.py         -> 54 tick features + PSO
        ensemble.py         -> Multi-model voting
        ai_trading_system.py -> Regime, quality, sizing
        exit_manager.py     -> R-multiple trailing

        # Shared core
        core/position.py    -> PositionTracker + TradeTracker
        core/risk.py        -> RiskManager (daily loss, drawdown)
        core/filters.py     -> EntryFilters (rate gate, vol, trend)
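The two interfaces listed above map naturally onto Python's `abc` module. The method names below come from the listing; the argument lists, return types, and the `ReplayDataService` provider are assumptions added for illustration and are not part of the real code base.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List, Optional


class DataService(ABC):
    """Source of tick and DOM data, independent of any broker."""

    @abstractmethod
    def get_tick(self) -> Optional[Dict[str, Any]]: ...

    @abstractmethod
    def get_dom(self) -> Optional[Dict[str, Any]]: ...

    @abstractmethod
    def is_connected(self) -> bool: ...


class TradeService(ABC):
    """Order routing and position queries, independent of any broker."""

    @abstractmethod
    def send_order(self, side: str, qty: int) -> None: ...

    @abstractmethod
    def get_fills(self) -> List[Dict[str, Any]]: ...

    @abstractmethod
    def get_position(self) -> int: ...


class ReplayDataService(DataService):
    """Hypothetical provider that replays ticks from memory (for tests)."""

    def __init__(self, ticks: Iterable[Dict[str, Any]]) -> None:
        self._ticks = list(ticks)
        self._next = 0

    def get_tick(self) -> Optional[Dict[str, Any]]:
        if self._next >= len(self._ticks):
            return None  # no new tick available
        tick = self._ticks[self._next]
        self._next += 1
        return tick

    def get_dom(self) -> Optional[Dict[str, Any]]:
        return None  # this sketch carries no depth-of-market data

    def is_connected(self) -> bool:
        return True
```

Because strategy code depends only on the abstract classes, a replay provider like this could feed the same pipeline during backtests that a live provider feeds in production.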
      

Current deployment: Parallels Windows 11 VM (ARM64) running Sierra Chart with a C++ DLL (T29-LocalBridge_v39.cpp) that bridges tick data and order flow to a Python bridge via CSV files. The bridge runs a 20ms polling loop, processing ~50 ticks/second during active markets.

State persistence & crash recovery: The system persists state to disk on every trade event. On restart, it reads the last known position from Sierra's fill log and reconciles with the DLL's heartbeat file. A 10-second startup lockout prevents trading on stale state.

Scaling path:

Phase	Status	Description
Current	Live	Parallels VM, Sierra Chart, 1-lot base, $50K eval (account 024)
Phase 2C	Next	Extract 200+ strategy globals, build broker-agnostic orchestrator
Azure VM	Planned	Dedicated cloud VM for Sierra; eliminates Parallels instability
Multi-Broker	Planned	IB providers complete; paper trade, then deploy alongside Sierra
Multi-Lot	Future	Scale from 1-lot to 4-6 lots as track record builds

9. Current Version

Version	Date	Description
v81e	Feb 2026	All 3 models retrained on 200K ticks. ML ensemble determines trade direction. Session-tunable agreement thresholds. 30s minimum hold time. PSO-driven Kelly sizing. 30+ sequential entry gates.

The system follows a strict ablation-validated release process. Every model and strategy component is tested against historical data before deployment. Components that degrade performance are removed immediately -- three rule-based strategies have been killed by this process to date.

Past results are from simulated trading on historical data and do not guarantee future performance. The system is deployed in live production ($50K evaluation). All metrics should be interpreted in the context of a limited dataset and evolving market conditions.
