Gradient-Boosted Scalping on ES Futures

A multi-model approach to automated trade signal generation, combining tick-level LSTMs, gradient-boosted trees, and bar-level time-series models with adaptive risk management and systematic ablation-driven development.

BHF Capital / Keystone Strategy v035.350 | March 2026 | R. G. Williams
1. Abstract

We present a hybrid machine learning system for intraday trading of E-mini S&P 500 futures (ES), deployed in live production on CME via Rithmic execution. The system combines three independently-trained models -- an LSTM for tick-level pattern recognition (5,537 parameters, 64.5% accuracy), a LightGBM classifier on 54 engineered features (69.5% win rate solo, 2x ensemble vote weight), and a bar-level ONNX LSTM for 1-minute time-series prediction (53K parameters, highest solo P&L at +$4,494) -- into a weighted ensemble that requires minimum agreement before any signal reaches execution.

The system processes 54 tick-level features and 11 bar-level technical indicators through a 10-stage decision pipeline with 30+ sequential entry gates, achieving a 92.6% signal rejection rate. Only signals surviving all gates reach the exchange.

Development follows an ablation-first methodology: every model and rule-based strategy is validated through systematic ablation studies on historical data before deployment. Three rule-based strategies (Momentum, Mean Revert, STAT_ARB) were added and subsequently removed after ablation showed they degraded ML ensemble performance. The resulting ML-only configuration achieves a 67.6% win rate, a 1.80 profit factor, and a 16.25x return-to-max-drawdown ratio across 447 simulated trades.

Key claim: In this domain, the ML ensemble consistently performs best when it trades without rule-based overlays. Every rule-based heuristic tested -- momentum, mean-reversion, and statistical arbitrage -- degraded ensemble performance when measured through ablation. The edge comes from letting the models decide, not from layering human intuition on top.

2. Data & Feature Engineering

The system ingests tick-level market data from CME E-mini S&P 500 futures (ES) via Rithmic, captured by a Sierra Chart DLL (T29-LocalBridge_v39.cpp) that writes two CSV streams in real time. The Python bridge reads these files in a 20ms polling loop.

// Tick CSV format (6 fields, v36+)
seq, ms, price, bid_vol, ask_vol, total_vol

// DOM CSV format (10 bid + 10 ask levels)
seq, ms, last_px, bid_vol, ask_vol, N, b0_px, b0_sz, ..., a9_px, a9_sz
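
A minimal sketch of how the Python side might parse one row of the tick stream. The field order comes from the format above; the field types, the `Tick` dataclass, and the `parse_tick_line` helper are assumptions for illustration, not the bridge's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tick:
    # Field order follows the 6-field tick CSV; the types are assumed.
    seq: int
    ms: int
    price: float
    bid_vol: int
    ask_vol: int
    total_vol: int


def parse_tick_line(line: str) -> Optional[Tick]:
    """Parse one tick row; return None for malformed or partially written rows."""
    parts = [p.strip() for p in line.strip().split(",")]
    if len(parts) != 6:
        return None  # e.g. a row the writer has not finished flushing
    try:
        return Tick(
            seq=int(parts[0]),
            ms=int(parts[1]),
            price=float(parts[2]),
            bid_vol=int(parts[3]),
            ask_vol=int(parts[4]),
            total_vol=int(parts[5]),
        )
    except ValueError:
        return None
```

Returning `None` instead of raising lets a polling reader skip a half-written last line and pick it up on the next pass.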

The training dataset consists of ~200K raw ticks collected from live ES H6 futures trading, supplemented by a 614K tick backfill (ticks_RGW.csv). All three models were retrained on February 26, 2026 using the 200K tick dataset with temporal train/validation splits (80/20) and no future-peeking.

The feature engine (features.py) computes 54 tick-level features on every incoming tick, organized into the following categories:

Price Dynamics (12 features): Multi-scale returns (5t, 10t, 20t, 50t, 100t), realized volatility at multiple horizons, momentum indicators, price acceleration, and velocity of price change.

Microstructure (15 features): Bid-ask spread, trade flow imbalance, order flow toxicity (VPIN proxy), DOM imbalance across 10 levels, volume surge detection, and trade intensity metrics.

Session Context (8 features): Position within session range, distance from VWAP, distance from session high/low, range expansion rate, and session-relative volume metrics.

Time Features (19 features): Cyclical encoding of hour/minute (sin/cos), session flags (RTH/ETH), time-to-session-close, day-of-week encoding, and calendar event proximity indicators.
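
The cyclical hour/minute encoding mentioned under Time Features can be sketched as follows. The function name and the output keys are hypothetical; only the sin/cos idea comes from the text.

```python
import math
from datetime import datetime


def cyclical_time_features(ts: datetime) -> dict:
    """Map hour and minute onto the unit circle so 23:59 sits next to 00:00."""
    hour_angle = 2.0 * math.pi * ts.hour / 24.0
    minute_angle = 2.0 * math.pi * ts.minute / 60.0
    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "minute_sin": math.sin(minute_angle),
        "minute_cos": math.cos(minute_angle),
    }
```

A raw hour value would put 23 and 0 at opposite ends of the scale; the sin/cos pair keeps them adjacent, which matters for a market that trades across midnight.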

All tick features are normalized to a [-10, 10] range using rolling statistics. Additionally, an 11-feature bar-level pipeline aggregates 1-minute OHLCV bars and computes standard technical indicators used exclusively by the ONNX model:

# 11 Bar Features (1-min OHLCV aggregation)
RSI(14), ATR(14), MACD signal, MACD histogram,
Bollinger %B, Bollinger bandwidth,
Stochastic %K, Stochastic %D,
EMA ratio (8/21), Volume ratio, Close-to-open return

The PSO (Oscillator Power) indicator is computed from a 500-tick stochastic oscillator smoothed by EMA-25, producing a value in [-1.0, +1.0]. PSO drives the position sizing tier system and provides momentum alignment context to the ensemble.
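
One way to read the PSO description is sketched below. The window (500 ticks), the EMA period (25), and the output range come from the text; the rescaling of the stochastic %K to [-1, +1], the EMA seeding, and the class name `PsoSketch` are assumptions.

```python
from collections import deque
from typing import Optional


class PsoSketch:
    """Stochastic oscillator over a tick window, rescaled to [-1, 1], EMA-smoothed."""

    def __init__(self, window: int = 500, ema_period: int = 25):
        self.prices = deque(maxlen=window)
        self.alpha = 2.0 / (ema_period + 1)  # standard EMA smoothing factor
        self.value: Optional[float] = None   # stays None until the window is full

    def update(self, price: float) -> Optional[float]:
        self.prices.append(price)
        if len(self.prices) < self.prices.maxlen:
            return None  # warmup: not enough ticks yet
        lo, hi = min(self.prices), max(self.prices)
        # %K in [0, 1] mapped to [-1, 1]; a flat window is treated as neutral.
        raw = 0.0 if hi == lo else 2.0 * (price - lo) / (hi - lo) - 1.0
        if self.value is None:
            self.value = raw  # seed the EMA with the first full-window reading
        else:
            self.value = self.alpha * raw + (1.0 - self.alpha) * self.value
        return self.value
```

Because each smoothed value is a convex combination of readings in [-1, 1], the output cannot leave that range. The `None` return during warmup mirrors the 500-tick warmup requirement.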

3. Model Architectures

The ensemble comprises three independently-trained models, each designed to capture different temporal scales and feature representations. All models produce directional signals (BUY/SELL) with confidence scores. No single model can trigger a trade alone.

F1 LSTM (The Founder): tick-level pattern recognition

The original model that proved an edge exists in this data. A compact LSTM operating on raw tick-aggregated OHLCV bars with bracket-based labeling.

Stats: 5,537 params; 64.5% accuracy; (1,7,5) input shape; 20t/bar aggregation

Input: 7-bar lookback of OHLCV data aggregated from 20-tick bars. Labeling: symmetric bracket (+/-5.0 pts, 2.5 pt threshold). Output: tanh activation producing BUY/SELL signal.

Vote weight: 1x

F2 LightGBM (The Sniper): highest quality per trade

Fires rarely, but when it speaks, the ensemble listens. 200-tree gradient-boosted classifier across all 54 tick features with 2x vote weight.

Stats: 200 trees; 54.6% accuracy; 54 features; 69.5% solo WR

Input: full 54-feature vector computed at each tick. Labeling: trend-based with 100-tick lookahead. Output: predict_proba producing STRONG_BUY/BUY/SELL signals. Solo performance: 131 trades, PF 1.87.

Vote weight: 2x (highest quality)

ONNX 1-Min (The Workhorse): highest solo P&L

A larger LSTM operating on 1-minute bar aggregations with technical analysis features. The volume earner that sees the bigger picture.

Stats: 53K params; 54.3% accuracy; (1,60,11) input; 30-min horizon

Input: 60-bar lookback of 11 TA features on 1-minute OHLCV. Architecture: 2-layer LSTM(64) with dropout. Output: sigmoid activation, 30-minute prediction horizon. Solo: 278 trades, +$4,494 net.

Vote weight: 1x

Training methodology: All models use temporal train/validation splits (80/20) with strict temporal ordering -- no shuffling, no future information leakage. F1 and ONNX use class-balanced sampling to handle the inherent imbalance between directional labels. F2 LightGBM uses scale_pos_weight for class balance. All three models were retrained on February 26, 2026 on 200K ticks, producing measurable accuracy improvements: F1 64.5% (was 55.5%), F2 54.6% (was 52.9%), ONNX 54.3% (was 53.8%).
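
The temporal split described above amounts to cutting the time-ordered data once, with no shuffling. A minimal sketch, assuming the rows are already sorted by time; `temporal_split` is a hypothetical helper name:

```python
def temporal_split(rows, train_frac=0.8):
    """Split time-ordered rows into (train, validation) without shuffling.

    Everything before the cut is training data and everything after it is
    validation data, so no validation sample precedes a training sample.
    """
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must be strictly between 0 and 1")
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]
```

Labels that look ahead (for example a 100-tick lookahead) still straddle the cut for the last few training rows, so a stricter setup would also drop a gap of that length at the boundary.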

4. Ensemble & Decision Pipeline

The three models vote independently on every tick. Votes are weighted: F1 contributes 1 vote, F2 contributes 2 votes (reflecting its superior per-trade quality), and ONNX contributes 1 vote, for a total of 4 votes. The ensemble computes:

consensus_score = sum(signal_values)  // BUY=+1, SELL=-1, per vote
agreement_ratio = max(buy_votes, sell_votes) / total_votes

// Session-tunable thresholds
RTH, ETH_PRE: agreement >= 25%
ETH_POST, ASIA, EUROPE: agreement >= 40%
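
The voting rule can be sketched in Python as below. The weights and thresholds are taken from the text; the abstain value (0), the tie handling, and the function and constant names are assumptions, and the production ensemble may differ.

```python
VOTE_WEIGHTS = {"F1": 1, "F2": 2, "ONNX": 1}
AGREEMENT_THRESHOLD = {
    "RTH": 0.25, "ETH_PRE": 0.25,
    "ETH_POST": 0.40, "ASIA": 0.40, "EUROPE": 0.40,
}


def ensemble_decision(signals, session):
    """signals maps model name -> +1 (BUY), -1 (SELL) or 0 (abstain, assumed).

    Returns (side, consensus_score, agreement_ratio); side is None when the
    ensemble does not agree strongly enough to produce a signal.
    """
    buy_votes = sum(VOTE_WEIGHTS[m] for m, s in signals.items() if s > 0)
    sell_votes = sum(VOTE_WEIGHTS[m] for m, s in signals.items() if s < 0)
    total_votes = sum(VOTE_WEIGHTS.values())  # 4 with the weights above
    consensus_score = buy_votes - sell_votes  # +1 per BUY vote, -1 per SELL vote
    agreement_ratio = max(buy_votes, sell_votes) / total_votes
    if consensus_score == 0 or agreement_ratio < AGREEMENT_THRESHOLD[session]:
        return None, consensus_score, agreement_ratio
    side = "BUY" if consensus_score > 0 else "SELL"
    return side, consensus_score, agreement_ratio
```

For example, F1 and F2 voting BUY against an ONNX SELL gives 3 buy votes out of 4, an agreement ratio of 0.75, and a BUY signal in any session.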

AI direction override: In earlier iterations, a mean-revert pipeline determined trade direction and models voted on confidence. The current system lets ML models directly determine trade side through ensemble consensus. This eliminated directional conflicts between rule-based signals and model predictions.

The full decision pipeline consists of 10 stages. A signal must survive all stages to become an order on the exchange:

| Stage | Name | Description | Failure Mode |
|---|---|---|---|
| 1 | Tick Input | CME ES tick stream via Rithmic/Sierra DLL | No new tick: skip |
| 2 | Feature Extraction | 54 tick features + PSO + 11 bar features | delta_pts == 0: skip |
| 3 | Model Inference | F1(1x) + F2(2x) + ONNX(1x) ensemble vote | Agreement below threshold |
| 4 | AI Evaluation | Regime detection, quality score, Kelly sizing | Confidence < 0.20 |
| 5 | Gate Stack | 30+ sequential entry filters | Any gate fails: blocked |
| 6 | PSO + Sizing | Oscillator alignment, final position size | CONTRA reduces to 0.5x |
| 7 | Order Execution | New entry or pyramid add via CSV/DLL | Position conflict |
| 8 | Bracket Structure | OCO stop + target, R-multiple trailing | Exchange-managed |
| 9 | Exit Management | Tick-by-tick evaluation while in position | Various exit triggers |
| 10 | Post-Trade | Record result, update statistics, cooldown | Cooldown enforced |

5. Risk Management

The risk framework operates at three levels: per-trade brackets, per-session budgets, and per-day global limits. The gate stack at Stage 5 is deliberately aggressive -- it is cheaper to miss a good trade than to take a bad one. Each gate was added to fix a specific failure mode observed in live or simulated trading.

30+ Sequential Entry Gates (92.6% rejection rate on backtest, Feb 24):

01. Startup Block: 10s hard lock after bridge start
02. Sync Lockout: Position sync in progress
03. Rate Limit: 5s between orders, 6/min max
04. Post-Exit Cooldown: 30-60s per session after exit
05. Anti-Hammer: Same price + direction + 600s block
06. Chop Cooldown: 3+ losses triggers 600s pause
07. Edge Detector: Bayesian/Kelly/CUSUM analysis
08. Calendar Blackout: FOMC, NFP, CPI event blocks
09. Session Budget: Per-session loss limit enforcement
10. Risk Shutdown: Global daily loss hard kill
11. Smart Entry Filters: Range position, VWAP proximity
12. Spread Filter: RTH 1t, Pre/Post 2t, EUR 3t, Asia 4t
13. Position Check: Flat for entry, same-dir for pyramid
14. Momentum Opposition: Blocks against strong momentum
15. Min Hold Time: 30s before stop evaluation
16. PSO Warmup: Requires 500 ticks of data
17. Volatility/Trend: Regime-specific filters
18. Breakout Override: |accel| >= 8 bypasses range filter
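
A sequential gate stack can be sketched as an ordered list of checks that stops at the first failure. The example below implements only the rate-limit gate (5s between orders, 6 per minute) and reads "6/min" as a rolling 60-second window, which is an assumption; the class and function names are hypothetical.

```python
from collections import deque


class RateLimitGate:
    """At least min_gap_s between orders and at most max_per_min per rolling minute."""

    def __init__(self, min_gap_s=5.0, max_per_min=6):
        self.min_gap_s = min_gap_s
        self.max_per_min = max_per_min
        self.sent = deque()  # timestamps (seconds) of recent orders

    def allows(self, now):
        # Forget orders that have left the rolling 60-second window.
        while self.sent and now - self.sent[0] >= 60.0:
            self.sent.popleft()
        if self.sent and now - self.sent[-1] < self.min_gap_s:
            return False
        return len(self.sent) < self.max_per_min

    def record_order(self, now):
        self.sent.append(now)


def first_blocking_gate(gates, now):
    """Evaluate (name, check) pairs in order; return the first failing name or None."""
    for name, check in gates:
        if not check(now):
            return name
    return None
```

Returning the name of the blocking gate rather than a bare boolean makes it straightforward to count rejections per gate, which is how a figure like the overall rejection rate can be broken down.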

Session Budgets:

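| Session | Stop | Target | Ensemble | Cooldown | Spread | Loss Limit |
|---|---|---|---|---|---|---|
| RTH | 10t | 25t | 25% | 30s | 1t | $3,000 |
| ETH Pre | 10t | 25t | 25% | 40s | 2t | $350 |
| ETH Post | 10t | 16t | 40% | 60s | 2t | $350 |
| ETH Asia | 12t | 16t | 40% | 60s | 4t | $350 |
| ETH Europe | 10t | 16t | 40% | 50s | 3t | $350 |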

PSO-driven Kelly sizing determines position size through a 9-layer soft multiplier stack. The PSO alignment value determines the sizing tier: TURBO (pso >= 0.5, 2x multiplier), BOOST (pso >= 0.25, 1.5x), NORMAL (1x), CONTRA (pso <= -0.5, 0.5x). After 3+ consecutive losses, sizing is forced to 1 lot regardless of Kelly or PSO. Maximum position: 10 lots.
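
The tier thresholds can be sketched as follows. Only the PSO tier layer, the loss-streak override, and the 10-lot cap are modeled; `base_lots` stands in for whatever the Kelly calculation and the other multiplier layers produce, and rounding down to whole lots is an assumption.

```python
def pso_tier(alignment):
    """Map the PSO alignment value to (tier name, size multiplier)."""
    if alignment >= 0.5:
        return "TURBO", 2.0
    if alignment >= 0.25:
        return "BOOST", 1.5
    if alignment <= -0.5:
        return "CONTRA", 0.5
    return "NORMAL", 1.0


def position_size(base_lots, alignment, consecutive_losses, max_lots=10):
    """Apply the PSO tier to a base size, then the loss-streak and cap rules."""
    if consecutive_losses >= 3:
        return 1  # forced to 1 lot regardless of Kelly or PSO
    _, multiplier = pso_tier(alignment)
    lots = int(base_lots * multiplier)    # round down to whole contracts (assumed)
    return max(1, min(max_lots, lots))    # never below 1 lot, never above the cap
```

With a 4-lot base, a TURBO reading sizes the trade at 8 lots, while the same reading after three straight losses sizes it at 1.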

Bracket management: Every entry receives an exchange-managed OCO bracket (hard stop + profit target). R-multiple trailing activates at R=0.6 (breakeven), R=1.0 (5t trail RTH / 4t ETH), R=1.5 (tight 3t/2t trail), and R=4.0 (take profit). Minimum 30-second hold time before any stop evaluation prevents the instant-stop toxicity identified in deep analysis.
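
The trailing ladder can be read as a lookup from the current R-multiple to an action. In the sketch below, R is taken to be open profit divided by the initial stop distance, which is the usual definition but is an assumption here; the function names are hypothetical. The ES tick size of 0.25 index points is standard.

```python
ES_TICK_SIZE = 0.25  # index points per tick on ES


def r_multiple(entry, price, stop_ticks, side):
    """Open profit expressed in units of the initial stop distance."""
    direction = 1 if side == "BUY" else -1
    return direction * (price - entry) / (stop_ticks * ES_TICK_SIZE)


def trailing_action(r, session):
    """Return (action, trail_ticks) for the highest stage reached."""
    is_rth = session == "RTH"
    if r >= 4.0:
        return "take_profit", None
    if r >= 1.5:
        return "trail", 3 if is_rth else 2
    if r >= 1.0:
        return "trail", 5 if is_rth else 4
    if r >= 0.6:
        return "breakeven", 0
    return "hold", None
```

With the 10-tick RTH stop, a long from 5000.00 reaches R=1.0 at 5002.50, at which point the stop trails 5 ticks behind price.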

6. Ablation Studies

All strategic decisions are validated through systematic ablation. The methodology: monkey-patch the ensemble configuration, replay the backtester on the same historical dataset, and compare 9+ configurations on identical data. This eliminates market condition variance and isolates the contribution of each component.

Central finding: Every rule-based strategy tested degraded ML ensemble performance. Three strategies were added and subsequently removed after ablation:

Momentum (removed): Was blocking profitable ML trades. Ensemble P&L with Momentum: +$5,205. Without: +$10,663. A 105% improvement from removing it.

Mean Revert (removed): ML trio alone: +$13,353, 65.7% WR, PF 1.49. With Mean Revert: +$8,442, PF 1.19. A 58% improvement from removal.

STAT_ARB (removed): Tested twice, removed twice. Only profitable in RTH (+$2,001); catastrophic in ETH Europe (-$1,211, 41.5% WR). ML-only PF 1.80 vs combined 1.58.

ML vs. Rule-Based Ablation (644 trades):

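| Configuration | Trades | Win Rate | PF | Net P&L | MaxDD | R/MDD |
|---|---|---|---|---|---|---|
| ML-only (trio) | 447 | 67.6% | 1.80 | +$12,633 | $777 | 16.25x |
| SA-only | 197 | 53.8% | 1.15 | +$616 | $1,851 | 0.33x |
| Combined (ML+SA) | 644 | 63.4% | 1.58 | +$13,249 | $1,744 | 7.60x |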
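
As a quick arithmetic check, the R/MDD column is net P&L divided by maximum drawdown. The sketch below recomputes it from the rounded dollar figures in the table, so small differences in the last digit are expected; the helper name is hypothetical.

```python
def return_to_max_drawdown(net_pnl, max_drawdown):
    """Net P&L divided by the largest peak-to-trough loss."""
    return net_pnl / max_drawdown


# (net P&L, max drawdown) in dollars, copied from the table above.
ablation_rows = {
    "ML-only (trio)": (12633, 777),
    "SA-only": (616, 1851),
    "Combined (ML+SA)": (13249, 1744),
}

for name, (net, mdd) in ablation_rows.items():
    print(f"{name}: {return_to_max_drawdown(net, mdd):.2f}x")

# The combined run is the sum of the two components, in trades and in P&L.
combined_trades = 447 + 197
combined_pnl = 12633 + 616
```

The combined configuration earns slightly more in dollars than ML-only, but its drawdown more than doubles, which is why its R/MDD falls to less than half.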

Model Solo & Pair Performance:

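| Configuration | Trades | Win Rate | PF | Net P&L |
|---|---|---|---|---|
| ML Trio (no MR) | 447 | 65.7% | 1.49 | +$13,353 |
| ONNX Solo | 278 | 62.9% | 1.36 | +$4,494 |
| F2 Solo | 131 | 69.5% | 1.87 | +$4,029 |
| F1+F2 Pair | -- | 67.8% | 2.01 | -- |
| ML + MR (combined) | -- | -- | 1.19 | +$8,442 |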

Pattern: All three rule-based strategies (Momentum, Mean Revert, STAT_ARB) were killed by the same ablation process. The ML ensemble consistently performs better without rule-based overlays. This is the central empirical result of this work: let the models decide.

7. Statistical Results

Performance metrics from the ablation study (ML-only configuration) and deep analysis conducted on 148K ticks of historical data. All results are from simulated trading on historical data; past results do not guarantee future performance.

Headline metrics (ML-only, 447 trades):

| Metric | Value |
|---|---|
| Win Rate | 67.6% |
| Profit Factor | 1.80 |
| Net P&L | +$12,633 |
| Max Drawdown | $777 |
| Return / MaxDD | 16.25x |
| Signal Rejection Rate | 92.6% |
| Total Signals Generated | ~6,000 |
| Trades Executed | 447 |

Deep analysis findings (148K ticks, February 21, 2026):

| Finding | Detail | Action Taken |
|---|---|---|
| Best Hour | 11:00 ET = +$3,266 net | No time blocks applied (models trade all hours) |
| Kill Zones | 10:00, 12:00, 19:00 ET (negative or coin flip) | Observed but not hard-blocked; models adapt |
| Instant Stops | 37 trades under 10s = -$3,022 at 27% WR | Fixed: 30s min hold time before stop checks |
| Hot Hand Fallacy | Win streaks do not predict next trade outcome | No streak-based sizing (confirmed independent events) |
| Walk-Forward | Edge not perfectly consistent every week | Week 2 lost $628; net P&L positive over all weeks combined |
| F2 Fire Rate | 1 per 682 ticks, 92% in RTH | F2 given 2x vote weight due to quality |

Model comparison (solo performance):

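| Model | Accuracy | Solo WR | Solo PF | Solo P&L | Trades | Vote Weight |
|---|---|---|---|---|---|---|
| F1 LSTM | 64.5% | -- | -- | -- | -- | 1x |
| F2 LightGBM | 54.6% | 69.5% | 1.87 | +$4,029 | 131 | 2x |
| ONNX 1m | 54.3% | 62.9% | 1.36 | +$4,494 | 278 | 1x |
| Ensemble (trio) | -- | 67.6% | 1.80 | +$12,633 | 447 | -- |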

Note on confidence intervals: With 447 trades at 67.6% WR, the 95% confidence interval for the true win rate is approximately 63.1% -- 71.9% (Wilson score interval). The profit factor of 1.80 and R/MDD of 16.25x provide additional evidence of a statistically meaningful edge, though the dataset covers a limited time period and market conditions may change.
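
The Wilson score interval can be recomputed with the standard formula. The win count is not given in the text, so the sketch assumes 302 wins (67.6% of 447, rounded); with that assumption the result lands within a few tenths of a percentage point of the interval quoted above.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = wins / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed win count: 67.6% of 447 trades, rounded to the nearest trade.
assumed_wins = round(0.676 * 447)
low, high = wilson_interval(assumed_wins, 447)
print(f"95% Wilson interval: {low:.1%} to {high:.1%}")
```

The lower bound stays well above 50%, which is the property the note relies on. The interval treats trades as independent draws, so it says nothing about regime change outside the sampled period.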

8. Architecture

The system is designed around a broker-agnostic abstraction layer. Two abstract base classes -- DataService (tick/DOM data) and TradeService (orders/fills) -- define the interface. Broker-specific implementations are swappable providers. All strategy logic, risk management, and ML inference live in a shared core that never changes when swapping providers.

# Abstract interfaces (core/)
DataService(ABC) -> get_tick(), get_dom(), is_connected()
TradeService(ABC) -> send_order(), get_fills(), get_position()

# Provider implementations
providers/sierra/ -> SierraDataService (CSV poll), SierraTradeService (CSV orders)
providers/ib/ -> IBDataService (TWS API), IBTradeService (TWS orders)

# Shared strategy (never changes per broker)
features.py -> 54 tick features + PSO
ensemble.py -> Multi-model voting
ai_trading_system.py -> Regime, quality, sizing
exit_manager.py -> R-multiple trailing

# Shared core
core/position.py -> PositionTracker + TradeTracker
core/risk.py -> RiskManager (daily loss, drawdown)
core/filters.py -> EntryFilters (rate gate, vol, trend)
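
A minimal sketch of the two interfaces in Python. The class and method names come from the listing above; the argument lists, return types, the `PaperTradeService` provider, and the `flatten` helper are hypothetical and only illustrate how strategy code can depend on the interface alone.

```python
from abc import ABC, abstractmethod


class DataService(ABC):
    """Market data interface; signatures are assumed, not taken from the codebase."""

    @abstractmethod
    def get_tick(self):
        """Return the next tick, or None if nothing new has arrived."""

    @abstractmethod
    def get_dom(self):
        """Return the latest depth-of-market snapshot."""

    @abstractmethod
    def is_connected(self) -> bool:
        """Report whether the feed is alive."""


class TradeService(ABC):
    """Order and fill interface; signatures are assumed."""

    @abstractmethod
    def send_order(self, side: str, qty: int) -> None: ...

    @abstractmethod
    def get_fills(self) -> list: ...

    @abstractmethod
    def get_position(self) -> int: ...


class PaperTradeService(TradeService):
    """Hypothetical in-memory provider that fills every order immediately."""

    def __init__(self):
        self._fills = []
        self._position = 0

    def send_order(self, side, qty):
        self._position += qty if side == "BUY" else -qty
        self._fills.append((side, qty))

    def get_fills(self):
        return list(self._fills)

    def get_position(self):
        return self._position


def flatten(trade: TradeService) -> None:
    """Strategy-side helper written against the interface, not a provider."""
    position = trade.get_position()
    if position > 0:
        trade.send_order("SELL", position)
    elif position < 0:
        trade.send_order("BUY", -position)
```

Because `flatten` only touches `TradeService` methods, the same function would work unchanged against a Sierra or IB provider.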

Current deployment: Parallels Windows 11 VM (ARM64) running Sierra Chart with a C++ DLL (T29-LocalBridge_v39.cpp) that bridges tick data and order flow to a Python bridge via CSV files. The bridge runs a 20ms polling loop, processing ~50 ticks/second during active markets.

State persistence & crash recovery: The system persists state to disk on every trade event. On restart, it reads the last known position from Sierra's fill log and reconciles with the DLL's heartbeat file. A 10-second startup lockout prevents trading on stale state.

Scaling path:

| Phase | Status | Description |
|---|---|---|
| Current | Live | Parallels VM, Sierra Chart, 1-lot base, $50K eval (account 024) |
| Phase 2C | Next | Extract 200+ strategy globals, build broker-agnostic orchestrator |
| Azure VM | Planned | Dedicated cloud VM for Sierra; eliminates Parallels instability |
| Multi-Broker | Planned | IB providers complete; paper trade, then deploy alongside Sierra |
| Multi-Lot | Future | Scale from 1-lot to 4-6 lots as track record builds |

9. Current Version

| Version | Date | Description |
|---|---|---|
| v81e | Feb 2026 | All 3 models retrained on 200K ticks. ML ensemble determines trade direction. Session-tunable agreement thresholds. 30s minimum hold time. PSO-driven Kelly sizing. 30+ sequential entry gates. |

The system follows a strict ablation-validated release process. Every model and strategy component is tested against historical data before deployment. Components that degrade performance are removed immediately -- three rule-based strategies have been killed by this process to date.

Past results are from simulated trading on historical data and do not guarantee future performance. The system is deployed in live production ($50K evaluation). All metrics should be interpreted in the context of a limited dataset and evolving market conditions.