The Three Gates — How CoinClaw Decides Which Bots Deserve Real Money
Key Takeaways
- Strategy validation requires passing three gates: a permutation test for statistical edge, walk-forward analysis for generalization, and regime analysis for robustness
- Only 2 of the 6 bots tested so far pass Gate 1 — and three bots are trading real money without passing it
- Passing all three gates doesn't guarantee profit, but failing any gate is strong evidence a strategy shouldn't trade real capital
Anyone can build a trading bot that looks profitable in backtesting. The hard part is knowing whether that profitability is real or just noise. CoinClaw's three-gate validation framework exists to answer that question before real money is at risk.
Most bots fail. Here's how each gate works and what it catches.
Why Validation Matters
Here's the uncomfortable truth about backtesting: any strategy can be made to look profitable on historical data. Add enough parameters, optimize enough thresholds, and you'll find a combination that would have made money in the past. That's not a strategy — that's overfitting.
The three-gate framework is designed to catch overfitting before it costs real money. Each gate tests a different aspect of strategy validity:
- Gate 1: Is the edge real, or is it random noise?
- Gate 2: Does the strategy work on data it hasn't seen?
- Gate 3: Does it work in different market conditions?
A strategy that passes all three gates isn't guaranteed to make money. But a strategy that fails any gate is almost certainly not worth trading with real capital.
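The decision logic itself is simple once each gate produces a number. A minimal sketch, using the thresholds the framework applies (p < 0.05, WFE > 0.5, positive Sharpe in at least one regime):

```python
def passes_all_gates(p_value: float, wfe: float, regime_sharpes: list[float]) -> bool:
    """Three-gate go/no-go decision.

    Gate 1: permutation-test p-value below 0.05 (edge is not noise).
    Gate 2: walk-forward efficiency above 0.5 (edge survives unseen data).
    Gate 3: positive Sharpe ratio in at least one market regime.
    """
    gate1 = p_value < 0.05
    gate2 = wfe > 0.5
    gate3 = any(s > 0 for s in regime_sharpes)
    return gate1 and gate2 and gate3
```

Plugging in the scorecard numbers: V3.8 ETH Grid (p=0.003, WFE=2.559, bull Sharpe +0.218) passes; V3.5 Grid fails at the first gate on p=0.938 alone.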
Gate 1: Statistical Edge (Permutation Test)
Question: Is this strategy's performance statistically distinguishable from random trading?
Method: Permutation testing. Randomly shuffle the alignment between the strategy's positions and the market's returns thousands of times, recomputing performance for each shuffle. Compare the real performance against the distribution of these random performances. If the real performance is better than 95% of the random shuffles, the strategy has a statistically significant edge.
Threshold: p < 0.05 (the strategy's returns are in the top 5% of random permutations)
What it catches: Strategies that look profitable but are actually just lucky. If you flip a coin 100 times, you'll occasionally get 60 heads. That doesn't mean the coin is biased — it means you got lucky. Gate 1 is the equivalent of checking whether the coin is actually biased.
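A minimal sketch of this test, assuming the strategy is represented as a series of daily positions (+1 long, -1 short, 0 flat) aligned with daily market returns. Shuffling the positions destroys any timing skill while preserving the position distribution, so the shuffled runs form the "random trading" baseline:

```python
import numpy as np

def permutation_p_value(positions, market_returns, n_perm=10_000, seed=0):
    """Gate 1 sketch: fraction of random re-timings of the strategy's
    positions that perform as well as or better than the real timing."""
    rng = np.random.default_rng(seed)
    positions = np.asarray(positions, dtype=float)
    market_returns = np.asarray(market_returns, dtype=float)
    real = np.sum(positions * market_returns)        # real strategy return
    shuffled = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(positions)            # random re-timing of positions
        shuffled[i] = np.sum(perm * market_returns)
    return float(np.mean(shuffled >= real))          # one-sided p-value
```

A p-value of 0.003, like V3.8's, means only 0.3% of the random re-timings matched the real run — strong evidence the entry timing is doing real work.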
Current Results
| Bot | p-value | Result |
|---|---|---|
| V3.8 ETH Grid | 0.003 | ✅ Strong edge (0.3% chance of random) |
| BTC Grid Range | 0.030 | ✅ Significant edge (3% chance of random) |
| V3.6 F&G | 0.114 | ❌ Not significant (11.4% chance of random) |
| V3.5 Grid | 0.938 | ❌ No edge (93.8% chance of random) |
| ETH Mean Rev | 0.000 | ❌ Negative edge (worse than random) |
| SOL Breakout | 0.000 | ❌ Negative edge (worse than random) |
V3.5's p-value of 0.938 is striking. It means 93.8% of random trading strategies would have performed as well or better. V3.5 is trading real money with essentially no statistical evidence of edge.
Gate 2: Walk-Forward Efficiency (WFE)
Question: Does the strategy perform on data it wasn't optimized on?
Method: Walk-forward analysis. Split the historical data into in-sample (training) and out-of-sample (testing) periods. Optimize the strategy on the in-sample data, then test it on the out-of-sample data. Walk-Forward Efficiency is the ratio of out-of-sample performance to in-sample performance.
Threshold: WFE > 0.5 (out-of-sample performance is at least 50% of in-sample)
What it catches: Overfitting. A strategy that scores WFE = 0.1 means it performed 10x worse on unseen data than on training data. That's a classic sign of curve-fitting — the strategy memorized the training data rather than learning a generalizable pattern.
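A minimal sketch of the walk-forward loop, assuming hypothetical `optimize` and `backtest` callables supplied by the strategy being tested (neither is from the article; they stand in for whatever fitting and scoring the bot uses):

```python
import numpy as np

def walk_forward_efficiency(returns, optimize, backtest, train_len, test_len):
    """Gate 2 sketch: roll a train/test window over a return series.

    `optimize(train)` fits strategy parameters on the training slice;
    `backtest(slice, params)` scores those parameters on any slice.
    WFE = mean out-of-sample score / mean in-sample score.
    """
    is_scores, oos_scores = [], []
    start = 0
    while start + train_len + test_len <= len(returns):
        train = returns[start : start + train_len]
        test = returns[start + train_len : start + train_len + test_len]
        params = optimize(train)
        is_scores.append(backtest(train, params))
        oos_scores.append(backtest(test, params))
        start += test_len                            # slide forward by one test window
    return float(np.mean(oos_scores) / np.mean(is_scores))
```

A perfectly curve-fit strategy scores near zero out-of-sample, driving WFE toward 0; a strategy whose edge generalizes keeps WFE near (or occasionally above) 1.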
What the Numbers Mean
- WFE < 0.5: Strategy is likely overfit. Out-of-sample performance degrades significantly.
- WFE = 0.5–1.0: Acceptable. Strategy retains most of its edge on unseen data.
- WFE > 1.0: Unusual. Out-of-sample performance is better than in-sample. This can happen when the out-of-sample period has more favorable market conditions.
V3.8 ETH Grid scored WFE = 2.559 — its out-of-sample performance was 2.5x better than in-sample. This is rare and suggests the strategy captures a real market dynamic that was even more pronounced in the test period. BTC Grid Range scored WFE = 0.745, which is solid — it retained 74.5% of its edge on unseen data.
Gate 3: Regime Robustness
Question: Does the strategy work in different market conditions?
Method: Classify historical data into market regimes — bull, bear, and sideways — and calculate the Sharpe ratio for each regime separately. The strategy must show a positive Sharpe ratio in at least one regime.
Threshold: Positive Sharpe ratio in ≥1 regime
What it catches: Strategies that only work in one specific market condition that happened to dominate the backtest period. A strategy that's profitable overall but has negative Sharpe in all regimes when analyzed separately is probably benefiting from a single lucky period rather than a robust edge.
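The per-regime Sharpe calculation can be sketched as follows, assuming regime labels ("bull", "bear", "sideways") have already been assigned per period; 365 periods per year reflects daily crypto data, which trades every day:

```python
import numpy as np

def sharpe_by_regime(returns, regimes, periods_per_year=365):
    """Gate 3 sketch: annualized Sharpe ratio of strategy returns
    computed separately within each regime label."""
    returns, regimes = np.asarray(returns, dtype=float), np.asarray(regimes)
    sharpes = {}
    for regime in np.unique(regimes):
        r = returns[regimes == regime]
        if len(r) < 2 or r.std(ddof=1) == 0:
            sharpes[str(regime)] = float("nan")      # too little data to score
        else:
            sharpes[str(regime)] = float(
                r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year)
            )
    return sharpes

def passes_gate3(sharpes):
    # Positive Sharpe in at least one regime (NaN entries don't count)
    return any(s > 0 for s in sharpes.values() if s == s)
```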
Why Regime Testing Matters
Markets cycle through regimes. A strategy that only works in bull markets will lose money during bear markets and sideways periods. If you don't know which regime your strategy works in, you don't know when to expect losses.
V3.8 ETH Grid has a positive Sharpe in bull regime (+0.218). This is honest — it tells you the strategy is designed for bull markets and may underperform in other conditions. The regime filter in V3.8 addresses this by reducing activity in non-bull regimes.
The Current Scorecard
| Bot | Gate 1 | Gate 2 | Gate 3 | Live? |
|---|---|---|---|---|
| V3.8 ETH Grid | ✅ p=0.003 | ✅ WFE=2.559 | ✅ Bull +0.218 | Deploying |
| BTC Grid Range | ✅ p=0.030 | ✅ WFE=0.745 | ✅ (passed) | Paper only |
| V3.5 Grid | ❌ p=0.938 | — | — | Live ⚠️ |
| V3.6 F&G | ❌ p=0.114 | — | — | Live ⚠️ |
| V3.7 Scalper | Not tested | — | — | Live ⚠️ |
| BTC Trend | Not tested | — | — | Paper |
| ETH Mean Rev | ❌ p=0.000 | — | — | Paused |
| SOL Breakout | ❌ p=0.000 | — | — | Paused |
The uncomfortable pattern: three bots are trading real money (V3.5, V3.6, V3.7) without passing Gate 1. The two bots that passed all three gates are either paper-only (BTC Grid Range) or just now deploying (V3.8). The validation framework was built after the first bots went live — it's being applied retroactively.
What the Gates Teach Us
1. Most Strategies Don't Have an Edge
Of the 6 strategies tested against Gate 1, only 2 passed. That's a 33% pass rate — and that's probably generous, since these strategies were already selected as the most promising candidates. In a broader universe of trading strategies, the pass rate would be much lower.
2. Passing Gate 1 Isn't Enough
A strategy can have a statistically significant edge (Gate 1) but still be overfit (Gate 2) or regime-dependent (Gate 3). All three gates are necessary. V3.6 came close on Gate 1 (p=0.114) but would likely fail Gate 2 — its Fear & Greed sentiment filter is based on a single indicator that may not generalize.
3. Validation Before Live Trading Is Non-Negotiable
V3.5 has been running with real money for over a month with zero realized PnL and a p-value of 0.938. That's real capital sitting in a strategy with no statistical evidence of edge. The three-gate framework exists to prevent exactly this situation.
4. The Framework Is Honest About Limitations
Passing all three gates doesn't guarantee profitability. V3.8 has a positive Sharpe only in bull markets — if the market turns bearish, it will underperform. The framework doesn't promise success; it filters out strategies that are almost certainly going to fail.
Bottom Line
The three-gate framework is simple: test for edge, test for generalization, test for robustness. Most strategies fail at Gate 1. The ones that pass all three gates aren't guaranteed to make money, but they've earned the right to try with real capital.
V3.8 is the first bot to earn that right through the full framework. Whether it delivers is the next chapter of the competition.