The seven ways a backtest lies
A backtest that says "28% CAGR" can mean one of two things. Either you have a genuine edge and the next year might deliver close to that. Or — far more commonly — you've accidentally baked knowledge about the future, or the specific history of your data, into rules that will immediately collapse in live trading. This lesson catalogs the seven most common failures, in roughly the order they bite.
1. Lookahead bias
What it is: your rules use data that wouldn't have been available at the decision moment.
The classic form: computing today's moving average including today's close, then acting on a cross detected today. In reality, today's close is known at 4:00pm; you can't enter the trade before 4:00pm using that cross signal. Kaufman's caution (kaufman.txt:39316):
If you use today's close to trigger today's entry at today's open, you've built time travel into your backtest.
Subtler forms that catch everyone eventually:
- Using adjusted close for splits/dividends that were announced later than your signal date
- Normalizing with max or min over the full series (including the future)
- Z-score normalization using full-sample mean and std instead of an expanding window
- Delisted-symbol exclusion applied to the historical series before the delisting actually happened
- Indicator parameter selection done with full-sample knowledge
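The expanding-window point from the list above is worth seeing concretely. A minimal sketch (function names are mine, not from the text) contrasting the leaky full-sample normalization with the clean expanding version:

```python
import numpy as np

def zscore_full_sample(x):
    # LEAKY: mean and std are computed over the whole series, so every
    # early value is normalized with knowledge of the future
    return (x - x.mean()) / x.std()

def zscore_expanding(x, min_periods=20):
    # CLEAN: at each bar, use only data up to and including that bar
    out = np.full(len(x), np.nan)
    for i in range(min_periods - 1, len(x)):
        window = x[: i + 1]
        out[i] = (x[i] - window.mean()) / window.std()
    return out
```

On a trending series the two disagree badly: the full-sample version makes early bars look "cheap" relative to a future mean the strategy could never have known.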
Detection: rerun the backtest with every signal delayed by one bar. If performance collapses, you had lookahead. If it barely moves, you're likely clean.
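That diagnostic can be sketched in a few lines, assuming per-bar signals and returns as aligned NumPy arrays (the helper names are illustrative):

```python
import numpy as np

def backtest_returns(signal, returns):
    # Per-bar position times that bar's return; a legitimate signal must
    # have been decided before the bar it is applied to
    return signal * returns

def lookahead_check(signal, returns):
    """Compare total return as-is vs. with the signal delayed one bar.
    A collapse after the delay indicates lookahead bias."""
    as_is = backtest_returns(signal, returns).sum()
    delayed = np.concatenate([[0.0], signal[:-1]])  # shift signal one bar later
    shifted = backtest_returns(delayed, returns).sum()
    return as_is, shifted
```

A deliberately cheating signal (the sign of the same bar's return, i.e. perfect foresight) looks spectacular as-is and evaporates once delayed, which is exactly the collapse the test looks for.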
2. Survivorship bias
What it is: your universe of tradeable symbols silently excludes everything that failed.
Most free historical data sources list only currently-traded symbols. So a "backtest on the S&P 500" using today's constituents is actually backtesting companies that were healthy enough to stay in the index for the last N years. Enron, Lehman, Worldcom, Sears, Pacific Gas — all missing. Your strategy looks far better than it would have in real-time because it never got to pick Enron.
Kaufman (kaufman.txt:37789):
A universe that changes through time must be tested with point-in-time constituency, or the results will systematically overstate returns.
Magnitude estimates from academic work suggest survivorship alone can inflate backtested equity-strategy returns by 1–3% per year. Compound that over 10 years and a break-even strategy looks like a winner.
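The arithmetic behind that claim, as a quick check: a strategy that is truly break-even but picks up a 2%-per-year survivorship inflation shows roughly a 22% ten-year gain that never existed.

```python
# Phantom total return from a 2%/yr survivorship inflation, compounded 10 years
phantom_total = 1.02 ** 10 - 1  # about 0.219, i.e. ~22% of nonexistent gain
```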
Fix: use point-in-time constituency data (CRSP or historical index membership lists). If you can't, disclose the limitation explicitly wherever you report results.
3. Curve-fitting (overfitting)
What it is: you've optimized parameters until the backtest looks great, but the parameters describe historical noise, not a real pattern.
Kaufman (kaufman.txt:37595, 37644):
If you test enough parameter combinations you will find one that worked in the past. That does not mean it will work in the future — only that noise is high-dimensional enough to look like signal.
Symptoms of curve-fit results:
- The best parameters are isolated peaks in a grid search (neighbors perform terribly)
- Small changes in parameters → large changes in performance
- Performance degrades sharply once you cross the start of the held-out window
- Trade count is small relative to the number of parameters
The gold standard test: parameter robustness. Plot performance as a function of each parameter individually. The optimal region should be a plateau, not a spike. If a 10% change in your RSI period moves Sharpe from 1.8 to 0.3, that 1.8 is curve-fit.
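The plateau-vs-spike check can be automated on a one-dimensional parameter sweep. A sketch with hypothetical names: compare the peak to the average of its immediate neighbors, and distrust any peak whose neighbors fall away sharply.

```python
import numpy as np

def plateau_score(grid_results):
    """grid_results: 1-D array of performance (e.g. Sharpe) across a
    parameter sweep. Returns (best value, mean of its neighbors).
    A robust optimum has neighbors close to the peak; an isolated
    spike has neighbors far below it."""
    best = int(np.argmax(grid_results))
    lo, hi = max(0, best - 1), min(len(grid_results), best + 2)
    neighbors = [grid_results[i] for i in range(lo, hi) if i != best]
    return float(grid_results[best]), float(np.mean(neighbors))
```

Reasonable thresholds are a judgment call; a neighbor mean above ~90% of the peak suggests a plateau, while one below ~20% of the peak is the isolated spike the text warns about.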
4. Data-snooping bias (multiple testing)
Related to curve-fitting but distinct: you test 1000 strategies, pick the best one, report its performance. Even if all 1000 were pure noise, the best of 1000 will look spectacular.
Aronson's Evidence-Based Technical Analysis (not in our library; cited from general practice) devotes most of its roughly 500 pages to this exact problem. The fix is either (a) White's Reality Check or the Superior Predictive Ability test, both of which adjust for multiple comparisons, or (b) a pre-registered hypothesis tested on data you've never seen.
Simple retail-friendly proxy: for every N strategies you tested to find the winner, divide your reported edge by roughly √N. That's your de-biased estimate. If it's still positive, you might have something.
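That rule of thumb as a one-liner. To be clear, this is the text's rough retail proxy, not a formal multiple-testing correction:

```python
import math

def debiased_edge(observed_edge, n_strategies_tested):
    # Rough haircut: the best of N noise trials overstates its edge by
    # roughly sqrt(N), so shrink the reported edge accordingly
    return observed_edge / math.sqrt(n_strategies_tested)
```

For example, a winner picked from 100 candidates with an apparent edge of 3.0 units de-biases to 0.3.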
5. In-sample / out-of-sample violations
The minimum discipline: split your history into three pieces.
- In-sample (IS): ~60% — used to develop rules
- Validation: ~20% — used to choose between candidate systems
- Out-of-sample (OOS): ~20% — touched exactly once, at the end
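A chronological split along the lines of the list above (no shuffling, since order matters for time series; the function name is mine):

```python
def three_way_split(data, is_frac=0.6, val_frac=0.2):
    # Chronological split: in-sample, validation, out-of-sample, in order
    n = len(data)
    i1 = int(n * is_frac)
    i2 = i1 + int(n * val_frac)
    return data[:i1], data[i1:i2], data[i2:]
```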
The sin is letting OOS bleed back into development. Every time you look at OOS results and tweak the rules, that window becomes in-sample. Most retail "OOS validation" fails this way: the trader iterates on the design, checks OOS, doesn't like it, tweaks, rechecks. Within three cycles the OOS is contaminated.
The only clean OOS is one you cross exactly once, write down the result, and live with it — whether you like the number or not.
6. Walk-forward analysis (the honest version)
Kaufman (kaufman.txt:37261):
Walk-forward optimization re-trains the system on a rolling window and tests on the subsequent untouched window, advancing through history.
Pseudo-procedure:
- Use months 1–24 to optimize parameters
- Trade those parameters in months 25–30 (no re-optimization)
- Slide the window: optimize on months 7–30, trade 31–36
- Repeat to end of history
- Concatenate all the "trade" segments — that's your honest equity curve
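The five steps above can be sketched as a loop. The `optimize` and `trade` callables are assumed interfaces (not from the text): one fits parameters on the training window, the other applies them unchanged to the test window and returns per-bar results.

```python
def walk_forward(prices, optimize, trade, train_len=24, test_len=6):
    """Roll through history: fit parameters on `train_len` bars, apply
    them unchanged to the next `test_len` bars, slide, repeat. Returns
    the concatenated out-of-sample segments."""
    oos = []
    start = 0
    while start + train_len + test_len <= len(prices):
        train = prices[start : start + train_len]
        test = prices[start + train_len : start + train_len + test_len]
        params = optimize(train)          # hindsight allowed here only
        oos.extend(trade(test, params))   # no re-optimization inside
        start += test_len                 # slide by the test window
    return oos
```

The concatenated `oos` list is the honest equity-curve input: every segment was traded with parameters chosen before it began.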
This approximates what live trading would have produced, including the inevitable periods when the old parameters are stale. Walk-forward performance is typically 30–60% of in-sample optimized performance; if yours shows no degradation, something is wrong.
Compare the two equity curves this produces. The optimizer's curve had full hindsight over the price history; the walk-forward curve is the honest estimate, where each OOS segment used parameters chosen on the preceding window only. Overlay the two and the gap between them is your curve-fit tax.
7. Slippage, commissions, and borrow costs
Backtests often assume: fills at mid, no commission, no short-borrow cost, no impact. All four are wrong in real trading.
- Slippage: in liquid large-caps, 1–5 bps per trade; in small-caps, 10–50 bps or more. A strategy trading 200 times a year with 5 bps slippage loses 1% CAGR to frictions alone.
- Commission: ~0 at retail brokers now, but futures and options still carry meaningful per-contract fees.
- Short borrow: shorting hot names costs 5–50% annualized; some tickers are not borrowable at all.
- Impact: if your paper portfolio trades 10% of daily volume, your real orders move the price; your backtest never does.
A rough sanity check: subtract at least 0.10% per round-trip trade before you declare a strategy viable. Intraday strategies need more; monthly-rebalancing strategies can get away with less.
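As a crude haircut this is just a linear subtraction (fine for small costs; the 10 bps default is the floor suggested above, and the function name is mine):

```python
def friction_adjusted_cagr(gross_cagr, trades_per_year, cost_per_round_trip=0.0010):
    # Subtract a flat per-round-trip cost (default 10 bps) from the
    # annual return; a linear approximation, adequate for small costs
    return gross_cagr - trades_per_year * cost_per_round_trip
```

For example, the 28% CAGR from the opening paragraph traded 200 times a year at 10 bps per round trip drops to 8% before any other bias is considered.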
A concrete story: Bulkowski's ultimate-high methodology
Bulkowski's Encyclopedia of Chart Patterns is careful about this exact problem. His "ultimate high" methodology (bulkowski.txt:4609, 4690) measures patterns not by "did price reach target?" but by "how far did price run from breakout before reversing 20%?" The discipline: you cannot look forward from the breakout to cherry-pick a high. You mark the first 20% reversal after breakout and measure to that point, no matter what happened after.
This sounds trivial until you try to implement it. The naive version uses max(price[breakout:]) — which is lookahead. The correct version iterates forward, maintaining running max, and triggers on the first (running_max − price) / running_max >= 0.20 event. The difference in reported pattern effectiveness is substantial.
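A sketch of the correct forward-iterating version (hypothetical function, assuming a list of prices starting at the breakout bar):

```python
def ultimate_high(prices_after_breakout, reversal=0.20):
    """Walk forward from the breakout, tracking the running maximum, and
    stop at the FIRST bar where price has pulled back `reversal` from
    that maximum. Returns the running max at that point, the 'ultimate
    high'. Taking max() over the whole slice instead would be lookahead."""
    running_max = prices_after_breakout[0]
    for p in prices_after_breakout:
        if p > running_max:
            running_max = p
        elif (running_max - p) / running_max >= reversal:
            return running_max   # first 20% reversal ends the measurement
    return running_max           # no qualifying reversal in the sample
```

On a toy path like 100, 110, 120, 95, 140 the correct version reports 120 (the 95 bar is a 20.8% pullback from 120), while the naive lookahead version reports 140.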
Bulkowski's unified event-study framing
Worth internalizing: most of Bulkowski's statistics are event studies, not backtests. An event study measures what happens in a window around a recurring event (a head-and-shoulders breakout, an earnings release, a day-of-week effect). The distinction matters because:
- Event studies have no position sizing assumption — you're measuring the move, not the equity curve
- They're less prone to curve-fitting because there are fewer parameters
- But they're also not directly tradeable — the "percent move from breakout" does not equal "percent return on your portfolio," which depends on sizing, stops, and pyramiding
So when Bulkowski reports 48% of ascending triangles meet target, that is an event statistic, not an equity-curve result. Converting to a tradeable P&L requires adding a stop rule, a position-sizing rule, and commission. Your mileage will vary.
Quick check
You backtest a strategy on the current S&P 500 constituents going back to 1990. Results look amazing. Which pitfall is almost certainly inflating the numbers?
What you now know
- Lookahead bias uses future info; detect by shifting signals back one bar
- Survivorship bias silently excludes failures; fix with point-in-time constituency
- Curve-fitting optimizes parameters onto noise; demand parameter plateaus, not spikes
- Data-snooping inflates the best of N trials; de-bias by roughly √N
- IS/OOS discipline means touching OOS exactly once, ever
- Walk-forward analysis approximates live trading; expect 30–60% degradation from optimized IS
- Slippage + commissions + borrow eat 0.10–0.50% per round-trip in practice
Next: Walk-Forward & Monte Carlo — the gold standard for out-of-sample validation, turning backtesting discipline into statistical confidence.