System Testing Basics

What a backtest actually measures

A backtest runs your rules over historical data and produces a series of simulated trades. What you do with that series is where most people go wrong. A single "total return" number tells you almost nothing. Two systems with identical 15% CAGR can have wildly different risk profiles — one grinds up a smooth equity curve while the other spends eight months underwater before a single monster trade drags it into the green.

Kaufman frames the mindset directly:

Testing is only as good as its weakest element: data quality, the representativeness of the sample, the realism of the rules, and the honesty of the analyst.

Everything in this lesson is about turning a trade log into a verdict you can defend.

Kaufman's 10 essential metrics

Kaufman lists the core performance measures every backtest should report (kaufman.txt:37804–37824). Condensed, they are:

Net profit (absolute and as a % of starting equity)
Number of trades — too few and the results are noise
% profitable (win rate)
Average trade (winners and losers separately, then combined)
Largest winning trade / largest losing trade
Maximum drawdown — peak-to-trough equity decline
Profit factor — gross winners / gross losers
Average bars in trades — winners vs losers
Ratio of largest win : largest loss
Consecutive wins / consecutive losses — can you sit through the worst streak?

These don't stand alone — they're a panel. Looking at win rate without the win/loss magnitude ratio is worthless; looking at net profit without drawdown is worse.
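To make the panel concrete, here is a minimal sketch that derives several of these metrics from a list of per-trade P&L values. The function name `summarize_trades`, the sample P&L figures, and the choice to measure drawdown on closed-trade equity are all assumptions for illustration, not Kaufman's code.

```python
from itertools import groupby

def summarize_trades(pnls, starting_equity):
    """Build a small metrics panel from per-trade P&L (hypothetical helper)."""
    wins = [p for p in pnls if p > 0]
    losses = [p for p in pnls if p < 0]   # breakeven trades count as neither
    gross_win = sum(wins)
    gross_loss = -sum(losses)             # expressed as a positive number

    # Max drawdown: largest peak-to-trough decline of closed-trade equity
    equity = peak = starting_equity
    max_dd = 0.0
    for p in pnls:
        equity += p
        peak = max(peak, equity)
        max_dd = max(max_dd, (peak - equity) / peak)

    # Longest run of consecutive losing trades
    worst_streak = max(
        (len(list(run))
         for is_loss, run in groupby(pnls, key=lambda p: p < 0) if is_loss),
        default=0,
    )

    return {
        "net_profit": sum(pnls),
        "net_profit_pct": sum(pnls) / starting_equity,
        "trades": len(pnls),
        "win_rate": len(wins) / len(pnls),
        "avg_win": gross_win / len(wins) if wins else 0.0,
        "avg_loss": gross_loss / len(losses) if losses else 0.0,
        "profit_factor": gross_win / gross_loss if gross_loss else float("inf"),
        "max_drawdown": max_dd,
        "max_consecutive_losses": worst_streak,
    }

# Made-up trade log, in dollars
pnls = [120, -50, -50, 200, -50, -50, -50, 300, 80, -50]
panel = summarize_trades(pnls, starting_equity=10_000)
for name, value in panel.items():
    print(f"{name:>24}: {value}")
```

Reading the output as a whole is the point: this made-up log wins only 40% of the time, yet the profit factor is above 2 because the winners are several times the size of the losers.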

Expectancy — the one number that matters most

Kaufman's definition (kaufman.txt:44225–44235) of expectancy per trade:

E = P_{win} \times \overline{Win} - P_{loss} \times \overline{Loss}

Or normalized as an R-multiple (where 1R = your fixed risk per trade):

E_R = B \cdot (1 + R) - 1

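where B = win rate (as a fraction) and R = reward-to-risk ratio, i.e. average win divided by average loss. This is the first formula divided through by the average loss, with P_{loss} = 1 - B: the result is B·R − (1 − B), which rearranges to B·(1 + R) − 1.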

Positive expectancy = the system makes money on average. You can have a 30% win rate and positive expectancy if winners are 3× the size of losers. You can have a 70% win rate and negative expectancy if losers are 3× the size of winners. Win rate alone is a lie — it's one of three knobs and not even the most important one.
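The two cases above can be checked directly. This is a small sketch of both expectancy formulas; the function names and the dollar amounts are hypothetical, chosen so that winners and losers differ by a factor of 3.

```python
def expectancy(win_rate, avg_win, avg_loss):
    """Expected profit per trade; avg_loss is a positive magnitude."""
    return win_rate * avg_win - (1 - win_rate) * avg_loss

def expectancy_r(win_rate, reward_to_risk):
    """Expectancy in R-multiples: B * (1 + R) - 1."""
    return win_rate * (1 + reward_to_risk) - 1

# 30% win rate, winners 3x the size of losers
low_win_rate = expectancy(0.30, avg_win=300, avg_loss=100)
# 70% win rate, losers 3x the size of winners
high_win_rate = expectancy(0.70, avg_win=100, avg_loss=300)

print(f"30% win rate: {low_win_rate:+.2f} per trade")
print(f"70% win rate: {high_win_rate:+.2f} per trade")
print(f"30% win rate at 3:1 in R: {expectancy_r(0.30, 3.0):+.2f}R")
```

The low-win-rate system comes out at about +$20 per trade and the high-win-rate system at about −$20, which is the whole argument against judging a system by win rate alone.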

Risk-adjusted returns — why absolute return isn't enough

A 20% return with a 5% max drawdown is a radically different system from a 20% return with a 40% max drawdown. Both print the same headline. The second one will make you quit before it pays you back.

Sharpe Ratio (kaufman.txt:42112–42180)

SR = \frac{\overline{R_p} - R_f}{\sigma_p}

Excess return over risk-free rate, divided by return volatility. The gold standard since William Sharpe introduced it — but it penalizes upside volatility the same as downside, which is why practitioners reach for Sortino.
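As a rough sketch, the Sharpe formula can be computed from a series of periodic returns like this. The monthly return figures are invented, and the square-root-of-time annualization is an assumption that treats returns as independent from period to period; it is a common convention, not something the formula above requires.

```python
import math
import statistics

def sharpe_ratio(returns, risk_free_per_period=0.0, periods_per_year=12):
    """Annualized Sharpe ratio from periodic returns (illustrative sketch)."""
    excess = [r - risk_free_per_period for r in returns]
    mean_excess = statistics.fmean(excess)
    volatility = statistics.stdev(excess)   # sample standard deviation
    # Assumption: scale by sqrt(periods) to annualize
    return mean_excess / volatility * math.sqrt(periods_per_year)

# Hypothetical monthly returns
monthly = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04]
print(f"Sharpe (annualized): {sharpe_ratio(monthly):.2f}")
```

With a constant risk-free rate, the standard deviation of excess returns equals the standard deviation of the raw returns, so this matches the denominator in the formula.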

Sortino Ratio (kaufman.txt:42200–42246)

Same numerator; the denominator uses downside deviation only: the root-mean-square of returns that fall below a target (commonly zero or the risk-free rate), so gains never inflate it. A system with big upside swings and small downside gets rewarded, not punished. Kaufman: use Sortino when your return distribution is skewed (which is every trend system ever).
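A minimal Sortino sketch follows. It assumes one common convention for downside deviation: the root-mean-square of shortfalls below a target, averaged over all periods, with periods at or above the target counted as zero. Other conventions exist (for example, averaging over the losing periods only), so treat the function and its `target` parameter as illustrative.

```python
import math

def sortino_ratio(returns, target=0.0, periods_per_year=12):
    """Annualized Sortino ratio under an assumed downside-deviation convention."""
    excess = [r - target for r in returns]
    mean_excess = sum(excess) / len(excess)
    # Only shortfalls below the target contribute; gains count as zero
    downside_sq = [min(0.0, e) ** 2 for e in excess]
    downside_dev = math.sqrt(sum(downside_sq) / len(excess))
    if downside_dev == 0:
        return float("inf")   # no period fell below the target
    return mean_excess / downside_dev * math.sqrt(periods_per_year)

# Hypothetical skewed returns: large gains, small losses
skewed = [0.05, -0.01, 0.08, -0.02, 0.06, 0.00]
print(f"Sortino (annualized): {sortino_ratio(skewed):.2f}")
```

Because the large gains in this series never enter the denominator, the ratio is not penalized for upside volatility the way Sharpe would be.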

Calmar Ratio (kaufman.txt:42286–42310)

Calmar = \frac{\text{CAGR}}{|\text{Max Drawdown}|}

Return per unit of worst historical drawdown. A Calmar of 1.0 means your annualized return equals your worst drawdown — you spend a year earning back what the system takes from you at its worst. Funds target Calmar > 0.5 as a rough floor; > 1.0 is considered good.
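Here is a short sketch of the Calmar calculation from an equity curve. The helper names and the equity values are hypothetical; the curve is assumed to span exactly two years.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

def cagr(equity, years):
    """Compound annual growth rate between the first and last equity values."""
    return (equity[-1] / equity[0]) ** (1 / years) - 1

def calmar_ratio(equity, years):
    dd = max_drawdown(equity)
    return cagr(equity, years) / dd if dd > 0 else float("inf")

# Hypothetical equity curve covering two years
equity = [100, 110, 99, 121, 115, 144]
print(f"CAGR:   {cagr(equity, 2):.1%}")
print(f"Max DD: {max_drawdown(equity):.1%}")
print(f"Calmar: {calmar_ratio(equity, 2):.2f}")
```

In this example the curve grows 44% over two years (20% CAGR) with a worst decline of 10% (110 to 99), giving a Calmar of about 2.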

Ulcer Index (kaufman.txt:42320–42326)

Measures the depth and duration of drawdowns — a system that recovers in a week is better than one that grinds underwater for six months, even if the maximum drawdown is identical. Ulcer is the tool for measuring that difference.
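The Ulcer Index is usually computed as the square root of the mean squared percentage drawdown from the running peak. The sketch below assumes that definition, measured over the whole series rather than a rolling lookback window, and compares two invented equity curves with the same maximum drawdown.

```python
import math

def ulcer_index(equity):
    """Root-mean-square percentage drawdown from the running peak."""
    peak = equity[0]
    squared = []
    for value in equity:
        peak = max(peak, value)
        drawdown_pct = 100.0 * (value - peak) / peak   # zero or negative
        squared.append(drawdown_pct ** 2)
    return math.sqrt(sum(squared) / len(squared))

# Both curves bottom out 10% below the peak; only recovery time differs
fast_recovery = [100, 90, 100, 100, 100, 100, 110]
slow_recovery = [100, 90, 90, 90, 90, 90, 110]

print(f"Fast recovery: {ulcer_index(fast_recovery):.2f}")
print(f"Slow recovery: {ulcer_index(slow_recovery):.2f}")
```

The slow-recovery curve scores more than twice as high even though the maximum drawdown is identical, because every period spent underwater adds to the sum.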

MAE and MFE — what happened inside the trade

Standard backtest outputs only show trade entry and exit. But for diagnosis, you want to know what happened between.

MAE (Maximum Adverse Excursion): the worst the trade got before you exited. If your winners routinely draw down 3× your eventual profit before rebounding, your stop placement is a liability.
MFE (Maximum Favorable Excursion): how far the trade went in your favor before you exited. If your winners routinely reach 3R and you exit at 1R, your exits are too tight.

Kaufman credits these to John Sweeney's 1996 book Maximum Adverse Excursion (kaufman.txt:45001) — the idea being that the shape of intra-trade price action contains information about where to place stops and exits, independent of where you actually did. Plot MAE vs outcome and patterns appear: winning trades rarely exceed a given MAE threshold, and you can use that threshold as a stop-loss that minimizes false exits.
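If your backtest keeps the bar highs and lows for the life of each trade, MAE and MFE take only a few lines to compute. This sketch expresses both in R-multiples; the function name, the `risk_per_unit` parameter (initial stop distance per share or contract), and the sample prices are all assumptions for illustration.

```python
def excursions(entry_price, highs, lows, risk_per_unit, side="long"):
    """Return (MAE, MFE) in R-multiples for a single trade.

    highs/lows are the bar extremes observed while the trade was open.
    """
    if side == "long":
        adverse = entry_price - min(lows)
        favorable = max(highs) - entry_price
    else:  # short trade: rallies hurt, declines help
        adverse = max(highs) - entry_price
        favorable = entry_price - min(lows)
    # An excursion cannot be negative; floor at zero
    return max(adverse, 0.0) / risk_per_unit, max(favorable, 0.0) / risk_per_unit

# Hypothetical long trade entered at 100 with a 2-point initial risk
mae, mfe = excursions(
    entry_price=100.0,
    highs=[101, 104, 106, 103],
    lows=[99, 98.5, 102, 101],
    risk_per_unit=2.0,
)
print(f"MAE: {mae:.2f}R  MFE: {mfe:.2f}R")
```

Collecting these pairs for every trade and plotting MAE against final outcome is what exposes the threshold described above.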

What "statistical significance" buys you

A system with 12 trades that returned 40% means nothing. A system with 400 trades returning 40% is a signal. Kaufman:

A system cannot be reliably evaluated on fewer than 30 trades, and many researchers consider 100 the practical minimum.

Two mechanisms underneath:

Sample size: for a win rate near 50%, the 95% confidence interval on your estimate is roughly ±10 percentage points at 100 trades and ±3 at 1000. Small samples produce huge uncertainty.
Degrees of freedom: each parameter you optimize costs you information. A 5-parameter system tested on 30 trades has fit itself to noise; the out-of-sample degradation will be severe.

Practical rule: at least 10 trades per optimized parameter, and 30 or more for comfort.
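The confidence-interval figures come from the normal approximation to a binomial proportion. The sketch below assumes that approximation and a true win rate of 50%, which is the worst case for uncertainty; both helper names are hypothetical.

```python
import math

def win_rate_margin(n_trades, win_rate=0.5, z=1.96):
    """Half-width of the ~95% confidence interval (normal approximation)."""
    return z * math.sqrt(win_rate * (1 - win_rate) / n_trades)

def trades_per_parameter(n_trades, n_params):
    """Rough degrees-of-freedom check for an optimized system."""
    return n_trades / n_params

for n in (30, 100, 1000):
    print(f"{n:>5} trades: win rate known to about ±{win_rate_margin(n):.1%}")

# The 5-parameter, 30-trade example falls well short of 10 per parameter
print(f"Trades per parameter: {trades_per_parameter(30, 5):.0f}")
```

The margin shrinks with the square root of the trade count, so going from 100 to 1000 trades narrows it by a factor of about 3, not 10.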

The minimum honest report

After a backtest, you should be able to produce this in one page:

Total trades, win rate, profit factor
Net return and CAGR
Max drawdown (in % and duration in months)
Sharpe, Sortino, and Calmar
Expectancy per trade (in $ and R)
Largest winner and loser; average winner and loser
Worst losing streak (number of trades and $ lost)
Out-of-sample slice — critical. A backtest without a held-out window is fiction.

Anyone reporting just "X% return over Y years" is selling, not testing.
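One simple way to carve out an out-of-sample slice is a chronological split: develop on the earlier portion and score the untouched later portion once. This is a minimal sketch under that assumption; the function name and the 70/30 split are illustrative choices, not a rule from the source.

```python
def chronological_split(records, in_sample_fraction=0.7):
    """Split time-ordered records into (in_sample, out_of_sample).

    Never shuffle before splitting: the held-out window must come
    strictly after the data used to develop the rules.
    """
    cut = round(len(records) * in_sample_fraction)
    return records[:cut], records[cut:]

# Hypothetical time-ordered bars or trades, labeled by index
history = list(range(10))
in_sample, out_of_sample = chronological_split(history)
print("Develop on:", in_sample)
print("Verify on: ", out_of_sample)
```

The held-out window only stays honest if you look at it once; re-tuning after each peek turns it into more in-sample data.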

Quick check


System A: 70% win rate, avg win $50, avg loss $200. System B: 30% win rate, avg win $400, avg loss $100. Which has positive expectancy?

What you now know

A backtest is a trade log; the 10 Kaufman metrics turn that log into a verdict
Expectancy = win_rate × avg_win − loss_rate × avg_loss is the one number that matters most
Sharpe, Sortino, Calmar, Ulcer measure different flavors of risk-adjusted return — use the one that matches your distribution
MAE and MFE reveal the shape of intra-trade action and inform stop/target placement
Below 100 trades, your metrics are noise; below 30 they're fantasy
Every honest backtest reports out-of-sample results

Next: Backtesting Pitfalls — the seven ways backtests lie to you, and how to catch each one before it costs you real money.