Advanced | Lesson 9 of 10 | 7 min read

Backtesting without fooling yourself

A backtest is a simulation of how a strategy would have performed on historical data, and it is the most abused tool in trading. The abuse is rarely fraud. It is self-deception with a spreadsheet: a series of small, individually defensible choices — this date range, this parameter, this clean dataset — that collectively manufacture a beautiful equity curve no live account will ever see. The uncomfortable base rate: most backtests that look spectacular are wrong, and the ways they are wrong are so well-catalogued that producing an honest one is mostly a matter of refusing, methodically, to cheat.

This lesson is that catalogue. The goal of a backtest is not to make a strategy look good — it is to try as hard as possible to kill the strategy, and to trade only what survives.

Lookahead bias: trading on tomorrow’s newspaper

Lookahead bias means your simulation uses information that was not available at the moment of the simulated decision. The crude version is obvious — computing a signal from a day’s closing price and “executing” at that day’s open. The insidious versions hide in data plumbing: indicators computed over a window that includes the current, still-forming bar; economic data stamped at its reference date rather than its release date (a quarterly figure describing March was not knowable until weeks later); datasets that were revised after the fact, so your simulation reacts to the corrected number nobody saw in real time. Even one leaky field can dominate results, because information from the future is the single most profitable input a strategy can have. The discipline: for every input, ask not “what is the value?” but “at what moment did this value become knowable?” — and timestamp accordingly.
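As a minimal sketch of the "when did this become knowable?" question, the snippet below looks up a data series by release date rather than reference date. The record layout, the `known_as_of` helper, and the figures are hypothetical and purely illustrative; real feeds differ, and revisions would need their own release timestamps.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical records: (reference_date, release_date, value).
# Dates and values are made up for illustration.
RELEASES = [
    (date(2023, 12, 31), date(2024, 1, 25), 3.1),
    (date(2024, 3, 31), date(2024, 4, 25), 1.6),
]

def known_as_of(records, decision_date):
    """Return the most recently released value on or before decision_date.

    Returns None if nothing had been released yet.
    """
    by_release = sorted(records, key=lambda r: r[1])
    release_dates = [r[1] for r in by_release]
    idx = bisect_right(release_dates, decision_date)
    return by_release[idx - 1][2] if idx else None

# On 10 April the March figure already "exists" in the dataset,
# but it had not been published, so the simulation must not see it.
print(known_as_of(RELEASES, date(2024, 4, 10)))  # 3.1
print(known_as_of(RELEASES, date(2024, 4, 25)))  # 1.6
```

Keying every lookup on the release date makes the leak structurally impossible instead of relying on each signal to remember the lag.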

Survivorship bias: the graveyard is missing

Test a strategy on the current constituents of any asset universe and you have quietly conditioned on survival. Every delisted stock, every collapsed token, every instrument that bled out and vanished is absent from the sample — and those are precisely the assets your strategy might have bought on the way down. A buy-the-dip strategy backtested on today’s survivors looks brilliant for the simplest reason imaginable: every asset in the test, by construction, eventually recovered. The fix is point-in-time universes — testing each historical date against the instruments that actually existed and were tradeable then. Where such data is unavailable, treat results on dip-buying and mean-reversion strategies with active suspicion, because survivorship flatters exactly those.
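A point-in-time universe can be sketched as a filter over listing and delisting dates. The symbols, dates, and the `universe_on` and `survivors` helpers below are hypothetical; the sketch assumes you have delisting dates at all, which is often the hard part.

```python
from datetime import date

# Hypothetical listing records: symbol -> (listed_on, delisted_on or None).
LISTINGS = {
    "AAA": (date(2015, 1, 2), None),
    "BBB": (date(2016, 6, 1), date(2019, 3, 15)),  # later delisted
    "CCC": (date(2020, 9, 1), None),               # did not exist in 2018
}

def universe_on(listings, day):
    """Symbols that were listed and not yet delisted on the given day."""
    return sorted(
        symbol
        for symbol, (listed, delisted) in listings.items()
        if listed <= day and (delisted is None or day < delisted)
    )

def survivors(listings):
    """The biased alternative: whatever is still listed today."""
    return sorted(s for s, (_, delisted) in listings.items() if delisted is None)

print(universe_on(LISTINGS, date(2018, 1, 1)))  # ['AAA', 'BBB']
print(survivors(LISTINGS))                      # ['AAA', 'CCC']
```

The survivor list is wrong in both directions for a 2018 test date: it drops BBB, which was tradeable and later died, and it includes CCC, which did not exist yet.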

Overfitting: torture the data and it confesses

Try enough parameter combinations and some will perform superbly on your historical sample by pure chance. Optimize a moving-average crossover over 50 lookback values and 50 thresholds and you have run 2,500 experiments; the best one is virtually guaranteed to look stellar and to mean nothing. The killer is that the resulting equity curve is indistinguishable from genuine edge — the fraud is invisible in the artifact and lives entirely in the process that produced it. Warning signs and countermeasures:

Parameter fragility. If a lookback of 19 is great but 17 and 21 are mediocre, you found noise. Real effects are plateaus, not pinnacles — performance should degrade gently as parameters move.
Complexity creep. Every added rule, filter, and exception fits the past better and predicts the future worse. A strategy whose description needs a paragraph of special cases is a memoir, not a model.
Multiplicity accounting. Honest practice counts every variant tried — including the abandoned ones — because the more you searched, the higher the bar your best result must clear to be evidence rather than selection.
The deleted-failures test. If you cannot list the versions that did not work, you no longer know how hard you searched, and neither does your backtest.
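The multiplicity point can be made concrete with a small simulation: generate 2,500 "strategies" whose daily returns are pure zero-mean noise, then look at the best one. This is a sketch under simplifying assumptions (independent Gaussian returns, 1% daily volatility, one year of data, a plain annualized Sharpe ratio as the score), not a model of any real parameter search, where variants are usually correlated.

```python
import math
import random

def annualized_sharpe(daily_returns, periods=250):
    """Mean over sample standard deviation, scaled to an annual figure."""
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    var = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(periods)

def best_of_noise(n_variants=2500, n_days=250, seed=0):
    """Score n_variants zero-edge return series; return (best, median) Sharpe."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_variants):
        # Zero expected return by construction: there is no edge to find.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sharpes.append(annualized_sharpe(returns))
    sharpes.sort()
    return sharpes[-1], sharpes[len(sharpes) // 2]

best, median = best_of_noise()
print(f"best of 2,500 zero-edge variants: Sharpe {best:.2f}")
print(f"median variant: Sharpe {median:.2f}")
```

The median variant sits near zero, as it should, while the best one posts a Sharpe ratio that would look like a career-making discovery if it were reported without the other 2,499.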

Fees and slippage: where paper profits go to die

Frictionless backtests are fiction by construction. Run the arithmetic once and you will never skip it again: a strategy trading one round trip per day at taker fees of 0.25% per side (Obsidiate’s Bronze tier) pays 0.5% per round trip, roughly 125% of capital per year in fees alone at 250 trading days. A high-turnover signal that grosses 30% annually nets roughly -95% after that drag: it is a machine for converting your capital into exchange revenue. The same strategy at Diamond taker rates (0.10% per side) still pays 50% annually in fees against 30% gross, which is still fatal. The lesson generalizes: fee tier and maker-versus-taker execution are not implementation details; for active strategies they are the difference between existence and nonexistence. Slippage modeling matters just as much: assume fills at the spread’s far side, not the mid; haircut or exclude fills your size could not realistically have gotten; and remember that stop-loss exits in fast markets fill worse than the trigger, sometimes much worse. When in doubt, double your friction estimate and see if the strategy survives. Real conditions are reliably worse than modeled ones, never better.

Before optimizing anything, compute one number: annual fee drag at your actual tier and trade frequency. If gross expected edge is not a comfortable multiple of that number, stop — no parameter search can rescue a strategy whose costs exceed its alpha.
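The fee-drag number is a one-line calculation. The sketch below uses the simple, non-compounded arithmetic from this lesson and assumes the full capital is turned over on every round trip; the function names are illustrative, and slippage is left out entirely.

```python
def annual_fee_drag(fee_per_side, round_trips_per_day, trading_days=250):
    """Annual fees as a fraction of capital (simple sum, no compounding)."""
    return 2 * fee_per_side * round_trips_per_day * trading_days

def net_annual_edge(gross_annual, fee_per_side, round_trips_per_day,
                    friction_multiplier=1.0):
    """Gross edge minus fee drag; friction_multiplier=2.0 is the stress test."""
    drag = annual_fee_drag(fee_per_side, round_trips_per_day)
    return gross_annual - friction_multiplier * drag

for tier, fee in [("Bronze", 0.0025), ("Diamond", 0.0010)]:
    drag = annual_fee_drag(fee, round_trips_per_day=1)
    net = net_annual_edge(0.30, fee, round_trips_per_day=1)
    stressed = net_annual_edge(0.30, fee, 1, friction_multiplier=2.0)
    print(f"{tier}: drag {drag:.0%}, net {net:.0%}, net at 2x friction {stressed:.0%}")
```

For the lesson's example of a 30% gross strategy trading once a day, this gives a drag of 125% at the Bronze rate and 50% at the Diamond rate, so the net figure is negative at both tiers before slippage is even considered.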

Out-of-sample discipline

The defense against all of the above is data your development process never touched. The standard structure: build and tune on an in-sample period, then evaluate — once — on a held-out out-of-sample period. The “once” is the entire point. Peek at the holdout, adjust, and re-evaluate, and you have silently converted it into in-sample data; the second look is already contaminated. Walk-forward analysis extends the idea — fit on a window, test on the next slice, roll forward, concatenate only the test slices — which also reveals whether the edge persists across regimes or lived in one lucky stretch. The final and least forgiving holdout is paper trading or minimum-size live trading: data that arrives after development ends is the only sample you provably could not have fit. Expect live performance below backtest performance; the gap is the sum of every bias you failed to remove, finally presenting its bill.
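The walk-forward structure described above can be sketched as an index generator: fit on a window, test on the next slice, roll forward by one test slice. The function name and the fixed-size rolling window are assumptions for illustration; an expanding training window is an equally common variant.

```python
def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train, test) index ranges, rolling forward by test_size.

    Every test range starts right after its training range ends, and
    successive test ranges do not overlap, so they can be concatenated.
    """
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size

splits = list(walk_forward_splits(n_obs=10, train_size=4, test_size=2))
for train, test in splits:
    print(f"fit on {list(train)}, test on {list(test)}")

# Only the test slices are stitched into the reported equity curve.
out_of_sample = [i for _, test in splits for i in test]
print(out_of_sample)  # [4, 5, 6, 7, 8, 9]
```

Parameters are refit inside each training window only; any tuning that looks at the concatenated test slices turns them back into in-sample data.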

Why most published backtests are fiction

Now stack the incentives. Anyone selling a strategy, a signal service, or their own cleverness shows you the best curve they produced, never the four hundred variants that died in development — publication itself is a survivorship filter. Add frictionless fills, perfect data, in-sample parameters, and a flattering date range, and the published equity curve is best understood as marketing collateral that happens to have axes. Your own backtests deserve the same skepticism, because the same selection pressure operates inside one skull: you remember the version that worked. The honest stance is to treat every backtest — yours included — as a hypothesis that has merely survived preliminary screening, with the real test always ahead, in data that does not exist yet.

Key takeaways

Lookahead bias hides in timestamps: for every input, ask when the value became knowable, not what it was.
Survivorship bias flatters dip-buying most — test against point-in-time universes that include the delisted and the dead.
Overfitting is invisible in the equity curve; it lives in how many variants you tried. Real edges are parameter plateaus, not pinnacles.
Fee arithmetic kills strategies before markets get the chance: a daily round trip at 0.25% taker costs roughly 125% of capital per year.
Out-of-sample data may be used exactly once; the second look makes it in-sample. Walk-forward and small-size live trading are the honest finals.
Published backtests pass through a survivorship filter called publication — assume fiction until proven otherwise, including your own.
