HOW TO SPOT BACKTEST OVERFITTING
David H. Bailey, Lawrence Berkeley National Lab (retired), and University of California, Davis
Marcos López de Prado, Guggenheim Partners, LLC
in collaboration with
Jonathan M. Borwein, University of Newcastle, Australia
Qiji Jim Zhu, Western Michigan University
Key points
· Backtests (i.e., historical simulations of performance) are widely employed to test and operate investment strategies.
· If the researcher tries a large enough number of strategy configurations, a backtest can always be fit to any desired performance for a fixed sample length. Thus, there is a minimum backtest length (MinBTL) that should be required for a given number of trials.
· Standard statistical techniques designed to prevent regression overfitting, such as hold-out, are ineffective in the context of backtest evaluation.
· Under memory effects, overfitting may lead to systematic losses.
· Overfitting is just one example of the misuse of mathematical and statistical methods applied to finance.
· Since most published backtests do not report the number of trials involved, many are overfit.
“I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” [Enrico Fermi, 1953]
Backtesting
· A backtest is a historical simulation of an algorithmic investment strategy.
· Among other results, it computes the series of profits and losses that such a strategy would have generated had the algorithm been run over a specified time period.
[Figure: example of a backtested strategy.] The green line plots the performance of a tradable security, while the blue line plots the performance achieved by buying and selling that security. The Sharpe ratio is 1.77, with 46.21 trades per year. Note the low correlation between the strategy’s performance and the security’s.
Reasons for backtesting investment strategies
· The information contained in the reported series of profits and losses may be summarized in popular performance metrics, such as the Sharpe Ratio (SR).
· These metrics are essential to select optimal parameter combinations: calibration frequency, risk limits, entry thresholds, stop losses, profit taking, etc.
Optimizing two parameters generates a 3D surface, which can be plotted as a heat-map (see graph).
[Figure: heat-map of in-sample Sharpe Ratio.] The x-axis tries different entry thresholds, while the y-axis tries different exit thresholds. The part of the spectrum closer to green indicates the region of optimal in-sample Sharpe Ratio.
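The two-parameter search described on this slide can be sketched as follows. This is only an illustration under stated assumptions: the trading rule (`toy_strategy_pnl`), the threshold values, and the simulated price path are all hypothetical and do not come from the talk. The point is simply that every cell of the heat-map is one more trial.

```python
import math
import random
import statistics


def annualized_sharpe(pnl, periods_per_year=252):
    """Annualized Sharpe ratio of a P&L series (risk-free rate assumed zero)."""
    if len(pnl) < 2:
        return 0.0
    sd = statistics.stdev(pnl)
    if sd == 0:
        return 0.0  # the rule never traded, or the P&L was constant
    return statistics.mean(pnl) / sd * math.sqrt(periods_per_year)


def toy_strategy_pnl(prices, entry, exit_level, window=20):
    """Hypothetical rule: buy when the price falls `entry` below its moving
    average, sell once it recovers to within `exit_level` of that average."""
    position, pnl = 0, []
    for t in range(window, len(prices) - 1):
        deviation = prices[t] - sum(prices[t - window:t]) / window
        if position == 0 and deviation < -entry:
            position = 1
        elif position == 1 and deviation > -exit_level:
            position = 0
        pnl.append(position * (prices[t + 1] - prices[t]))
    return pnl


def sharpe_grid(prices, entries, exits):
    """In-sample Sharpe ratio for every (entry, exit) pair: the heat-map data."""
    return {
        (e, x): annualized_sharpe(toy_strategy_pnl(prices, e, x))
        for e in entries
        for x in exits
    }


rng = random.Random(0)
prices = [100.0]
for _ in range(1000):
    prices.append(prices[-1] + rng.gauss(0.0, 1.0))  # driftless random walk

grid = sharpe_grid(prices, entries=[1.0, 2.0, 3.0, 4.0], exits=[0.0, 0.5, 1.0])
best_params = max(grid, key=grid.get)
print(best_params, round(grid[best_params], 2))
```

Because the simulated prices are a driftless random walk, the rule has no true edge, so whichever cell looks best in this grid owes its ranking to the search itself.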
DANGER AHEAD
Supercomputers and high-tech mathematical finance algorithms can generate nonsense faster than ever before!
The principal danger is statistical overfitting of backtest data:
· When a computer can analyze thousands or millions of variations of a given strategy, it is almost certain that the best such strategy, measured by backtests, will be overfit (and thus of dubious value).
· Many studies claim profitable investment strategies, but their results are based only on in-sample (IS) statistics, with no out-of-sample (OOS) testing.
· Overfitting is the most common reason that mathematical investment schemes look great in backtests, but then fall flat in the real world.
· And yet, most backtesting software does not control for the probability of backtest overfitting!
The hold-out method to test an investment strategy (not very …)
[Figure: flowchart of the hold-out method. Surviving labels: Investment strategy; IS Perform. Evaluation; “No” branch; Optimal Model Configuration; OOS Perform. Evaluation; single point of decision; High Variance of Error.]
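How easy is it to overfit a backtest? Unfortunately, too easy!
For instance, if only 5 years of data are available, no more than 45 independent model configurations should be tried. For that number of trials, the expected maximum IS SR is 1, whereas the expected OOS SR is 0.
[Figure: minimum backtest length (MinBTL) plotted against the number of trials N.]
$$\mathrm{MinBTL} \approx \left( \frac{(1-\gamma)\, Z^{-1}\!\left[1 - \frac{1}{N}\right] + \gamma\, Z^{-1}\!\left[1 - \frac{1}{N} e^{-1}\right]}{E[\max_N]} \right)^{2}$$
Here N is the number of independent trials, Z^{-1} is the inverse of the standard normal CDF, γ ≈ 0.5772 is the Euler-Mascheroni constant, E[max_N] is the expected maximum annualized IS Sharpe ratio, and MinBTL is measured in years.
After trying only 7 independent strategy configurations, the expected maximum IS SR is 1 for a 2-year-long backtest, while the expected OOS SR is 0.
Therefore, a backtest that does not report the number of trials N used to identify the selected configuration makes it impossible to assess the risk of overfitting.
Overfitting makes any Sharpe ratio achievable in-sample: the researcher just needs to keep trying alternative parameters for that strategy!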
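The two numerical claims on this slide (7 trials for 2 years, 45 trials for 5 years) can be checked with a short sketch of the MinBTL approximation. The function names are hypothetical, and the formula is taken to be the one given in the authors' Notices of the AMS paper, with the expected maximum IS Sharpe ratio set to an annualized target of 1.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials):
    """Approximate expected maximum of n_trials independent standard normal
    draws: (1 - g) * Z^-1[1 - 1/N] + g * Z^-1[1 - 1/(N e)]."""
    if n_trials < 2:
        raise ValueError("the approximation needs at least two trials")
    inv_cdf = NormalDist().inv_cdf
    return ((1 - EULER_GAMMA) * inv_cdf(1 - 1 / n_trials)
            + EULER_GAMMA * inv_cdf(1 - 1 / (n_trials * math.e)))


def min_backtest_length(n_trials, target_sharpe=1.0):
    """Approximate MinBTL in years: the backtest length below which the best
    of n_trials skill-less configurations is expected to reach an annualized
    in-sample Sharpe ratio of target_sharpe."""
    return (expected_max_sharpe(n_trials) / target_sharpe) ** 2


for n in (7, 8, 45):
    print(n, round(min_backtest_length(n), 2))
```

With a target Sharpe ratio of 1, seven trials require slightly less than 2 years of data and eight trials require more, while 45 trials require almost exactly 5 years, in line with the figures quoted above.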
Overfitting in the absence of memory
The left figure shows the relation between SR IS (x-axis) and SR OOS (y-axis). Because the process follows a random walk, the scatter plot has a circular shape centered at the point (0,0).
The right figure illustrates what happens once we add a “model selection” procedure. Now the SR IS ranges from 1.2 to 2.6, and it is centered around 1.7. Although the backtest for the selected model generates the expectation of a 1.7 SR, the expected SR OOS is unchanged, at around 0.
Overfitting in the absence of memory (cont.)
[Figure: performance of the selected random walk, with an In-Sample (IS) half followed by an Out-Of-Sample (OOS) half.]
This figure shows what happens when we select the random walk with the highest SR in-sample (IS).
The performance of the first half was optimized (IS), and the performance of the second half is what the investor receives out-of-sample (OOS).
The good news is that, in the absence of memory, there is no reason to expect overfitting to induce negative performance.
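The selection effect described on these two slides can be reproduced with a small simulation. This is a sketch under assumed settings (20 skill-less strategies per experiment, 120 observations per half, Gaussian returns, per-period rather than annualized Sharpe ratios), none of which are taken from the talk.

```python
import math
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio (mean over standard deviation, zero risk-free rate)."""
    return statistics.mean(returns) / statistics.stdev(returns)


def select_best_in_sample(rng, n_strategies=20, half_length=120):
    """Simulate skill-less, memoryless strategies and keep the one with the
    highest IS Sharpe ratio. Returns its (IS Sharpe, OOS Sharpe)."""
    best_is, best_oos = -math.inf, 0.0
    for _ in range(n_strategies):
        returns = [rng.gauss(0.0, 1.0) for _ in range(2 * half_length)]
        sr_is = sharpe(returns[:half_length])
        if sr_is > best_is:
            best_is, best_oos = sr_is, sharpe(returns[half_length:])
    return best_is, best_oos


rng = random.Random(42)
results = [select_best_in_sample(rng) for _ in range(100)]
mean_is = statistics.mean(r[0] for r in results)
mean_oos = statistics.mean(r[1] for r in results)
print(f"mean IS Sharpe of selected strategy:  {mean_is:.3f}")
print(f"mean OOS Sharpe of selected strategy: {mean_oos:.3f}")
```

The selected strategy's average IS Sharpe ratio is clearly positive, while its average OOS Sharpe ratio stays close to zero: selection inflates the backtest but, without memory, does not push OOS performance below zero.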
Overfitting in the presence of memory
Memory effects may cause OOS performance to be negative, even though the underlying process was trendless.
Also, a strongly negative linear relation between IS and OOS performance may arise, indicating that the more we optimize in-sample, the worse OOS performance becomes.
[Figure: regression of OOS Sharpe ratio on IS Sharpe ratio.] The p-values associated with the intercept and the in-sample performance (SR a priori) are 0.5005 and 0 respectively, indicating that the negative linear relation between IS and OOS Sharpe ratios is statistically significant.
Conclusion: When financial analysts do not control for overfitting, “Past performance is not an indicator of future performance” is too optimistic! Good backtest performance may be an indicator of negative future results.
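One simple way to see how memory can produce a negative IS/OOS relation is a global constraint: if the full-sample mean return is pinned down, whatever is gained in the first half must be given back in the second. The sketch below assumes that mechanism (by demeaning each simulated path) purely for illustration; the slide does not say which process generated its figure, and all names here are hypothetical.

```python
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio (mean over standard deviation)."""
    return statistics.mean(returns) / statistics.stdev(returns)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def constrained_is_oos(rng, half_length=120):
    """One return path whose full-sample mean is forced to zero (the assumed
    global constraint), split into equal IS and OOS halves."""
    raw = [rng.gauss(0.0, 1.0) for _ in range(2 * half_length)]
    overall_mean = statistics.mean(raw)
    returns = [r - overall_mean for r in raw]
    return sharpe(returns[:half_length]), sharpe(returns[half_length:])


rng = random.Random(7)
pairs = [constrained_is_oos(rng) for _ in range(500)]
sr_is = [p[0] for p in pairs]
sr_oos = [p[1] for p in pairs]
corr = pearson(sr_is, sr_oos)
print(f"correlation between IS and OOS Sharpe ratios: {corr:.3f}")
```

Under this constraint the IS and OOS means are exact opposites, so the correlation is strongly negative, and picking the configuration with the best backtest amounts to picking the one with the worst OOS result.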
Tools to prevent backtest overfitting
1. Compute the probability of backtest overfitting, using a formula given in our paper “The probability of backtest overfitting,” available at http://ssrn.com/abstract=2326253 or http://www.financial-math.org.
2. Compute performance degradation and probability of loss (also given in the above paper).
3. Apply the theory of stochastic dominance, which allows us to rank investment strategies without having to make assumptions regarding an individual’s utility function (see above paper for details).
4. Perform model sequestration: Announce a proposed investment strategy to others (either publicly, or within a firm), then subsequently publish the results of using this strategy for a pre-specified period of time.
§ See D. Leinweber and K. Sisk, “Event Driven Trading and the ‘New News’,” Journal of Portfolio Management, vol. 38(1), pg. 110-124.
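As a rough illustration of item 1, the sketch below estimates the probability of backtest overfitting with a combinatorially symmetric cross-validation scheme of the kind described in the cited paper. It is a simplified reading, not the authors' reference implementation: the function name, the choice of four blocks, the Sharpe ratio as performance measure, and the treatment of ties are all assumptions.

```python
import itertools
import math
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio; zero when the series has no variation."""
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0


def probability_of_backtest_overfitting(returns_by_strategy, n_blocks=4):
    """returns_by_strategy: equal-length return series, one per trial.
    Splits time into n_blocks, uses every half of the blocks as IS and the
    rest as OOS, and reports how often the IS winner ranks at or below the
    OOS median."""
    if n_blocks % 2:
        raise ValueError("n_blocks must be even")
    n_strats = len(returns_by_strategy)
    block = len(returns_by_strategy[0]) // n_blocks
    logits = []
    for is_blocks in itertools.combinations(range(n_blocks), n_blocks // 2):
        is_sr, oos_sr = [], []
        for series in returns_by_strategy:
            is_part, oos_part = [], []
            for b in range(n_blocks):
                chunk = series[b * block:(b + 1) * block]
                (is_part if b in is_blocks else oos_part).extend(chunk)
            is_sr.append(sharpe(is_part))
            oos_sr.append(sharpe(oos_part))
        best = max(range(n_strats), key=is_sr.__getitem__)
        # OOS rank of the IS winner: 1 = worst, n_strats = best.
        rank = sum(1 for s in oos_sr if s <= oos_sr[best])
        omega = rank / (n_strats + 1)
        logits.append(math.log(omega / (1 - omega)))
    return sum(1 for lam in logits if lam <= 0) / len(logits)


rng = random.Random(1)
noise = [[rng.gauss(0.0, 1.0) for _ in range(80)] for _ in range(10)]
pbo_noise = probability_of_backtest_overfitting(noise)

# Replace the first trial with one that has a large, genuine edge.
skilled = [[rng.gauss(5.0, 1.0) for _ in range(80)]] + noise[1:]
pbo_skilled = probability_of_backtest_overfitting(skilled)
print(pbo_noise, pbo_skilled)
```

When one trial has a real and large edge it wins both in-sample and out-of-sample in every split, so the estimate is zero; for pure noise the IS winner has no reason to rank well OOS, and the estimate is typically far from zero.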
Reproducibility in finance
Using rigorous methods in mathematical finance (e.g., to prevent backtest overfitting) enhances reproducibility and reliability:
· Many other scientific disciplines face similar reproducibility issues and are working to overcome the bias of publishing only “good” results.
· There is a growing movement in the pharmaceutical industry to require the results of all prototype drug tests to be made public. See http://www.alltrials.net.
· Johnson & Johnson recently announced it will make all test results public.
· Mathematicians and computer scientists are setting standards for reproducibility in the field of scientific computing. See:
§ V. Stodden, D. Bailey, J. Borwein, E. LeVeque, W. Rider, and W. Stein, “Setting the default to reproducible: Reproducibility in computational and experimental mathematics,” February 2013, available at …rt.pdf (URL truncated in the source).
An absurd investment program
· An investment advisor initially sends 10,240 letters to prospective clients. In 5,120 of these letters, she predicts that a certain set of securities will go up; in the other 5,120 she predicts they will go down.
· One month later, if the securities have gone up, she sends another letter to the first 5,120 and ignores the second 5,120 (or the reverse if the securities have gone down). In 2,560 of these letters, she predicts the securities will go up; in the other 2,560, she predicts the securities will go down.
· One month later, if the securities have gone up, she sends another letter to the first 2,560 and ignores the second 2,560 (or the reverse if the securities have gone down). In 1,280 of these letters, she predicts the securities will go up; in the other 1,280, she predicts the securities will go down. This is repeated for ten months.
· After ten months, the remaining 10 investors, astounded by the advisor’s uncanny prophetic powers to date, will entrust all their money to her.
Clearly this is an absurd, even fraudulent investment program, because investors are never told of the many other failed recommendations.
But why is backtest overfitting, where one does not disclose how many models were tested, any different?
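The arithmetic of the scheme is easy to verify. The sketch below (function name and coin-flip market are illustrative assumptions) halves the mailing list after each monthly prediction and shows that the number of recipients who have seen only correct calls does not depend on what the market actually does.

```python
import random


def run_letter_scheme(n_recipients=10_240, n_months=10, seed=None):
    """Each month, half the remaining recipients are told "up" and half
    "down"; only the half that received the correct call is kept."""
    rng = random.Random(seed)
    recipients = list(range(n_recipients))
    for _ in range(n_months):
        half = len(recipients) // 2
        told_up, told_down = recipients[:half], recipients[half:]
        market_went_up = rng.random() < 0.5  # the advisor has no foresight
        recipients = told_up if market_went_up else told_down
    return len(recipients)


survivors = run_letter_scheme(seed=0)
print(survivors)  # 10,240 / 2**10 = 10, whatever the market does
```

Ten recipients always end up with a perfect ten-month record, just as the best of many undisclosed backtest trials always ends up with an impressive Sharpe ratio.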
Why the silence in the mathematical finance community?
· Historically, scientists have led the way in exposing those who utilize pseudoscience to extract a commercial benefit; e.g., in the 18th century, physicists exposed the nonsense of astrologers.
· Yet financial mathematicians in the 21st century have remained disappointingly silent about those who, knowingly or not:
§ Fail to disclose the number of models that were used to develop a scheme (i.e., backtest overfitting).
§ Make vague predictions that do not permit rigorous testing and falsification.
§ Misuse charts and graphs: “Beware of fund managers bearing double y-axes.” See Matthew O’Brien’s article on the “scary chart” in The Atlantic (11 Feb 2014).
§ Misuse probability theory, statistics and stochastic calculus.
§ Misuse technical jargon: “stochastic oscillators,” “Fibonacci ratios,” “cycles,” “Elliott wave,” “Golden ratio,” “parabolic SAR,” “pivot point,” “momentum,” and others in the context of finance.
· Our silence is consent, making us accomplices in these abuses.
“One has to be aware now that mathematics can be misused and that we have to protect its good name.” [Andrew Wiles, New York Times, 4 Oct 2013]
Mathematicians Against Fraudulent Financial and Investment Advice (MAFFIA)
http://www.financial-math.org (main site)
http://www.m-a-f-f-i-a.org (alias to main site)
http://www.financial-math.org/blog/ (blog)
The principal purpose is education, not confrontation: helping readers recognize and avoid fallacies and abuses in mathematical finance.
For full technical details on the material in this talk:
· “Pseudo-mathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance,” Notices of the American Mathematical Society, to appear (May 2014), http://ssrn.com/abstract=2308659.
· “The probability of backtest overfitting,” manuscript, 10 Feb 2014, available at http://ssrn.com/abstract=2326253.
· These papers (and this talk) are also available at http://www.financial-math.org.
THANK YOU!