Over the last two weeks I’ve been getting my hands dirty backtesting options strategies and collaborating with a few readers on the findings.
At face value it’s a simple idea: set the position open and close criteria, then click “Go.” For those with an IT background, saying “backtest a strategy” is comparable to saying “image a computer.” Backtesting, like imaging a computer, is very much a process and not a task. As such, it inherits all the attributes and challenges of a process and follows the same workflow: design, build, test, run, validate, improve.
In this post we’ll explore the process of backtesting and highlight elements of the learning curve to successfully execute a backtest.
Options, like securities, have a unique identifier called a CUSIP, an acronym for Committee on Uniform Security Identification Procedures. Each unique option configuration has its own CUSIP. For example, a SPY 290 PUT expiring July 5 2019 has a unique CUSIP, which is different from a SPY 289.50 PUT expiring July 5 2019, which is different from a SPY 289.50 CALL expiring July 5 2019. Every combination of underlying, strike price, expiration date, and type (call or put) has a unique CUSIP.
Options have vastly more CUSIPs than securities. For example, SPY has 1 CUSIP while SPY options have more than 13,600 (over 200 strikes * 2 instruments [call and put] * 34 or more active expiration dates).
Each of those 13,600 CUSIPs then gets an end-of-day price. If we increase our resolution from a single end-of-day price to 1-minute intraday granularity, the finest resolution Cboe offers for download, we multiply by 540 (9 hours [9am – 6pm options trading hours for SPY] * 60 minutes) for a total of over 7,344,000 SPY option prices per day.
Actually, there are twice as many – the bid price and the ask price. The midpoint may or may not be the price at which an option trades and would be a derived third value for each minute of each CUSIP.
Of course, we want to run a backtest over the longest span of time possible, not just 1 day. Multiply the 7.344 million records per day by 3024 (12 years * 252 trading days/year) and we get over 22.2 billion records from Jan 3 2007 through Dec 31 2018 to backtest SPY alone. Want to compare the VIX spot price against SPY throughout the backtest? Double it to 44.4 billion records.
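The arithmetic above is easy to sanity-check in a few lines (the strike and expiration counts are this article’s estimates, not exact Cboe figures):

```python
# Rough sizing of a SPY options dataset, using the article's estimates.
strikes = 200
instruments = 2                  # call and put
expirations = 34                 # active expiration dates
cusips = strikes * instruments * expirations      # 13,600 contracts

minutes_per_day = 9 * 60         # 9am-6pm SPY options trading hours
prices_per_day = cusips * minutes_per_day         # 7,344,000 per day

trading_days = 12 * 252          # Jan 2007 - Dec 2018
spy_records = prices_per_day * trading_days       # ~22.2 billion
spy_plus_vix = spy_records * 2                    # ~44.4 billion
```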
In part because of this scale and velocity of data, no free or open-source tools exist to backtest options. Equities have far fewer permutations and far more manageable historic data, which is why free tools like Portfolio Visualizer can exist to facilitate securities backtesting.
There are two ways to backtest options: buy data and self-analyze, or subscribe to an option backtesting tool.
Buy Data and Self Analyze
The data logistics of sourcing, storing and accessing over 22 billion records per underlying security is an appreciable and expensive IT problem to solve at a local level. Let’s simplify by renting an IT infrastructure such as AWS.
Using an estimated 6TB [compressed] data requirement and Amazon S3 storage costs of $0.023/GB/month for the first 50TB, we come to $141.31/month (6 * 1024 * 0.023) simply to park the data in a datacenter somewhere.
Let’s factor in the costs to query this data.
According to Amazon Athena pricing we’d be looking at $5 per 1TB of data scanned.
I’m going to take a SWAG (scientific wild-ass guess) and say the 6TB consists of 6 underlyings at 1TB each. Continuing the SWAG, we’ll need half of the 12 attributes [per the Cboe sample file] in each purchased file to run a backtest, and each attribute has identical data sizing.
Running a single backtest with SPY and VIX would require scanning 1TB (1TB per ticker * 2 tickers * 0.5 of the fields in each ticker file) and cost $5.
The results of this query will generate a trade log for each backtest. If we want to run a simple backtest of naked SPY puts with the following configurations:
- 5 delta, 10 delta, 16 delta, 30 delta, ATM, 1-10% OTM, 1-2% ITM (17 unique open criteria)
- 25% max profit or 21 DTE
- 50% max profit or 21 DTE
- 75% max profit
We’ll end up with 68 backtests (17 unique open criteria * 4 unique close criteria) @ 1TB each * $5 = $340. As I’ll explain later, the learning curve to generate 68 viable backtests will push this number ~50% higher, to about 102 runs. Let’s call it $510.
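The query-cost math works out like this (a sketch using the SWAG figures above):

```python
# Estimated Athena cost for the naked-put backtest grid,
# using the SWAG figures from the text (1TB per ticker, half the fields).
ATHENA_PER_TB = 5.00                    # $5 per TB scanned

tb_scanned = 1.0 * 2 * 0.5              # SPY + VIX, half the columns = 1TB
open_criteria = 17                      # deltas, ATM, OTM/ITM offsets
close_criteria = 4                      # profit-target / 21 DTE combos

backtests = open_criteria * close_criteria             # 68
base_cost = backtests * tb_scanned * ATHENA_PER_TB     # $340
with_learning_curve = base_cost * 1.5                  # ~102 runs, ~$510
```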
Once we have the 68 trade logs we’ll need to compute stats like max drawdown, Sharpe ratio and win rate for each log, and build a performance curve overlaid against a benchmark such as SPY total returns. This can be performed locally in Excel at no explicit cost.
Buy and Analyze Summary
All of this, of course, requires a bit of IT savvy: skills in infrastructure and data-structure concepts, software development in a query-oriented language such as SQL, and a decent amount of time to dedicate to the activities.
Cboe data is $275 per ticker for EOD pricing up to 10 tickers, after which “all” tickers can be had for $2750. For 1-minute resolution, each ticker is $1100 and all tickers cost $38.5k. Add greeks and other calculations to the 1-minute data and SPY sells for just under $1700, with all tickers costing just over $59k.
Suppose we buy only SPY and VIX EOD pricing data for $550 @ 1TB each. Add in $47/month in storage for 2TB, $510 for the 68 backtests listed above, and say 80 hours @ $75/hr over 4 weeks (working half days) of software-development learning curve, building, testing and refining (which also incur Amazon query costs not factored here) to produce the trade logs and Excel calculations, and we have roughly $7107 for the estimated 1-month cost.
Subscribe to a Backtesting Tool
Instead let’s consider outsourcing the logistics to an entity that already solved this problem. My testing of platforms consisted of OptionStack and ORATS. I wrote about my experience using OptionStack in my last post.
Below is a table of the features offered between these two platforms and an offering from OptionAlpha.
ORATS has the lowest barrier to entry of the backtest approaches explored. The lack of build-your-own logic helps make this tool very turnkey at the cost of limited logic design for trade entry and exit. For example, there is no way to have ORATS open trades on specific days of the week. If this is a key part of the backtest strategy you’ll probably want to explore OptionStack. Hence, I was once again thwarted in my attempt to formally backtest BigERN’s strategy.
Update June 17 2020: I was finally able to backtest BigERN’s strategy
The process is essentially two steps: build a trade log from raw historic data, then analyze the log, calculating statistics such as win rate, P/L, etc.
Transcribe a written strategy into a backtesting tool.
This means identifying the logic flow to scan raw Cboe data, or identifying the knobs and levers available in each backtesting tool to manipulate. Questions like “What does opening a position M, W, F look like algorithmically and/or in the backtester?” are answered here.
Write the code and/or build the logic in the backtester.
OptionStack, with its Lego-like blank canvas for logic design, has a learning curve but allows for virtually limitless trade design. They offer several examples of working logic designs that can be referenced to expedite the learning curve in the build process.
ORATS has clearly-labeled and documented entry and exit settings listed in the left column that can be easily set.
Run code against a subset of the historic data to ensure logic is accurate and/or debug.
If writing code or using OptionStack, you can test logic against 1 month of historic data without it counting against the monthly backtest limit. This allows us to validate logic while limiting Amazon query costs and time spent debugging.
ORATS does not have this feature (each backtest counts against monthly allotments). However, it can be argued it’s not necessary due to the non-blank-slate nature of building trade logic in their tool.
Run the code or click “Run” and execute against the full range of dates desired.
I haven’t tested running queries in AWS so I’m not familiar with how time intensive this activity is.
OptionStack executed each 1-year backtest in about 60 seconds on my 2013 MacBook Air with 8GB RAM, i7 CPU and 512GB SSD. Remember, this runs locally in the browser and is CPU and RAM intensive.
ORATS executed backtests from 2007 to present in about 3-6 minutes depending on whether the job was submitted during off-peak or business hours, respectively.
Review the trade log and strategy statistics for irrational outputs.
As I mentioned in my last post, OptionStack results appear to be inaccurate. On the flip side, they have excellent trade logs that adhere to data structure best practices. The console logs provide useful output that highlights when trade logic triggers but a trade is not placed due to no positions matching subsequent logic criteria. With a little effort I was able to build some spreadsheets that generate key statistics based on the OptionStack trade logs.
ORATS appears to have accurate trade stats, but their trade logs don’t adhere to data-structure best practices. For example, selling a position and buying it back are both on the same line with some math already calculated. Best practice would list the subsequent buyback as a separate transaction elsewhere in the log. Closing a position is a unique transaction and should be recorded as such.
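As a sketch of what best practice looks like, a combined row can be split into two transaction records; the column names here are hypothetical examples, not the actual ORATS export schema:

```python
# Split an ORATS-style combined row (open and close on one line) into
# one record per transaction. Field names are hypothetical examples.
def normalize_row(row):
    """Return separate open and close transactions from a combined row."""
    open_txn = {
        "date": row["open_date"],
        "action": "SELL_TO_OPEN",
        "strike": row["strike"],
        "price": row["open_price"],
    }
    close_txn = {
        "date": row["close_date"],
        "action": "BUY_TO_CLOSE",
        "strike": row["strike"],
        "price": row["close_price"],
    }
    return [open_txn, close_txn]

combined = {"open_date": "2019-05-01", "close_date": "2019-05-20",
            "strike": 270.0, "open_price": 1.85, "close_price": 0.92}
transactions = normalize_row(combined)
```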
Another limitation of ORATS is the inability to advise when trade logic executes but no matching positions are found. For example, if the backtest intention is to sell 10 delta puts on SPY but there are no 10 delta positions available on a given day (there may be 10.2 or 9.9 delta positions), ORATS does not speak up. This can severely limit the number of trades that occur in a backtest and subsequently skew the results. More on this in a minute.
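To illustrate the failure mode, here’s a minimal sketch of strike selection by delta that flags the days a backtester could silently skip (the chain data is made up):

```python
# Pick the strike closest to a target delta, flagging days where no
# strike falls within tolerance -- the case that gets silently skipped.
def select_by_delta(chain, target, tolerance):
    """chain: list of (strike, delta) pairs. Returns (strike, status)."""
    candidates = [(abs(delta - target), strike)
                  for strike, delta in chain
                  if abs(delta - target) <= tolerance]
    if not candidates:
        return None, "NO_MATCH"    # surface the skipped day, don't hide it
    return min(candidates)[1], "OK"

# Deltas of 10.2, 9.9 and 9.5 are available, but never exactly 10:
chain = [(275.0, 0.102), (274.0, 0.099), (273.0, 0.095)]
strike, status = select_by_delta(chain, target=0.100, tolerance=0.0005)   # misses
strike2, status2 = select_by_delta(chain, target=0.100, tolerance=0.005)  # finds 274.0
```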
Adjust the logic parameters to capture the intent of the backtest.
As the backtest results are generated and abnormalities manifest, such as fewer trades than expected or a wildly different largest loss between tests, a deep dive into the trade logs is necessary to see what happened and understand the “why” behind the discrepancy. With OptionStack the abnormalities were a matter of inaccurate statistics calculations. With ORATS it was the silently omitted trades.
You may have noticed in the ORATS screenshot that the strike selection value for the backtest is actually a range. It targets 10% OTM but accepts values from 9.5% to 10.5% OTM, choosing the strike closest to 10%. The intent of this backtest is to review the performance of selling a 10% OTM SPY put daily and exiting based on different management approaches.
If we are targeting a specific delta for backtesting why use a range?
One word: intent.
The actual number of occurrences where a position is exactly the target value is surprisingly small. When I was backtesting SPY across different deltas and OTM ranges the total trade occurrences were as much as 25% fewer than expected when targeting specific values as opposed to ranges.
Here’s an after-hours screenshot from Interactive Brokers on May 28 2019 depicting SPY deltas. VIX closed at 17.50. Suppose we wanted to target 16 delta positions in a backtest.
SPY is at 280.48, so each $1 strike increment is roughly a 0.36% step. If the underlying notional is smaller, such as MU (Micron) at $33, each $1 increment is 3.03%. Some options chains have $0.50 increments on some (MU) or all (IWM) near-the-money strikes, easing the gap. Nevertheless, the smaller the underlying, the larger each strike increment is as a % of the underlying, and the more amplified the impact.
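The granularity math is a one-liner:

```python
# A strike increment as a percentage of the underlying price: the
# smaller the underlying, the coarser the strike grid relative to spot.
def increment_pct(spot, increment=1.0):
    return increment / spot * 100

spy = increment_pct(280.48)            # ~0.36% per $1 strike
mu = increment_pct(33.00)              # ~3.03% per $1 strike
mu_half = increment_pct(33.00, 0.50)   # $0.50 strikes halve the gap
```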
As VIX (or IV for a specific stock) increases, the issue is heightened because the gap between adjacent strikes’ deltas widens. During the most volatile periods, precisely when trade performance matters most, the majority of trades are skipped as deltas can vary significantly between strikes. This is partially offset by 16 delta strikes moving further OTM and, as a rule, the deltas of farther-OTM strikes are typically closer together than those of near-the-money strikes.
The first thought I had was “How are other people solving this challenge?” I know TastyTrade does regular backtesting. Unfortunately they don’t speak to this concept or any specific details of methodology beyond the highest of high level.
I’m left with two choices: don’t adjust or define a range.
Since the idea is to test the intent of delta performance and not exact delta performance, I chose to define a range.
What’s the methodology I used to define a range?
I’d prefer a more formal approach, but an empirical one seemed to do the job with minimal effort: I ran backtests with a specified range and reviewed the resulting trade count.
When anchoring trade entry logic to delta, if the total number of trades was less than 95% of the expected number of trades, I’d widen the delta range by 0.5. I repeated this until the trade count reached 95% of the expected number or higher.
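The widening procedure sketched in code; `run_backtest` is a hypothetical stand-in for whatever actually produces a trade count (ORATS, OptionStack, or a local query):

```python
# Empirical range-widening: grow the delta range by 0.5 until the
# backtest produces at least 95% of the expected number of trades.
def find_delta_range(run_backtest, target_delta, expected_trades,
                     threshold=0.95, step=0.5, max_width=10.0):
    width = 0.0
    while width <= max_width:
        trades = run_backtest(target_delta, width)
        if trades >= threshold * expected_trades:
            return width, trades
        width += step            # widen and retry
    raise ValueError("no range up to max_width hit the trade-count threshold")

# Toy stand-in: each 0.5 of width recovers another 10% of expected trades.
expected = 3000
toy = lambda delta, width: int(expected * min(1.0, 0.70 + 0.10 * (width / 0.5)))
width, trades = find_delta_range(toy, 16, expected)
```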
When anchoring trade entry logic to % OTM, this was mostly a non-issue for a few reasons:
- distance from spot price is not impacted by anything beyond the size of the strike increments
- all increments between 2% ITM and 10% OTM were tested using 1% increments +/- 0.5%. For example, desiring a target of 8% OTM I would select the strike between 7.5-8.5% OTM closest to 8%.
- any positions missed by one backtest would be picked up by another. Since the performance of backtests is reviewed in aggregate (i.e. I’m not hanging my hat on a single strategy in isolation), any trades missed by the 8% OTM backtest will be caught by the 9% or 7% OTM backtest.
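The ±0.5% bucket rule can be sketched as follows (written for short puts, where OTM strikes sit below spot; the strike grid is made up):

```python
# Pick the strike closest to a target % OTM within a +/-0.5% band.
# Written for short puts, where OTM strikes sit below the spot price.
def pick_otm_strike(spot, strikes, target_pct, band=0.5):
    lo, hi = target_pct - band, target_pct + band
    in_band = [(abs((spot - k) / spot * 100 - target_pct), k)
               for k in strikes
               if lo <= (spot - k) / spot * 100 <= hi]
    return min(in_band)[1] if in_band else None

spot = 280.0
strikes = [k / 2 for k in range(500, 570)]                # $0.50 grid, 250.0-284.5
strike = pick_otm_strike(spot, strikes, target_pct=8.0)   # closest to 8% OTM
```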
When we first started backtesting, one of the mechanics involved opening a single position, then opening a new one immediately after the first closed. This seemed like a reasonable strategy to test, and we ran a 5 delta and a 16 delta test on SPY using 50% max profit exit mechanics. After reviewing the stats, the 5 delta strategy had a “worst trade” stat more than double that of the 16 delta strategy, which didn’t make sense.
I reviewed the trade logs of each strategy and noticed the 16 delta approach managed to skate past the worst of the financial crisis due to “lucky” timing.
Visually, this is what happened:
Every TastyTrade study that starts with: “we opened a position every month…” is subject to this backtesting “gotcha” and is at risk of having significantly skewed outputs and takeaways as a result. The same applies, albeit to a lesser extent, to trades opened “every week”.
To approach the true performance characteristics of a trading strategy we need to open a position no less than daily. Ideally, 1-minute granularity would allow us the best insight into strategy performance, but none of the backtesters offered intraday logs.
This introduces a new challenge: testing fixed-capital-allocation strategies.
If the intent of a backtest strategy is to measure performance of a fixed amount of capital or to set a risk ceiling, things get complicated. We can:
- accept the potentially-misleading results due to [un]lucky timing
- manually stitch together different cohorts of trades so that capital exposure remains constant, but then risk being accused of cherry-picking the best / worst results based on the timing of the initial trade and the argument we’re trying to make
- use a fixed offset and run the backtest across all offsets to identify the variance in strategy performance attributable to timing luck (this is what the authors of the referenced SSRN paper did in their study – portfolio “tranches” is the phrase they use).
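The tranche approach can be sketched like this; `run_backtest` is a hypothetical stand-in that returns total P/L for a given start-day offset:

```python
# Tranche-style offsets: run the same (e.g. monthly-entry) backtest once
# per possible start-day offset, then measure the spread of outcomes to
# quantify how much of the result is timing luck.
def tranche_results(run_backtest, entry_interval_days):
    results = [run_backtest(offset) for offset in range(entry_interval_days)]
    spread = max(results) - min(results)   # variance attributable to timing
    return results, spread

# Toy example where P/L depends only on the start offset (pure timing luck):
results, spread = tranche_results(lambda offset: 100 + 10 * offset, 21)
```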