Monthly Outlook: What Machine Learning Models Tell Us about Crypto

Has crypto market performance in 1H24 been driven more by macro factors or fundamentals?

July 9, 2024

Key takeaways

  • We employ two commonly used machine learning algorithms - linear regression and random forest models - to help us determine what the most important features in 1H24 have been.
  • Analysis of the last six months shows the key drivers of returns for BTC, ETH and SOL were predominantly token-specific metrics, although macro factors played a role as well.

Written by

  • David Duong, CFA - Head of Institutional Research

Introduction

Linear regression and random forest models are two commonly used machine learning algorithms that help provide valuable insights into the factors influencing cryptocurrency price movements. The difference is that where linear regression models provide a relatively straightforward interpretation of how certain factors directly impact cryptocurrency values, random forest models capture the intricate, non-linear relationships and interactions between various input factors. Ultimately, our goal is to determine whether the relationships that may have driven crypto performance in the first half of 2024 could be relevant in the second half of this year.

Our results suggest that over the last six months, the core drivers of BTC, ETH and SOL performance have been predominantly unique to those tokens, such as network fees and total value locked – or spot ETF flows in the case of bitcoin. That is, despite higher correlations between crypto and traditional asset classes in recent months, the most salient variables influencing crypto returns still tend to be more idiosyncratic in nature. This suggests that cryptocurrencies offer important portfolio diversification benefits, such as mitigating risk and enhancing overall returns, per our previous work on this topic.

However, there are some things that we weren’t able to fully capture in our models, such as shifts in the US regulatory environment. For example, we think the House of Representatives’ decision to approve the Financial Innovation and Technology for the 21st Century Act (FIT21) may have contributed to a change in market sentiment earlier this year. With the US elections likely to be an important driver in 2H24, this could be a notable omission.

Background

We utilize two types of feature importance charts in this report for each of the cryptocurrencies BTC, ETH and SOL. Both models rely on machine learning algorithms to help us identify which variables best explain crypto’s performance in the first half of 2024 (see our Appendix for the full list of factors tested in our models):

  • Our first model shows the coefficients of a commonly used multivariate linear regression analysis, indicating how much different exogenous and endogenous factors may have contributed to spot price over the last six months within a linear framework
  • Our random forests model is by comparison non-linear and generates Shapley Additive Explanations (SHAP values) that tell us how important each factor was in making price predictions over the last six months based on an iterative decision tree analysis

The results of our linear regression model are generated by scaling 18 to 32 different factors and finding the line that best fits the data points. Because we standardize the features in our model to unit variance, the coefficients should be interpreted as the change in the USD price of either BTC, ETH or SOL for a one standard deviation change in a given feature. The results are thus a fairly straightforward way to understand how each factor has directly impacted a given cryptocurrency's value.
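
To make the interpretation of standardized coefficients concrete, here is a minimal sketch on synthetic data. The feature names, scales and price relationship are assumptions made up for illustration; they are not the report's data or its actual model code. After scaling each feature to unit variance, every coefficient reads as the USD price change for a one standard deviation move in that feature, so features on very different scales become directly comparable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two model inputs on very different scales
# (hypothetical names and values; not the report's data).
n = 500
etf_flows = rng.normal(0.0, 300.0, n)    # e.g. US$ millions per day
network_fees = rng.normal(0.0, 2.0, n)   # e.g. US$ per transaction
price = 60_000 + 5.0 * etf_flows + 400.0 * network_fees + rng.normal(0.0, 100.0, n)

X = np.column_stack([etf_flows, network_fees])

# Standardize each feature to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones(n), X_std])
beta, *_ = np.linalg.lstsq(design, price, rcond=None)
coefs = beta[1:]

# Drop the sign so features are ranked by magnitude only.
importance = np.abs(coefs)
for name, value in zip(["etf_flows", "network_fees"], importance):
    print(f"{name}: {value:,.0f} USD per one standard deviation")
```

In this toy setup the raw slope on fees (400) is far larger than the raw slope on flows (5), yet flows rank first once both are expressed per standard deviation, which is the reason for scaling before comparing coefficients.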

By contrast, random forest is a much more complex model composed of multiple decision trees that captures intricate relationships and interactions between various input factors. It takes a global view of feature importance based on the entire dataset, and most importantly minimizes the problem of multicollinearity (see footnote 1). Ideally, if two factors in our model happen to be correlated, the model selects one and reduces the relative importance of the other, though this is not always guaranteed (which we explain in our Appendix). We first explored the use of random forest models in our April 2022 monthly outlook report, which covered our methodology in more detail.

Reading the results of these models should be fairly intuitive. We have removed directionality from our outputs in order to focus only on the magnitude of the impact that each feature has on crypto prices. Effectively, a larger SHAP value or coefficient implies a larger impact on the crypto price.

The major difference is that SHAP values consider how different combinations of features together influence price. The trade-off is that this is a computationally expensive process and the relationships between variables may be hard to interpret. By comparison, our regression coefficients are based on a straightforward, linear framework, but this may not accurately capture the nature of the relationships between different factors and the crypto price.

Bitcoin (BTC) Results

It should surprise relatively few that both of our models identified US spot bitcoin ETF flows as the most important feature driving bitcoin performance over the past six months. In fact, our random forests model suggests the relative importance of flows to performance is nearly 2.83x greater than the second most important feature, commodity prices. Since the ETFs were launched in January, they have attracted US$14.8B in net inflows and total AUM has climbed to US$49.1B.

Note that on a tick-by-tick basis, however, these flows have appeared inconsistent with the price action in recent months, particularly on the upside. In our Midyear Review: Crypto Markets in 10 Charts, we explained that starting in April, many of these ETF flows started to be offset by short CME futures positions, which we suspect has neutralized the effect on price. Moreover, the BTC under custody by these ETF issuers has remained range-bound between 825k and 850k since that time.

Interestingly, our linear regression model indicates that idiosyncratic features like active addresses and network fees are the second and third most relevant drivers for bitcoin performance after flows. That is, this model suggests these features have direct, linear effects on bitcoin prices. This diverges from our random forests model, which indicates commodity prices are the second most important feature, followed by fees. We think this suggests that commodity prices may be a good proxy indicator for the broader economic conditions impacting bitcoin price, more so than the other macro variables tested in our model, such as US equities.

Ether (ETH) Results

Our random forest model for ETH seems to select features that are more endogenous to the Ethereum network than our corresponding linear regression model. Indeed, the feature with the largest SHAP value is total value locked or TVL (normalized by the price appreciation of ETH), whereas the most important feature in our linear regression model is the performance of US investment grade bonds, which may reflect the effects of broader macroeconomic conditions such as interest rates, economic growth, and/or inflation expectations. That said, TVL is the second most important feature in our linear regression model followed by fees, suggesting idiosyncratic factors are still very important overall.

The SHAP value of TVL is over 5x greater than the next highest feature in our random forest model – total ETH staked – which we included as a proxy for the inactive supply of ETH. This large difference in magnitude suggests that increases or decreases in TVL on the Ethereum network likely have a much stronger impact on the model’s predictions for ETH price than other factors, based on the last six months of data.

However, what’s not captured by these results is reflexivity. Our interpretation is that as more value is locked in the Ethereum network, this likely demonstrates greater usage and adoption of its applications, thus increasing the value of ETH itself. But equally, an increase in the ETH price also increases the value of the ETH collateral sitting in the decentralized finance (DeFi) protocols that make up the bulk of the Ethereum ecosystem. That in turn could increase overall (ETH-denominated) TVL as users may borrow assets to further deploy in other protocols. With the spot ETH ETFs expected to launch within the next few weeks, we thus think this could strengthen the relevance of TVL as a predictor of short-to-medium term ETH price movements, given the positive effect that flows could have on price.

Solana (SOL) Results

Compared to our ETH models, our machine learning models for SOL are more consistent with each other in identifying the influential features affecting the token price. Among the top three features in both our linear regression and random forest models are the multilateral USD index (DXY) and the total value locked on Solana. Both DXY and Solana TVL appear to have a linear relationship with SOL price and may also capture nonlinear interactions, which would explain their consistent importance across different machine learning techniques.

Cryptocurrencies are typically priced in USD (or their stablecoin equivalent) on most exchanges, so prices tend to be inversely correlated to movements in the DXY index. A linear relationship between changes in the DXY and changes in SOL price thus makes sense, but the DXY also tends to encapsulate a myriad of other macroeconomic and sociopolitical conditions that have the potential to influence investor sentiment and risk appetite. However, our random forest model indicates that the value locked into Solana contracts has a greater impact on SOL’s price expectations, likely highlighting the meaningful growth observed in the ecosystem year-to-date.

Indeed, the TVL on Solana has more than doubled from SOL13.2M ($1.5B) at the start of the year to SOL27.4M ($3.4B) as of end-June. This has partly had to do with the resurgence of Solana’s DeFi ecosystem during the latest crypto market cycle, led by a concentration of airdrop and memecoin activity. The low fees and high throughput of Solana have made it an attractive platform for memecoin projects, which in turn has attracted more users to the network – consequently boosting TVL as more assets are locked in various DeFi protocols.
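
The TVL figures quoted above can be separated into a token-denominated change and a price effect with a few lines of arithmetic. The sketch below uses only the four rounded numbers from the text, so the implied SOL prices are approximations derived from those roundings rather than quoted market prices.

```python
# Figures quoted in the text (start of year vs end of June).
tvl_sol_start, tvl_sol_end = 13.2e6, 27.4e6   # TVL denominated in SOL
tvl_usd_start, tvl_usd_end = 1.5e9, 3.4e9     # TVL denominated in USD

sol_growth = tvl_sol_end / tvl_sol_start      # growth in token terms
usd_growth = tvl_usd_end / tvl_usd_start      # growth in dollar terms

# SOL price implied by each pair of (rounded) figures.
implied_price_start = tvl_usd_start / tvl_sol_start
implied_price_end = tvl_usd_end / tvl_sol_end
price_growth = implied_price_end / implied_price_start

print(f"TVL growth in SOL terms: {sol_growth:.2f}x")
print(f"TVL growth in USD terms: {usd_growth:.2f}x")
print(f"Implied SOL price: ${implied_price_start:.0f} -> ${implied_price_end:.0f}")
```

Because USD growth equals token growth multiplied by price growth, most of the increase here comes from more SOL being locked rather than from price appreciation, which is why a token-denominated TVL measure can be more informative than the dollar figure.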

Conclusion

Our model results suggest that although the macro environment continues to be relevant for crypto performance, fundamentals like network activity and user adoption are likely to remain dominant forces for the market in 2H24. This is consistent with our view that the recent price action in June had as much to do with a lack of directional macro momentum as it did with technical factors, like token supply overhangs. Given an absence of internal crypto narratives in the short term, we expect the price action to remain choppy through 3Q24 but believe the start of Fed rate cuts in September will support more interest in long duration assets in 4Q24.

However, one aspect of the market outlook that’s hard to capture in our models is investor sentiment, which could be affected by a shifting US regulatory environment. For example, recall that in late June, the US Supreme Court ruled on two cases that overturned Chevron Deference, which could affect the authority of executive agencies. Indeed, while we identified the upcoming US elections as an important potential driver for crypto markets, especially as we get closer to November, we had trouble adequately representing this dynamic quality within our quantitative frameworks.

Appendix

Explanations

We use both multivariate linear regression and random forests models in this report. Multivariate linear regression is a statistical analysis technique we used to explain the relationship between the token price (of BTC, ETH and SOL) and several independent endogenous and exogenous crypto variables. It works by fitting a linear equation to observed data, under the assumption that there’s a linear relationship between the daily price returns and the changes in these independent variables.

What are random forests? To answer this question, we first need to talk about decision trees, which are the building blocks of random forests. Essentially, decision tree analysis is an iterative process that takes a dataset and splits the data into smaller subsets (branches) according to the relevance that certain information has for predicting outcomes. Random forests create a pool of decision trees to maximize information gain and produce “feature importance” values, which are summed across this collection of decision trees.

For our purposes, tree splits in our dataset are characterized by their mean decrease in impurity (variance). That is, our model is designed for a regression problem (as opposed to a classification problem) and “learns” by computing how much a given variable contributes to reducing weighted impurity.
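
As a rough illustration of what "decrease in impurity" means for a regression tree, the sketch below computes the variance reduction from a single candidate split. The data and the feature names `fees` and `noise` are hypothetical toy values, not inputs from our models; a real forest evaluates many such splits across many trees and aggregates the results.

```python
import numpy as np

def variance_reduction(y, left_mask):
    """Weighted decrease in variance from splitting y into two branches."""
    y = np.asarray(y, dtype=float)
    left, right = y[left_mask], y[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    weighted_child = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - weighted_child

# Toy data (hypothetical): returns are high when fees exceed 5, low otherwise.
fees = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
noise = np.array([0.3, 0.9, 0.1, 0.8, 0.2, 0.7])
returns = np.array([-2.0, -1.0, -1.5, 2.0, 1.5, 2.5])

informative = variance_reduction(returns, fees <= 5.0)
uninformative = variance_reduction(returns, noise <= 0.5)
print(f"split on fees:  {informative:.4f}")
print(f"split on noise: {uninformative:.4f}")
```

The split on `fees` removes most of the variance in `returns`, while the split on `noise` removes much less, so a tree would prefer the former and credit `fees` with the larger importance.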

Notably, multicollinearity should be minimized but may not be fully eliminated in a random forests model. That is, if we have two features A and B that are highly correlated, one tree might prioritize A and another tree might prioritize B based on sampling. With enough "trees", the impact of factors A and B in the final model will be distributed across the two. This distribution could be split evenly, skewed heavily towards A, skewed heavily towards B or anywhere in between. This split is unknown without further analysis and may impact the interpretation of results. Nevertheless, we try to minimize the issue of biased feature selection by fitting our model to the data through an iterative process of manually including and excluding variables and evaluating the strength of the outcomes.
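
The way importance gets distributed across correlated features can be seen in a deliberately simplified simulation. The sketch below is an assumption-laden toy: each "tree" is a one-split stump that sees a single randomly chosen feature and a bootstrap sample of rows, which is much cruder than a full random forest. Features A and B are nearly identical, C is unrelated noise, and only A actually drives the target.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
a = rng.normal(size=n)
b = a + rng.normal(scale=0.05, size=n)   # B is almost a copy of A
c = rng.normal(size=n)                   # unrelated noise feature
y = 2.0 * a + rng.normal(scale=0.1, size=n)
X = np.column_stack([a, b, c])

def best_split_gain(x, y):
    """Largest variance reduction achievable with one threshold on x."""
    ys = y[np.argsort(x)]
    total = len(ys)
    parent = ys.var()
    best = 0.0
    for k in range(1, total):
        left, right = ys[:k], ys[k:]
        child = (k * left.var() + (total - k) * right.var()) / total
        best = max(best, parent - child)
    return best

n_trees = 90
importance = np.zeros(3)
for _ in range(n_trees):
    rows = rng.integers(0, n, size=n)    # bootstrap sample of the rows
    feature = rng.integers(0, 3)         # each stump sees one random feature
    importance[feature] += best_split_gain(X[rows, feature], y[rows])

share = importance / importance.sum()
for name, value in zip(["A", "B", "C"], share):
    print(f"{name}: {value:.1%} of total importance")
```

Even though B adds no information beyond A, it collects a substantial share of the importance simply because some stumps are shown B instead of A. How the share divides between the two depends on the sampling, which mirrors the caveat above about interpreting correlated features.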

For a good online resource to better understand random forests, please see this content from the creator of random forests, Leo Breiman.

Model inputs

The inputs to our model include:

  • broad macroeconomic variables like economic surprise indices for U.S., Europe and China; U.S. 2Y and 10Y Treasury rates as well as U.S. breakeven inflation
  • other asset classes including the S&P 500, commodities, investment grade corporate bonds, DXY (multilateral USD index) and gold
  • sentiment indicators like the VIX index (for stocks) and the MOVE index (for bonds)
  • crypto specific factors like bitcoin hash rates, bitcoin ETF flows, active addresses, transaction counts, gas fees, supply growth (inflation), supply staked and total value locked

Note that the ability to backtest in this space can be complicated by the differences in the features’ market dynamics. For example, because crypto markets run 24/7, we need to use the right timestamp on their prices corresponding to the end-of-day prices of other asset classes.
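
One simple way to handle the timestamp issue is to sample the continuous crypto series at the same moment the traditional markets close. The sketch below is illustrative only: the 20:00 UTC cutoff (16:00 in New York during daylight saving time), the function name and the synthetic hourly prices are all assumptions, and market holidays are ignored.

```python
from datetime import datetime, timedelta, timezone

# Assumed cutoff for illustration: 20:00 UTC, i.e. 16:00 New York time
# during daylight saving. Not the cutoff used in the report.
CLOSE_HOUR_UTC = 20

def daily_close_snapshot(hourly_prices, close_hour_utc=CLOSE_HOUR_UTC):
    """Pick, for each weekday, the crypto price stamped at the close hour."""
    snapshot = {}
    for ts, price in hourly_prices.items():
        if ts.hour == close_hour_utc and ts.weekday() < 5:  # Mon-Fri only
            snapshot[ts.date()] = price
    return snapshot

# Hypothetical hourly prices from Fri 2024-06-28 through Mon 2024-07-01.
start = datetime(2024, 6, 28, 0, tzinfo=timezone.utc)
hourly = {start + timedelta(hours=h): 60_000.0 + 10.0 * h for h in range(96)}

closes = daily_close_snapshot(hourly)
for day, price in sorted(closes.items()):
    print(day, price)
```

Weekend observations are dropped so that each crypto price lines up with a trading day in the other asset classes; a production version would also need a holiday calendar and proper time zone handling across daylight saving changes.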

Interpretation of SHAP values

The importance of a given input or “feature” is defined here by its strength in explaining the observed price movements in BTC, ETH and SOL. For our random forests model, we employ a method known as Shapley Additive Explanations (SHAP) to quantify the relative importance of our inputs. The appeal of SHAP values is that the model results can be explained in a relatively straightforward way. SHAP values help us understand the impact of having a certain value for a given feature in comparison to the prediction we'd make if that feature took some baseline value.

SHAP values that tend towards zero reflect features of little predictive relevance in our model while values moving away from zero reflect features of increasingly higher importance. But what this measure is actually identifying are which inputs best refine (or “split”) our decision tree(s) along the most meaningful dimensions in order to learn the key relationships that could affect our predicted outcomes.
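
For intuition on how a Shapley value compares a prediction against a baseline, here is a brute-force sketch on a tiny hypothetical model. This is not how SHAP is computed for a random forest in practice (tree-based implementations use faster, specialized algorithms and a background dataset); it simply enumerates every combination of features for three inputs. The `toy_model` function, its coefficients and the zero baseline are assumptions for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are set to their baseline value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical toy model with an interaction between the first two inputs.
def toy_model(features):
    tvl, fees, vix = features
    return 3.0 * tvl + 1.0 * fees + 2.0 * tvl * fees + 0.0 * vix

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
for name, value in zip(["tvl", "fees", "vix"], phi):
    print(f"{name}: {value:.2f}")
```

The interaction term is shared equally between `tvl` and `fees`, the unused `vix` input receives zero, and the three values sum to the gap between the prediction and the baseline prediction. That additivity is what the "Additive" in SHAP refers to.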
