Explaining vs. Predicting in practice

Dennis Meier
7 min read · Sep 12, 2021

The 2010 paper “To Explain or to Predict?” by Galit Shmueli makes an interesting statement in its abstract:

Conflation between explanation and prediction is common, yet the distinction must be understood […]

In this story we take a practical deep dive: What exactly is the difference? And must we really care all that much as data scientists?

Most data scientists and analytical people go through their education in statistics and machine learning (ML) without too much focus on the field’s philosophical underpinnings. But knowing some fundamental principles about your profession and daily tools can be valuable. So let’s have a look at this one…

Explaining vs. predicting. There are fundamental differences between the two when it comes to methodology and the use of algorithms. Knowing these differences will help you pick the right tools at the right moment and avoid confusion. This can be useful in three ways:

  1. To consciously decide what method is required for your problem.
  2. To understand what another analyst might be getting wrong.
  3. To know if the difference matters at all for your job.

Classical Stats vs. Machine Learning?

You might already know one cliché difference between the two: predicting is what “ML people” do; explaining is what “economists and sociologists” do. If you’re a data scientist with friends from an economics or sociology background, you can sometimes tell the difference between how they learned and use statistics vs. how a modern data science curriculum covers the topics. But let’s dig deeper.

Predicting aims at future events by learning from past events, with the main focus on stable predictions going forward. Explaining, on the other hand, focuses on revealing causal relationships from observations made in the past, with attention on the systems and interactions at play.

And yet the two are often conflated. People assume a model that is good at explaining past behavior must automatically be good at predicting future behavior as well. Unfortunately, that’s not always the case.

Let’s go through the differences at a high level and uncover, for both prediction and explanation, the underlying

  • “why”: the reasons for choosing or doing explanation vs. prediction
  • “how”: the typical process when doing explanation vs. prediction
  • “what”: what typical outputs or results look like for the two

The why: Aiming at Hypotheses vs. Individual Cases

Explanatory modelling aims at confirming a causal theory consisting of one or more hypotheses. It answers questions such as “Can this theory be true, given the data?” or “Does smoking cause cancer?”. Examples of explanatory models are Newton’s mechanics or Darwin’s theory of evolution. They postulate clear hypotheses that can be tested against data.
When exploring data for explanatory modelling we look through a theory lens, focusing on the causal relationships specified upfront. We are mostly interested in data that directly reflects parts of the theory, or is at least very closely related to it.
A recently popular subset of explanatory modelling is causal inference (popularized by Judea Pearl), providing more solid underpinnings to the goal of proving cause and effect. And that’s exactly the main goal of explanatory modelling: to corroborate or confirm cause and effect.
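As a quick illustration of what such a causal diagram looks like in code, here is a sketch of the smoking question as a directed acyclic graph. This is purely illustrative; the “age” confounder is a hypothetical addition of mine, not something from the paper:

```python
import networkx as nx

# Hypothetical causal diagram for "Does smoking cause cancer?".
# "age" is an assumed confounder, added purely for illustration.
dag = nx.DiGraph()
dag.add_edges_from([
    ("smoking", "cancer"),
    ("age", "smoking"),
    ("age", "cancer"),
])

# A causal diagram must be acyclic: effects cannot cause their own causes.
assert nx.is_directed_acyclic_graph(dag)
print(list(nx.topological_sort(dag)))  # ['age', 'smoking', 'cancer']
```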

Predictive modelling aims at making good predictions about “unseen” events. It looks into the future, often disconnected from any theoretical underpinning, and is much more prospective.
When exploring data for predictive modelling, we are “mining the data in the dark”: looking for interesting relationships, anything that might help us make better predictions, without necessarily specifying hypotheses upfront.

The “best” models, e.g. Newton’s mechanics, do both explanation and prediction very well. They explain relationships between variables and at the same time perform very well on predictive tasks.
Unfortunately, as modelling settings become more involved and noisy (as is often the case in psychology, marketing, medicine, etc.), a model that adequately explains effects from a past study might not be adequate for predicting future events. Conversely, a well-trained ML model producing solid predictions might not be adequate for explaining causal relationships between variables in a way that helps shape a clear theory.
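Here is a toy sketch of one way this gap can show up: with many candidate effects and little data, a linear model can fit a study’s data perfectly yet generalize badly. Everything below is simulated purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy setting: many noisy candidate predictors, few observations.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 40))          # 40 candidate "effects"
y = X[:, 0] + rng.normal(0, 1.0, 60)   # only the first one is real

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)
lin = LinearRegression().fit(X_tr, y_tr)

# More predictors than training samples: the in-sample fit is perfect...
print(lin.score(X_tr, y_tr))  # in-sample R² = 1.0
# ...while the out-of-sample fit is poor (can even be negative).
print(lin.score(X_te, y_te))
```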

The two types of models can also co-exist and cross-stimulate each other. Strong effects identified by an explanatory model might make good features for a predictive model, while high-importance features from a predictive model might point towards interesting research directions for explanatory modelling. As theory and science progress around a topic, the best explanatory models and the best predictive models should converge.

The how: Causal Graphs vs. Arbitrary Input-Output

Explaining usually happens in these 5 steps:

  1. Coming up with a theory and corresponding hypotheses
  2. Drawing corresponding causal diagrams or relations
  3. Operationalising the diagram in terms of measurable variables
  4. Applying statistical algorithms (mainly regression) on a dataset from a study to test the hypotheses and causal diagram
  5. Reaching a conclusion (and potentially refining by starting again at 1.)

Note how there is no test data in this process. Normally, scientists work only on “the data” (usually collected during field work), estimate goodness of fit using R² or similar metrics, and then establish significance using hypothesis testing. All of this happens on the same data; there is no train-test split.
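Here is a minimal sketch of step 4 in Python with statsmodels, using simulated data for the height example that appears further below (all numbers are made up). Notice that goodness of fit and significance are both evaluated on the very same data the model was fit on:

```python
import numpy as np
import statsmodels.api as sm

# Simulated study data: is height driven by age or by number of siblings?
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(5, 18, n)
siblings = rng.integers(0, 5, n)
height = 80 + 5.5 * age + rng.normal(0, 8, n)  # siblings has no real effect

X = sm.add_constant(np.column_stack([age, siblings]))
model = sm.OLS(height, X).fit()  # fit on ALL the data, no train-test split

print(model.rsquared)  # goodness of fit (R²), measured on the same data
print(model.tvalues)   # large t-value for age, small one for siblings
print(model.pvalues)   # significance for each hypothesized effect
```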

Predicting involves no upfront theory crafting and no operationalisation of variables. We learn directly from the data:

  1. Discover data matching the output you’re interested in
  2. Prepare the data by formatting inputs (X) and outputs (Y)
  3. Train a general purpose algorithm (can be many different types, e.g. KNN, neural network or random forest) to learn the input-output relationship from a labeled data set.
  4. Test the model on previously unseen data (and potentially refine by starting again at 1.)

Note how no causal diagrams are involved; data scientists usually go directly into data sourcing and preprocessing. Theory crafting and operationalising are implicit and minimal. The goal is a quick, simple prediction.
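Here is a minimal sketch of that workflow with scikit-learn, again on simulated data (the feature matrix and target are made up for illustration). The defining step is holding out data the model never sees during training:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Simulated labeled data: X holds whatever features seemed useful, y the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] * 3 + X[:, 1] ** 2 + rng.normal(0, 0.5, 500)

# Step 2-3: split the data, then train a general-purpose algorithm.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Step 4: evaluate on previously unseen data.
print(mean_absolute_error(y_test, model.predict(X_test)))
```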

The what: Theory vs. Prediction

Both explaining and predicting are usually done in your standard analytics environments, often coded using standard tools such as Python and R. But the outputs are quite different.

Explaining will produce outputs such as causal diagrams, significance levels, accepted or rejected hypotheses, etc.

Typical output #1 (for explanation): A causal diagram drawn to help with theory & hypotheses.
Typical output #2 (for explanation): Height is driven largely by age (large t-value), not by number of siblings.

Predicting, on the other hand, will produce as its output a model we use for making predictions about new instances. It might also produce feature importance plots and cross-validation results to evaluate predictive performance.

Input & Output for a new instance fed to the model in a prediction setting.

The summary: Not so different after all?

The original paper mentions 40+ differences between explaining and predicting. Most of those seem inconsequential for a data scientist’s daily life. I listed the differences the paper makes in the table below. They are not so different after all… Let’s go through some of them (the ones highlighted in green).

Detailed list of differences between explanatory and predictive modelling.

Data Exploration: How is data explored?

Shmueli makes the case that data exploration for explanation is focused only on the relationships postulated by the theory one wants to test, while for prediction it’s more “interactive”.
But I’ve never done data exploration for prediction without an “implicit theory” in my mind about what could be interesting. In the end, in both cases you’re figuring out interesting patterns in the data based on a mental model of the system you’re investigating.

Disciplines: Who does prediction, who does explanation?

This is basically the good old stats vs. ML controversy. As a data scientist I’ve always felt it’s a bit of an artificial discussion. I consider the two the same, or at least twin brothers, especially since the statistics community has been moving towards “Computer Age Statistical Inference” and closer to ML in recent years.

Method: How is it done?

The differences in method were highlighted above. And indeed, this is really the largest difference between explaining and predicting, in my opinion. The two core points of distinction:

  • Explaining does not use a test set, while prediction relies heavily on test sets during model validation.
  • Explaining works with causal diagrams, while prediction often relies on “YOLO” feature engineering.

These two, and especially the first one, summarize the key difference the whole paper revolves around.

Purpose: Why is it done?

Although prediction is clearly different from explanation, we cannot get around the question of why we want to explain things in the first place. In most cases we explain to understand, and we want to understand to be able to predict future behavior.

Conclusion

Predictive and explanatory modelling are not so different after all. I see Galit Shmueli’s paper as a push for the scientific community to start leveraging ML and prediction methods (e.g. cross-validation) in their work, where overfitting and “test error” seem to have been largely neglected.

On the other hand, I believe it’s also worth thinking about opportunities to bring more explanatory modelling techniques into data scientists’ daily work. What’s important is to make a conscious choice at the start of a project with respect to the main goal: is it just predicting an instance, or is it identifying some key effects? If the latter, one should clearly follow scientific processes, including drawing causal diagrams and writing down proper hypotheses.

Sources

Shmueli, G. (2010). “To Explain or to Predict?” Statistical Science, 25(3), 289–310.
