Correlation is not causation
Suppose you are presented with the following headline:
“Studies show that children in households with more books tend to, later in life, join the labor market in higher-paying jobs and have higher lifetime earnings.”
Sensible enough observation. Presented with this information and being interested in the success of my own hypothetical children, what should I do?
i) Buy more books than I would have before I read this piece of news. After all, studies show that more books lead to better outcomes later in life.
ii) Do nothing differently.
Please take a minute to think about these two options...
Yes, let me argue that answer (ii) is the course of action supported by the presented evidence.
This headline, although suggestive, presents purely correlational data. Just to be clear, I'm not disputing the observation itself; it seems perfectly plausible to me that this is the case. What doesn't follow is the heavily implied causality. (Additionally, to be clear again, I'm not disputing the possibility that additional books are a good thing; I'm precisely disputing the notion that this evidence supports that assertion.)
There's a very precise methodology which is the gold standard to isolate causal relationships: randomized double-blind controlled experiments. What was presented, on the other hand, was an after-the-fact analysis of priors (books present in the household) when the outcomes (adult-life earnings) were already known. These are two different beasts.
When we take posteriors as a selection criterion for inferring anything about priors, we can't discard the possibility that we fell for any of the following traps:
i) Inversion of cause and effect -- Priors and posteriors don't come with intrinsic labels; we assign them to the observations we are studying. In this case, though, adult-life outcomes clearly occur later in time than the books available in the childhood household. Therefore, we can reasonably discard the hypothesis that we are mistaking cause for effect.
ii) A third unconsidered factor that drives both observed effects -- Now this is what I think is the mechanism at play here. Say we are studying and measuring thing A -- how many books there were in the household during childhood -- and thing B -- adult-life career earnings. By looking at the data alone, we find a positive correlation between A and B: the samples with higher A tend to be the ones with higher B. What's conspicuously absent here? Factor C: the socioeconomic status of the household.
Now, by looking at A, B and C, it becomes clearer that there's a positive correlation between A and C, and a positive correlation between B and C.
Therefore, one can reasonably argue that, through this lens, A (more books in the household) is merely an outcome of C (higher household socioeconomic status), inasmuch as B (higher adult-life income) is also merely an outcome of C -- likely related to access to higher-quality education and/or higher-earning job opportunities.
Again, I'm not arguing that C is the driving factor here (that's a positive assertion). I'm arguing that just by looking at A and B you can't discard a C, whatever C may be!
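The confounding mechanism above is easy to see in a toy simulation. Everything here is an illustrative assumption of mine -- a hypothetical linear model where C drives both A and B, and A has no direct effect on B whatsoever -- yet A and B still come out strongly correlated:

```python
import random

random.seed(0)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical model: C (socioeconomic status) drives both
# A (books in the household) and B (adult earnings).
# Crucially, A never appears in the formula for B: no causal link.
n = 10_000
C = [random.gauss(0, 1) for _ in range(n)]
A = [c + random.gauss(0, 1) for c in C]  # books ~ status + noise
B = [c + random.gauss(0, 1) for c in C]  # earnings ~ status + noise

print(f"corr(A, B) = {pearson(A, B):.2f}")  # clearly positive despite no causal link

# Subtracting out C leaves only the independent noise terms,
# and the apparent relationship between A and B evaporates.
resid_A = [a - c for a, c in zip(A, C)]
resid_B = [b - c for b, c in zip(B, C)]
print(f"corr(A, B | C) = {pearson(resid_A, resid_B):.2f}")  # near zero
```

Buying your simulated children more books (bumping A) would change nothing about B here, because B only ever listens to C. That's the entire trap in four lines of model.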
These are some of the perils of correlational data. If you can't precisely control before-the-fact what's changing between the groups you're studying, you can't rule out that the effects you observe are a consequence of your selection of priors and posteriors.
Understanding how this works is essential to grasping the limits of what the evidence supports, what studies claim, and what studies can claim. Faulty understanding of these dynamics, amplified by media, can lead to well-intentioned but ineffective public policy, followed by public mistrust in scientific institutions.
---
PS: Here's another somewhat contentious issue: so-called “data-driven” return-to-office (RTO) policies.
Suppose you are the head of a 10-thousand-person organization. With the COVID-19 pandemic long in the rearview mirror, you are faced with the fact that (say) half of your workforce is working from home, and the other half is working at the office.
Reasonably enough, if everyone were either working from home or working in the office, there would be less friction. For hybrid teams, some inefficiencies arise -- though no one in particular is at fault -- purely as a product of mixed in-office and fully-remote dynamics.
To decide upon the best course of action, you run a survey. Two questions: 1. How many weekdays do you go to the office per week (1, 2, 3, 4 or 5); and 2. How engaged and energized do you feel at work, from 1 (not at all) to 5 (very energized).
The results come in and the data is clear: people who work three or more days per week at the office report they are more energized and engaged at work!
Well, well, well, finally some lever I can pull for my 10-thousand-strong workforce. I then proceed to enforce an at-least-three-days-per-week in-office policy.
Let me ask you, where's the fault in this reasoning?
First of all, people in this study weren't randomly assigned to work in the office or from home. The prior -- where people work from -- is self-selected. We can reasonably assume that people who are more engaged with their work (for whatever reason) choose to go to the office more often.
If my stated goal is to make my workforce more engaged and energized, I should realize that mandating office attendance is, at the very least, unsupported by the evidence I have collected.
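Here's a toy sketch of that self-selection story. The model is entirely my own assumption -- engagement drives attendance, not the reverse -- but it reproduces the survey finding exactly, and then shows the mandate pulling the wrong lever:

```python
import random

random.seed(1)

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical model: engagement E is the cause; office days D merely
# track it with some noise. Causality runs E -> D, not D -> E.
n = 10_000
E = [random.randint(1, 5) for _ in range(n)]  # engagement, 1 (low) to 5 (high)
D = [max(1, min(5, e + random.choice([-1, 0, 1]))) for e in E]  # days in office

# The survey finding: workers at the office 3+ days report higher engagement.
office = [e for e, d in zip(E, D) if d >= 3]
remote = [e for e, d in zip(E, D) if d < 3]
print(f"survey: office {mean(office):.2f} vs remote {mean(remote):.2f}")

# The mandate: force everyone to at least 3 days. Attendance changes...
D_mandated = [max(d, 3) for d in D]
# ...but engagement, which was the cause all along, doesn't move at all,
# because nothing in the model lets D feed back into E.
print(f"engagement after mandate: {mean(E):.2f} (unchanged)")
```

The survey correlation is real, and the policy built on it still does nothing, because the arrow points the other way.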
Suppose, in an alternative scenario, that as the head of this organization I realize that people who self-select to work from home have reasons for doing so: caring for small children, avoiding a long commute, being less subjected to noise from an open-plan office. By allowing people to self-select for the working environment that better suits them, I'm improving employee satisfaction and engagement, as well as my ability to retain sought-after talent who can choose to work elsewhere.
Oh, well.