Growth, correlation, and causation

This Tutorial is being hosted in partnership with: NBCGIB/CCAM/PPGMC/UESC Núcleo de Biologia Computacional e Gestão de Informações Biotecnológicas, Centro de Computação Avançada e Modelagem, Programa de Pós-Graduação em Modelagem Computacional em Ciência e Tecnologia, Universidade Estadual de Santa Cruz, Bahia, Brazil.

Versions in other languages and an interactive dashboard of COVID-19 data can be found at

The following plot shows the number of confirmed cases of COVID-19 infection in the state of New South Wales in Australia, for each day during a period beginning the 4th of March.

Linear regression is often used to predict one variable based on another, but it assumes the data fall on close to a straight line, plus or minus some random variability.

The line that best fits the data is shown above in dark red. The fit is not very good! The relationship between the number of cases and the day is nonlinear: the points do not fall on a straight line. Here, linear regression is not very useful.
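To get a feel for why a straight line struggles with this kind of curve, here is a minimal sketch in Python. It does not use the NSW data; the series is made up (10 cases growing 20% per day for 30 days), and the helper name `fit_line` is only for illustration.

```python
# Hypothetical series, NOT the NSW data: 10 cases growing 20% per day.
days = list(range(30))
cases = [10 * 1.2 ** d for d in days]

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(days, cases)

# Residual = actual value minus the value the line predicts.
residuals = [y - (intercept + slope * x) for x, y in zip(days, cases)]

# For an upward-bending curve the line is too low at both ends
# and too high in the middle.
print(residuals[0] > 0, residuals[15] < 0, residuals[-1] > 0)
```

The systematic pattern in the residuals (positive, then negative, then positive again) is the signature of a nonlinear relationship, as opposed to random scatter around a line.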

Next, let’s consider the shape of the data more closely.

Below, some blue dots have been added to the plot.

If the number of cases had increased by 20% each day, then the black dots (the real data) would fall exactly on top of the blue dots. In other words, the blue dots show the number of cases that would have occurred if there had been a 20% increase in cases each day. The raw numerical (not percentage) change in cases gets larger each day. This is because 20% of a larger number is greater than 20% of a smaller number. As a result, the graph becomes steeper and steeper as the number of cases grows.
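The way such a 20%-per-day series is built can be sketched in a few lines of Python. The starting count of 100 is an assumption chosen for readability, and `grow` is a hypothetical helper name.

```python
def grow(start, daily_rate, n_days):
    """Return n_days + 1 values, each (1 + daily_rate) times the one before."""
    values = [start]
    for _ in range(n_days):
        values.append(values[-1] * (1 + daily_rate))
    return values

# Assumed starting count of 100 cases; 0.20 is the 20% daily growth from the text.
blue_dots = grow(100.0, 0.20, 5)

# Raw (not percentage) change from each day to the next.
daily_change = [today - yesterday
                for yesterday, today in zip(blue_dots, blue_dots[1:])]

print([round(v, 1) for v in blue_dots])
print([round(c, 1) for c in daily_change])
```

Even though the percentage is fixed, the raw daily change in this example climbs from 20 on the first day to roughly 41 on the fifth.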

If the number of cases does grow by the same percentage each day, this is referred to as “exponential growth”. Each person infected may pass the infection on to a certain number of others. Thus the change in cases reflects the number who are already infected, yielding a percentage increase each day rather than a constant increase in cases each day.

Exponential growth

“Exponential growth” refers to when something increases by the same percentage over successive time points. Below, the data have been replotted with the vertical axis as percent increase over the previous day rather than total number of cases.

The blue dots now form a horizontal line, because they were created by calculating a 20% increase compared to the previous day. The reason that virus infections can grow in this fashion is that each new person infected contributes an additional number of cases to the next day, by passing the virus on to others. Thus the new number of cases is a multiple of the previous day’s cases: the previous number plus some percentage of it.
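The calculation behind a percent-increase axis can be sketched as follows. Both input series are invented for illustration, and the function assumes no day has a count of zero.

```python
def percent_increase(counts):
    """Percentage change from each day to the next, for cumulative counts."""
    return [100 * (today - yesterday) / yesterday
            for yesterday, today in zip(counts, counts[1:])]

# Hypothetical counts growing by exactly 20% per day.
steady = [100 * 1.2 ** day for day in range(6)]
print([round(p, 1) for p in percent_increase(steady)])

# Hypothetical counts whose growth slows down.
slowing = [100, 120, 138, 152, 160]
print([round(p, 1) for p in percent_increase(slowing)])
```

The first list is flat at about 20, like the blue dots. The second list falls day by day, which is the pattern to look for when growth stops being exponential.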

For the real data (the black dots), this graph makes it obvious that after about 23 March (soon after flights from overseas were greatly reduced and social distancing began), the percentage increase was no longer similar from day to day. It began to dwindle.

The plot below shows, for each day, the percentage increase in confirmed cases from the previous day.

A logarithmic vertical axis

While the % increase vertical axis is useful, it unfortunately does not show the cumulative number of cases, only the change since the previous day. As a result, from that graph one can’t see how high the caseload has gotten. This is one reason that people often use a logarithmic vertical axis in situations of constant (or even approximately constant) percentage growth.

On a non-logarithmic (linear) axis, a constant increase in number (not percentage) of cases results in a straight line. The greater the daily increase, the steeper the slope of the line. In other words, constant addition day-by-day results in a constant slope on a graph.

When growth is exponential, we are interested in knowing what the number of cases is being multiplied by each day, not how much is being added to it. In the case of 20% growth, for instance, on each day the number of cases is the previous number plus 20%, which can be calculated by multiplying the previous day’s number of cases by 1.2.

Logarithms turn multiplication into addition, which means that on a logarithmic axis, stepping upward by a constant amount does not mean the number of cases has increased by the same number each time, but rather that it’s been multiplied by the same number each time.

As a result, our 20% growth points fall on a straight line, because each successive point is a result of multiplying (rather than adding) by a constant, namely 1.2.
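A short numerical check makes this concrete. The sketch below assumes a starting count of 100 and uses base-10 logarithms, though any base gives equal steps.

```python
import math

# Hypothetical series: 100 cases growing 20% per day.
cases = [100 * 1.2 ** day for day in range(6)]
log_cases = [math.log10(c) for c in cases]

# Day-to-day steps on a linear axis versus on a logarithmic axis.
raw_steps = [b - a for a, b in zip(cases, cases[1:])]
log_steps = [b - a for a, b in zip(log_cases, log_cases[1:])]

print([round(s, 2) for s in raw_steps])   # keeps growing
print([round(s, 4) for s in log_steps])   # the same every day
print(round(math.log10(1.2), 4))          # the size of that constant step
```

Each step in the logarithm equals log10(1.2), so equal spacing in time gives equal spacing in height, which is exactly what a straight line is.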

In the plot, notice that the vertical axis labels do not mark out equal intervals. That is, taking equal steps up does not correspond to adding the same number over and over. Rather, taking equal steps upward corresponds to multiplying by the same number each time, a number that reflects the step size. Notice the interval between successive y-axis labels, for example: each label is the previous label multiplied by approximately three.

The above plot shows that any particular daily growth rate (multiplication by a particular factor) results in a straight line of corresponding slope.

Comparing the actual data from NSW (black dots) with the blue dots, we can see that the number of cases was growing exponentially at nearly 20% per day for some time. After 28 March, growth was slower. The data after 28 March no longer increase as steeply as the dots that indicate 20% growth.

Exponential growth is not always bad. Investments often yield exponential growth, because on average their value increases by some percentage each year. For example, the U.S. stock market increased on average 14% a year between 2000 and the end of 2019. If your parents had invested $2000 for you in the year 2000, at the beginning of 2020 it would have been worth about $27,000.
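The investment figure can be checked with one line of arithmetic. This sketch takes the 14% average and the $2000 starting amount from the text and assumes growth is compounded once per year for 20 years.

```python
initial = 2000          # dollars invested in the year 2000 (from the text)
annual_growth = 0.14    # 14% average yearly increase (from the text)
years = 20              # assumed: start of 2000 to start of 2020

# Compounding: multiply by 1.14 once for every year that passes.
final_value = initial * (1 + annual_growth) ** years
print(round(final_value))
```

The result is a little over $27,000, in line with the figure quoted above. Real returns vary from year to year, so this is only the constant-rate idealization.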

Another way to get an intuition for what a particular growth rate means is to realize that each growth rate corresponds to a doubling every time a particular number of days elapses. The 14% growth rate of the stock market during the first twenty years of the century, for example, meant a doubling approximately every five years. Here I’ve added the doubling times to the 10, 20, and 30 percent growth rates.
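Doubling time follows directly from the growth rate: it is the number of periods n for which (1 + rate)^n = 2, so n = log(2) / log(1 + rate). A minimal sketch, with `doubling_time` as a hypothetical helper name:

```python
import math

def doubling_time(growth_rate):
    """Periods needed to double at a constant growth rate per period."""
    return math.log(2) / math.log(1 + growth_rate)

for rate in (0.10, 0.20, 0.30):
    print(f"{rate:.0%} per day: doubles about every {doubling_time(rate):.1f} days")

print(f"14% per year: doubles about every {doubling_time(0.14):.1f} years")
```

The unit of the answer matches the unit of the rate: a daily rate gives days, a yearly rate gives years.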

Population density and flu mortality

Hoffman & Cox plot data from the 1918 flu pandemic. Their plot, below, shows the mortality rate against the population density, with one data point for each county in Kansas and Missouri.

Not only is the vertical axis logarithmic on this plot, but also the horizontal axis is logarithmic. But you don’t need to worry about that to answer the following.


Inferring causation

When a correlation is evident from a scatterplot of Y against X, people tend to infer that X caused Y.

The correlation of Y with X is by definition the same as the correlation of X with Y, so in principle a plot like the one above is just as consistent with Y causing X as with X causing Y. However, in some cases a causal link is much more plausible in one direction than in another. For example, it is plausible that higher population density causes a higher rate of flu deaths. A more detailed causal model is that population density causes more frequent physical contact between people, which causes more flu transmission, which causes more deaths.

But could Y cause X? That is, could flu deaths cause higher population density? That’s not very plausible, illustrating that with some pairs of variables, causation in only one of the directions is likely.

There is also the possibility that a third variable, Z, causes X and Y.
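A small simulation can illustrate the third-variable case. Everything here is synthetic: Z is random noise, and X and Y are each built from Z plus their own independent noise, so neither X nor Y has any effect on the other. The helper `pearson` is a plain implementation of the usual correlation coefficient.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 5000

# Z drives both X and Y; X and Y never influence each other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

r_xy = pearson(x, y)

# Remove Z's contribution from each variable and correlate what is left.
r_without_z = pearson([xi - zi for xi, zi in zip(x, z)],
                      [yi - zi for yi, zi in zip(y, z)])

print(round(r_xy, 2), round(r_without_z, 2))
```

X and Y come out clearly correlated (around 0.5 in this setup), yet the correlation all but disappears once Z is accounted for. In real data Z is usually not known so neatly, which is what makes confounding hard to rule out.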

Many points means more evidence

Ideally, a scatterplot will have lots of data points. If those points form a fairly consistent pattern of Y increasing or decreasing with X, this provides a lot of evidence for a correlation, which is what one wants before making the further leap to imputing a causal relationship.
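The value of having many points can be illustrated with a simulation in which X and Y are generated independently, so any correlation that appears is pure chance. The function name, the 0.8 threshold, and the number of trials are arbitrary choices made for this sketch.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def share_of_strong_correlations(n_points, n_trials=2000, threshold=0.8, seed=1):
    """Fraction of random, unrelated datasets whose |correlation| exceeds threshold."""
    rng = random.Random(seed)
    strong = 0
    for _ in range(n_trials):
        xs = [rng.gauss(0, 1) for _ in range(n_points)]
        ys = [rng.gauss(0, 1) for _ in range(n_points)]  # independent of xs
        if abs(pearson(xs, ys)) > threshold:
            strong += 1
    return strong / n_trials

print(share_of_strong_correlations(4))     # four points per dataset
print(share_of_strong_correlations(100))   # one hundred points per dataset
```

With only four points, a substantial share of completely unrelated datasets (around one in five in this setup) show a strong-looking correlation. With a hundred points, that essentially never happens.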

A plot from the news media

This graphic, a plot plus annotations, was produced by John Burn-Murdoch of the Financial Times.

The plot has a logarithmic vertical axis, so going up a particular distance corresponds to multiplying by a particular factor. One feature you haven’t seen before, perhaps, is that the axis labels are not at equal multiplicative intervals. They have been positioned at round numbers.

Doubling every two days, as U.S. cases were for the first two weeks here, corresponds to a 41% daily growth rate.
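The 41% figure comes from asking which daily multiplier, applied twice, gives a doubling: the square root of 2, or about 1.41. A minimal sketch of the general conversion, with a hypothetical function name:

```python
def daily_growth_from_doubling(doubling_days):
    """Daily growth rate (as a fraction) implied by a given doubling time in days."""
    return 2 ** (1 / doubling_days) - 1

print(f"{daily_growth_from_doubling(2):.0%}")   # doubling every two days
print(f"{daily_growth_from_doubling(3):.0%}")   # doubling every three days
```

Note that doubling every two days is not a 50% daily rate: growing 50% twice in a row would multiply the count by 2.25, not 2.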

Here is another version of the graphic. Someone on Twitter has scrawled something on it.

Causal inference

The person who scrawled on this figure is encouraging people to make the causal inference that wearing a mask reduces the spread of COVID-19 infection.

In order to be confident that a statistical correlation is present between two things, many data points are usually needed, which are typically plotted on a scatterplot. After establishing a statistically significant correlation, one may make the further leap to a causal model, keeping in mind the pitfalls of inferring causation from correlation.

The inference from this plot that masks cause less spreading of infection is based on only four instances of mask-wearing countries. The fewer the instances, the more likely it is that a third variable could explain the difference between the two sets of countries.

Certain statistical techniques can help estimate the various roles of different variables that contribute to a correlation.

The person who tweeted the scrawled-on plot got a lot of responses. Here is one that points out some of the third variables that confound the comparison between the two sets of countries.

There are almost always many potentially-relevant differences between countries, from genetic to cultural to environmental and governmental. As a result, the most likely explanation for the lower infection growth rate of the blue-circled countries is a combination of causes, one of which may be masks.

Someone else had a more creative explanation of the differences between the two groups of countries:

This graphic seems to suggest that ingestion of bubble tea inhibits the virus!

While that is unlikely, it makes the point that there are typically many differences between two groups of countries, only some of which you are likely to think of at first.

Here is another response from Twitter:

Dario claims that in addition to the blue-circled countries, Italians have also been wearing masks a lot, yet their infection rate has not slowed. Others have pointed out that people in China also began wearing masks at a high rate, yet their infection rate did not slow until much later.

Such suggestions highlight the fact that you should be skeptical of the claims of a random person on the internet, even if their claim comports with your intuition.

People tend to assume that numbers plotted on the same axis all mean the same thing. But can the count of cases from one country really be compared so directly to that of another country?

As Christophe Toukam implies above, countries vary in their testing policies. Moreover, these testing policies change over time, which can contribute to different growth rates for different countries.

A fourth response to the scrawled-on plot, below, does not argue with the claim. Rather, it asserts that there is independent reason to believe the claim that masks slow infection:

Although the scrawled-on plot by itself provides only very weak evidence that masks are effective, reasons to believe a claim can also come from other places.

During an exponentially-growing pandemic, one doesn’t have time to wait for strong evidence. Wearing masks is a plausible cause of reduced transmission, so it may be a good policy to adopt even if empirical evidence for its effectiveness is weak.