We often get research students visiting us to get help with analysing their data, even though it is not actually our job to help them and we are not formally qualified to help either. But I still sit with them and listen to their woes and give what advice I can, because I know how little support for statistics there is at this university.

Anyway, a lot of them come with their data all neatly organised into summary statistics: they have gone to quite some effort to calculate the mean or percentage in each group and present it in a lovely little table. They then ask how they can get their statistical analysis program to compare the means or percentages. Unfortunately, it’s at this stage I need to tell them that those summary statistics are of little use and what their stats program needs is the original big list of data. Indeed, their stats program could have quite simply calculated those summary statistics for them from the raw data.

The poor students are usually quite crestfallen at this point because they really believed they were being helpful, and because they feel that all their hard work has been in vain. After the sting has worn off they are actually very surprised to learn that it’s the original data that the statistician and the stats program need. I long ago ceased to be surprised by their surprise, but still I wondered *why* they were surprised.

I think I’ve come up with two plausible explanations.

One of the explanations might be the way that results are presented in published research. If you pick up a research article in almost any discipline and flick to the results section, what you will see is a table listing the means or percentages in each group, with a p-value attached to tell you how strong the evidence of a difference is. If you look at the analysis section, the authors will say that they used this or that procedure to compare the means or compare the percentages. It’s not that surprising then that students think that it is the means themselves or the percentages themselves that are being directly compared in the statistical procedure, and that these values are the inputs the stats program needs.

Another reason might be that we are inadvertently sending this message in our own traditional introductory statistics courses. Usually, when we teach hypothesis testing, we focus very strongly on the null hypothesis, making sure the students carefully define the parameters of interest. And many of us teach them that the best way to choose which statistical procedure to do is to look at the null hypothesis they have made.

For example, in a situation where a numerical outcome might be different on average in two different situations, we make sure they always say “H_{0}: μ_{1} = μ_{2}” right at the beginning. Upon seeing this null hypothesis they are supposed to respond “Of course! The unpaired t-test.” And right there we have associated the statistical procedure with directly comparing means! And then the connection is strengthened because in this case the test statistic itself actually *is* calculated using summary statistics: the means, standard deviations and sample sizes of the two groups. So we teach them to think about means and proportions as the basis on which the statistical methods work.
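To make that last point concrete, here is a minimal sketch of the unpaired t statistic worked out from two raw lists of numbers. It assumes the equal-variance (pooled) version of the test, the data are made up purely for illustration, and `pooled_t_statistic` is a hypothetical helper name rather than a function from any stats package. Notice that the summary statistics appear only as intermediate steps that the computer works out for itself.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Student's unpaired t statistic (equal variances assumed), from raw data."""
    n_a, n_b = len(group_a), len(group_b)
    # The summary statistics are calculated here, from the raw lists.
    mean_a, mean_b = mean(group_a), mean(group_b)
    var_a, var_b = variance(group_a), variance(group_b)  # sample variances (n - 1)
    # Pool the two variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_error


# Made-up measurements, purely for illustration.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t = pooled_t_statistic(group_a, group_b)
degrees_of_freedom = len(group_a) + len(group_b) - 2
print(f"t = {t:.3f} on {degrees_of_freedom} degrees of freedom")
```

The student only has to supply the two lists. The means and variances they might have spent an afternoon tabulating fall out along the way.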

But the problem is that if you’re going to get a computer to do any part of the analysis then you might as well get it to do *all* of it. It’s much simpler to get the computer to calculate all of those summary statistics and the test statistics and p-values all in one go for you. In fact, most statistical programs do not even give you the option of starting with aggregate data. And worse than this, some of the most common procedures such as ANOVA most emphatically do *not* work on the aggregate data directly, but the calculation requires *all* the data. And let’s not even get into non-parametric procedures where the null hypothesis itself doesn’t even have parameters.
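The non-parametric case is perhaps the easiest one to illustrate. The sketch below uses made-up numbers and a hypothetical helper, `mann_whitney_u`, which counts pairs directly rather than calling any library. It builds two versions of a group that share exactly the same mean, yet the rank-based statistic comes out quite different, so a table of means could not have told the program what it needed to know.

```python
from statistics import mean


def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: the number of (a, b) pairs with a > b.

    Ties count as half a pair. This is a plain counting sketch,
    not an optimised or library implementation.
    """
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Made-up data: two versions of group A with the same mean (4.0).
group_b = [1.0, 2.0, 3.0]
group_a_v1 = [3.5, 4.0, 4.5]    # every value sits above all of group B
group_a_v2 = [0.5, 1.5, 10.0]   # one large value carries the mean

means_match = mean(group_a_v1) == mean(group_a_v2)
u_v1 = mann_whitney_u(group_a_v1, group_b)
u_v2 = mann_whitney_u(group_a_v2, group_b)
print(means_match, u_v1, u_v2)
```

Both versions would show up identically in a table of group means, but the test sees them very differently, because it works on the ordering of the individual values.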

It seems to me that by focussing so strongly on means and proportions in our research publications and statistics teaching, we are setting up a whole lot of people to waste their time finding aggregate statistics.

Now we can’t control the way data is presented in published research — it really does make sense to report the means of each group! However, we *can* control where we put the emphasis when we teach. I think that perhaps we could teach them to decide what stats to do based on the raw data, how it is organised, and the variables it contains. It’s only a theory, but I think that then they might expect that it is the raw data they need to bring to their statistician, and not the aggregates.

Thanks for thinking and posting about statistics. Having struggled through the first-year statistics course (2011), I now find myself more confused, not less. The biggest problem that I have is the lack of discussion around the limitations of statistics. There is also something fundamentally wrong with the way in which statistics is being taught. It almost feels like there needs to be an introduction to introductory statistics.