Thursday, September 15, 2011

A problem of significance

Several people have drawn my attention to a recent article on a common error in published statistical analyses in neuroscience. Sander Nieuwenhuis, Birte Forstmann and Eric-Jan Wagenmakers have published (in Nature Neuroscience) a critique of statistical analyses in the neuroscience literature. The paper has already been written about by Ben Goldacre and by Andrew Gelman (who published an article on the general problem some time ago) - so I won't go into too much detail.

The point of interest for me is that the error concerns something most psychologists should know all about (and hence might be expected not to make). It concerns the case of two differences, one statistically significant and one non-significant. For example, group 1 may show a significant difference between an experimental drug condition and placebo, while group 2 does not. A naive interpretation is that the drug works for group 1 but not for group 2. This is not necessarily true. The proper test of whether the drug's effect differs between groups is an interaction test. Psychologists tend to avoid this error because we are heavily trained in ANOVA as undergraduates (certainly in the UK and probably also in the US and most of Europe). Even if we fail to learn this, reviewers and editors (in psychology) tend to spot the error.*
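To make the distinction concrete, here is a minimal sketch in Python (simulated data with made-up effect sizes and cell sizes; numpy and scipy assumed) contrasting the naive per-group tests with a direct test of the interaction contrast in a 2 by 2 between-subjects design:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: 2 (group) x 2 (drug vs placebo) between-subjects design,
# n per cell. The (made-up) true drug effect is 0.6 SD in group 1, 0.3 SD in group 2.
n = 20
cells = {
    ("g1", "drug"):    rng.normal(0.6, 1.0, n),
    ("g1", "placebo"): rng.normal(0.0, 1.0, n),
    ("g2", "drug"):    rng.normal(0.3, 1.0, n),
    ("g2", "placebo"): rng.normal(0.0, 1.0, n),
}

# The naive approach: a separate drug-vs-placebo test within each group
for g in ("g1", "g2"):
    t, p = stats.ttest_ind(cells[(g, "drug")], cells[(g, "placebo")])
    print(f"{g}: drug vs placebo, t = {t:.2f}, p = {p:.3f}")

# The proper question: is the drug effect bigger in group 1 than in group 2?
# Interaction contrast (difference of differences) with a pooled error term,
# equivalent to the interaction test in a 2 x 2 between-subjects ANOVA.
m = {k: v.mean() for k, v in cells.items()}
dod = (m[("g1", "drug")] - m[("g1", "placebo")]) \
    - (m[("g2", "drug")] - m[("g2", "placebo")])
ss_within = sum(((v - v.mean()) ** 2).sum() for v in cells.values())
df_error = 4 * n - 4
mse = ss_within / df_error
se = np.sqrt(mse * 4 / n)          # contrast weights (1, -1, -1, 1), each cell of size n
t_int = dod / se
p_int = 2 * stats.t.sf(abs(t_int), df_error)
print(f"interaction: t = {t_int:.2f}, p = {p_int:.3f}")
```

Depending on the simulated data, the per-group tests can easily disagree (one significant, one not) while the interaction test - the one that actually licenses a claim about the groups differing - does not reach significance, which is exactly the pattern Nieuwenhuis et al. warn about.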

Are psychologists then entitled to feel a little bit smug? Perhaps, but only a little. First, I think the reason we are relatively good performers on this point is that we tend to view many statistical analyses through an "ANOVA" lens. Factorial ANOVA (in which factors are orthogonal) includes the interaction term by default, and the 2 by 2 factorial ANOVA is the workhorse of experimental psychology. Our familiarity with this type of design and analysis makes the error easy to spot. Second, our ANOVA lens leads to other errors - notably dichotomizing continuous variables (e.g., via median split) in order to squeeze them into an ANOVA design. This decreases statistical power, and can - albeit infrequently - produce spuriously significant effects (see MacCallum et al., 2002). These errors are sometimes less serious than the difference of differences/interaction error (but are not harmless); a small simulation of the power cost follows below.
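Here is a small simulation sketch of that power cost (the effect size, sample size and number of replications are arbitrary choices of mine; numpy and scipy assumed): the same correlated data are tested once with the continuous predictor and once after a median split.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, rho, alpha, n_sims = 50, 0.3, 0.05, 5000
hits_cont = hits_split = 0

for _ in range(n_sims):
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)
    # Test using the continuous predictor
    _, p_cont = stats.pearsonr(x, y)
    # Test after a median split on x (two-sample t test on the two halves)
    hi = x > np.median(x)
    _, p_split = stats.ttest_ind(y[hi], y[~hi])
    hits_cont += p_cont < alpha
    hits_split += p_split < alpha

print(f"power (continuous predictor): {hits_cont / n_sims:.2f}")
print(f"power (median split):         {hits_split / n_sims:.2f}")
```

The median-split analysis detects the same underlying effect noticeably less often than the analysis that keeps the predictor continuous.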

The real test, then, is whether psychologists make the same (conceptual) error in a different context. The obvious context is that of association rather than difference. If males show a significant correlation between testosterone and aggression (e.g., r = .5, N = 25) and females don't (e.g., r = .3, N = 25), it does not follow that the correlation is significantly bigger for males than for females (and in this example it is not). To establish that, you'd need to construct a test or (better still) a confidence interval for the difference between the correlations. This is hardly ever done - and, in my experience, psychologists frequently make this kind of claim without backing it up.** Methods for testing differences between correlations are a bit fiddly (e.g., they depend on whether the measurements overlap or not), and are rarely taught at undergraduate or even postgraduate level. The methods that are taught are also often a bit dodgy (see Zou, 2007, for some better alternatives).
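For independent correlations the simplest test uses Fisher's z transformation, and Zou (2007) recommends a confidence interval for the difference instead. A sketch of both (my own implementation of the standard formulas, not taken from any particular package), applied to the hypothetical testosterone example:

```python
import numpy as np
from scipy import stats

def fisher_z_diff_test(r1, n1, r2, n2):
    """Test of the difference between two independent correlations
    via Fisher's z transformation."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * stats.norm.sf(abs(z))
    return z, p

def zou_ci_diff(r1, n1, r2, n2, conf=0.95):
    """Zou's (2007) confidence interval for the difference between
    two independent correlations."""
    crit = stats.norm.ppf(1 - (1 - conf) / 2)
    def ci(r, n):
        z, se = np.arctanh(r), 1 / np.sqrt(n - 3)
        return np.tanh(z - crit * se), np.tanh(z + crit * se)
    l1, u1 = ci(r1, n1)
    l2, u2 = ci(r2, n2)
    lower = (r1 - r2) - np.sqrt((r1 - l1) ** 2 + (u2 - r2) ** 2)
    upper = (r1 - r2) + np.sqrt((u1 - r1) ** 2 + (r2 - l2) ** 2)
    return lower, upper

# The testosterone/aggression example: r = .5 vs r = .3, both with N = 25
print(fisher_z_diff_test(0.5, 25, 0.3, 25))   # z is about 0.80, p is about .43
print(zou_ci_diff(0.5, 25, 0.3, 25))          # CI roughly (-.29, .68): spans zero
```

Neither the test nor the interval supports the claim that the correlation is bigger for males, even though one correlation is individually significant and the other is not.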

Also note that (in both cases) the error can work the other way. Two correlations could each be non-significantly different from zero and yet significantly different from each other (e.g., r = .5 and r = -.5 with N = 10).
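The same machinery makes the point for that example (again just a sketch; numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

def r_pvalue(r, n):
    """Two-sided p value for a single Pearson correlation (t approximation)."""
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), n - 2)

r1, r2, n = 0.5, -0.5, 10
print(r_pvalue(r1, n), r_pvalue(r2, n))   # each roughly .14: neither differs from zero
z = (np.arctanh(r1) - np.arctanh(r2)) / np.sqrt(2 / (n - 3))
print(2 * stats.norm.sf(abs(z)))          # roughly .04: but they differ from each other
```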

Postscript

There is, I think, a lesson or two here. A minor lesson is that interactions are a bit more complicated than psychologists (particularly those very familiar with ANOVA) often think. I could write more on this (and do a bit in my forthcoming book). A major lesson is that this concept - that the difference between a significant and a non-significant result is not necessarily itself statistically significant (see Gelman & Stern, 2006) - is probably quite tricky. It may be worth exploring why ... I suspect several factors are at work.

See updates here and here.

References


Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. American Statistician, 60, 328–331.

MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19-40.

Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E.-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14, 1105-1107.

Zou, G. Y. (2007). Toward using confidence intervals to compare correlations. Psychological Methods, 12, 399-413.



* Do I have any support for this position? Yes: anecdotal support (e.g., from editing or reviewing many dozens of papers) and some support from Nieuwenhuis et al., who found the error more prevalent in cellular and molecular neuroscience. ANOVA is core training in psychology and widely used in cognitive and behavioural neuroscience - and I'd argue that this reflects the influence of psychologists working in this area and of neuroscientists trained in and using similar methods.

** Do I have any support for this position? A little. It is easy to find basic psychology texts that cover ANOVA but never mention tests of differences between correlations. It is also rare to find tests or CIs for differences between correlations in published papers.
