Sunday, April 21, 2013

Neuroscience, statistical power and how to increase it

There has been quite a bit of buzz recently about the Button et al. Nature Reviews Neuroscience paper on statistical power. Several similar reviews have been published in psychology and other disciplines and come to broadly the same conclusion: most studies are underpowered. The main difference with the Button et al. study is that they don't just find that typical studies are underpowered to detect the average size of effect in a field; they find extremely low power in neuroscience research (around 20%, and below 10% for some subfields). Contrast this with a typical review from psychology and related disciplines. Sedlmeier and Gigerenzer (1989) report power to detect a medium effect size ranging from 37% to 89%. David Clark-Carter (1997) reviewed papers in the British Journal of Psychology and found power to detect a medium effect of 59%. Thus the power of typical research in psychology is not that high, but (if we make fairly reasonable assumptions about the size of typical effects in the discipline) estimates appear to be around 60% rather than the 20% found in the Button et al. paper.

What caught my interest, however, was some of the responses to the publication in blogs and blog comments. For example, one of the comments on Ed Yong's piece stated:
Another argument for parallel recording. Traditional, one-neuron-at-a-time neurophysiological papers study 10s of neurons. Multi-electrode studies have 100s or 1000s of neurons. Enough power? Maybe not, but way more power than single neuron recording.
A similar sentiment arises in Matt Wall's piece:
MRI scanners have significantly improved in the last ten years, with 32 or even 64-channel head-coils becoming common, faster gradient switching, shorter TRs, higher field strength, and better field/data stability all meaning that the signal-to-noise has improved considerably. This serves to cut down one source of noise in fMRI data – intra-subject variance. The inter-subject variance of course remains the same as it always was, but that’s something that can’t really be mitigated against, and may even be of interest in some (between-group) studies. On the analysis side, new multivariate methods are much more sensitive to detecting differences than the standard mass-univariate approach.

Matt's piece is thoughtful and I agree with much of what he writes, but the idea that increasing the number of observations within a person will do much to resolve the problem is probably not correct (for reasons that Matt himself mentions). To understand why, consider the typical nature of the experimental designs being used. As I understand it, there are essentially two main types of design: a nested repeated measures design or a factorial design with fully crossed random effects. There are many variants (e.g., additional layers of nesting, additional fully crossed random factors), but these two cases capture the characteristics of most of the designs I'm familiar with in cognitive neuroscience (and possibly in many other areas of neuroscience).

In a nested repeated measures design there are m measurements within each of n persons. The measurements within a person are correlated in some way, so in general the design has an effective sample size that is less than N (where N = n * m). It turns out that for most such designs the limiting factor for power and precision is n, not m or N.
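To make that concrete, here is a minimal sketch using the standard design-effect formula, assuming a simple compound-symmetry model in which every pair of trials within a person shares the same intraclass correlation. The ICC and the sample sizes are just illustrative assumptions, not estimates from any real study.

```python
# Effective sample size under a compound-symmetry (constant ICC) model.
# All numbers below are made up for illustration.

def effective_n(n_persons, m_trials, icc):
    """Design-effect formula: N_eff = (n * m) / (1 + (m - 1) * icc)."""
    return (n_persons * m_trials) / (1 + (m_trials - 1) * icc)

if __name__ == "__main__":
    icc = 0.3  # assumed correlation between trials within a person
    for n, m in [(20, 50), (20, 500), (20, 5000), (40, 50), (80, 50)]:
        print(f"n = {n:3d}, m = {m:5d} -> effective N = {effective_n(n, m, icc):6.1f}")
```

Multiplying the number of trials per person by 100 barely shifts the effective sample size (it can never exceed n / ICC, however many trials you run), whereas doubling the number of people roughly doubles it.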

This isn't always true, but experimental designs generally get refined quite quickly to reduce the impact of sources of error in the repeated measurements. This could be by increasing the number of trials, by tightening up the experimental procedures (e.g., instructions, quality of materials), or through technical advances that reduce measurement error at each measurement occasion. Once measurement error per trial is moderately low, reducing it further has very little impact on power. That's because the error at each measurement occasion includes transient error that can't really be eliminated (many behaviours are just inherently variable from occasion to occasion), and because as you reduce these errors the other sources of error in the study become the main limiting factors on power.
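The same logic shows up in the standard error of a person-level mean under a simple two-level model: the between-person term sets a floor that only a larger n can lower. The variance components below are assumptions chosen purely for illustration.

```python
# SE of a grand mean under a two-level model: var = sd_between^2 / n +
# sd_within^2 / (n * m). Shrinking trial-level noise has diminishing returns.
import math

def se_of_mean(n_persons, m_trials, sd_between, sd_within):
    return math.sqrt(sd_between**2 / n_persons +
                     sd_within**2 / (n_persons * m_trials))

if __name__ == "__main__":
    n, sd_between = 30, 20.0                   # assumed values (ms)
    for sd_within in (100.0, 50.0, 10.0):      # progressively cleaner trials
        for m in (40, 400):
            print(f"sd_within = {sd_within:5.1f}, m = {m:3d} -> "
                  f"SE = {se_of_mean(n, m, sd_between, sd_within):5.2f} ms")
    # Beyond a point the between-person term dominates, and only a larger n
    # reduces it.
```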

For example, when I was a PhD student many reaction time experiments used computers with dodgy clocks that couldn't time more accurately than 1/60th of a second, or around 17 ms (and perhaps many still do). If you are looking for a priming effect of, say, 30 milliseconds this would seem like a major problem. However, you can get pretty accurate inferences without much bias or loss of power as long as the variability of the RTs is fairly large, which it generally is (Ulrich & Giray, 1989); the simulation sketch below makes this point concrete.

For most neuroscience work involving humans, the limiting factors on power (once you are dealing with a reasonably refined experimental set-up) are therefore related to n. A further consideration is that the top-level n generally needs to be in the 30-50 range or (preferably) greater just to get vaguely reasonable estimates of the variances and covariances if you are dealing with data sampled from approximately normal distributions. Smaller samples also make the study more vulnerable to an atypical 'outlier' at the person level (e.g., a participant using a weird strategy or responding randomly) or to selective bias by the experimenters (e.g., dropping a 'noisy' participant because they go against the hypothesis). Having a small n at the top level may also make a focus on statistical significance more attractive than interval estimates of effects (because it reduces the precision of estimation). In other words, it encourages studies that find 'evidence' of an effect and discourages focus on accurate estimates of the size of an effect.
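Here is that sketch: a quick Monte Carlo check of the clock-resolution point, using assumed numbers (a 30 ms priming effect, trial-to-trial RT standard deviation of 150 ms, 16 participants, 20 trials per condition) and a paired t-test on person-level means.

```python
# Monte Carlo sketch of the Ulrich & Giray (1989) point: a coarse 17 ms clock
# barely changes power when RTs are highly variable. All parameters are
# illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)

def run_study(n=16, m=20, effect=30.0, sd_trial=150.0, sd_person=40.0, tick=None):
    """Simulate one within-person priming experiment; return the p-value of a
    paired t-test on person-level condition means."""
    person = rng.normal(0.0, sd_person, size=(n, 1))   # stable person speed
    unprimed = 600.0 + person + rng.normal(0.0, sd_trial, size=(n, m))
    primed = 600.0 - effect + person + rng.normal(0.0, sd_trial, size=(n, m))
    if tick is not None:                               # crude clock: truncate to ticks
        unprimed = np.floor(unprimed / tick) * tick
        primed = np.floor(primed / tick) * tick
    return stats.ttest_rel(unprimed.mean(axis=1), primed.mean(axis=1)).pvalue

n_sims = 2000
power_exact = np.mean([run_study(tick=None) < 0.05 for _ in range(n_sims)])
power_coarse = np.mean([run_study(tick=17.0) < 0.05 for _ in range(n_sims)])
print(f"power with millisecond-accurate timing: {power_exact:.2f}")
print(f"power with a 17 ms clock:               {power_coarse:.2f}")
```

With RT variability this large, truncating to 17 ms ticks adds only about 5 ms of extra noise per trial (roughly 17/sqrt(12)), which is negligible next to the other sources of error, so the two power estimates come out essentially the same.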

For fully crossed random factor designs the situation is worse. In these designs you sample both people and stimuli (e.g., faces, words, etc.) from a large (conservatively assumed to be infinite) population. The limiting factor on power now probably depends not on n1 (the number of people) or n2 (the number of stimuli) alone but on the smaller of n1 and n2 (assuming you want to make inferences that generalise to people and stimuli not in your experiment). Thus having 1000 people has little effect on power if your study uses only two faces (and you want to make general inferences about face perception rather than perception of those two faces). This is a slight oversimplification, as it assumes that the stimuli and people are equally variable in terms of what you measure, but it is a good rule of thumb unless variability in either people or stimuli is large enough to swamp the other source.
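A back-of-the-envelope sketch of why the smaller of n1 and n2 dominates, using a simple and purely illustrative variance-components approximation for a condition effect when both people and stimuli are treated as random (the variance values are assumptions, not estimates from any study):

```python
# Approximate SE of a condition effect with crossed random people and stimuli:
# var ~ var_person/n1 + var_stimulus/n2 + var_resid/(n1 * n2).
import math

def se_condition_effect(n_people, n_stimuli, var_person, var_stimulus, var_resid):
    return math.sqrt(var_person / n_people +
                     var_stimulus / n_stimuli +
                     var_resid / (n_people * n_stimuli))

if __name__ == "__main__":
    vp, vs, ve = 1.0, 1.0, 4.0   # assumed: people and stimuli equally variable
    for n1, n2 in [(20, 20), (1000, 2), (40, 40), (1000, 40)]:
        print(f"people = {n1:4d}, stimuli = {n2:3d} -> "
              f"SE = {se_condition_effect(n1, n2, vp, vs, ve):.3f}")
```

With only two stimuli, even 1000 people leave the stimulus term (var_stimulus / n2) as a hard floor on precision; balancing the two sample sizes is far more effective than piling observations onto one of them.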

There is also an important caveat here - I'm assuming that you do the statistics correctly. Many, many studies still analyse fully crossed random factor designs as if they are nested, resulting in spuriously high power (see here for an earlier blog post on this).

This analysis should hold whenever: i) the basic experimental procedure is fairly well refined, and ii) variability between people (or stimuli, in appropriate designs) on the measures of interest is non-negligible. Thus it should hold more often than not in psychology and related areas of neuroscience. There are undoubtedly subfields in which it won't hold (e.g., some areas of vision research where n = 2 studies are common because individual differences on the crucial effects are low).

Postscript

One objection to my conclusion is that if power in neuroscience is limited by the number of participants and the number of stimuli, why do small samples persist? This is a good question. I offer three main answers: i) as in psychology (where power is also generally low, remember), you can have low power for each individual test and still have a good chance of finding something if you run multiple tests. Maxwell (2004) pointed out that a typical 2 x 2 factorial design might have only 50% power per test, but that implies an 87.5% chance of at least one significant result. Thus low power generally produces something statistically significant (though it also predicts that replications will generally fail to show consistent patterns of statistical significance); ii) researcher degrees of freedom (see Simmons et al., 2011); and iii) many research teams run many small studies (e.g., undergraduate and masters projects), so in some cases there are many unreported studies with null results.
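The arithmetic behind point i) is easy to check, assuming for simplicity three independent tests (two main effects and the interaction) each with the same power:

```python
# Probability of at least one significant result across k independent tests.
def p_at_least_one(power_per_test, n_tests):
    return 1 - (1 - power_per_test) ** n_tests

print(p_at_least_one(0.5, 3))   # 0.875, the Maxwell (2004) figure
print(p_at_least_one(0.2, 3))   # ~0.49: even 20% power often 'finds' something
```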

References

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: causes, consequences, and remedies. Psychological Methods, 9, 147-163.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

Ulrich, R., & Giray, M. (1989). Time resolution of clocks: Effects on reaction time measurement - Good news for bad clocks. British Journal of Mathematical & Statistical Psychology, 42, 1-12.
