Sunday, April 21, 2013

Neuroscience, statistical power and how to increase it

There has been quite a bit of buzz recently about the Button et al. Nature Reviews Neuroscience paper on statistical power. Several similar reviews have been published in psychology and other disciplines and come to broadly the same conclusion - that most studies are underpowered. The main difference with the Button et al. study is that they don't just find that typical studies are underpowered to detect the average size of effect in a field, but they find extremely low power in neuroscience research (around 20%, and below 10% for some subfields). Contrast this with a typical review from psychology and related disciplines. Sedlmeier and Gigerenzer (1989, Table) report power to detect a medium effect size ranging from 37% to 89%. David Clark-Carter (1997) reviewed papers in the British Journal of Psychology and found power to detect a median effect of 59%. Thus the power of typical research in psychology is not that high, but (if we make fairly reasonable assumptions about the size of typical effects in the discipline) estimates appear to be around 60% rather than the 20% found in the Button et al. paper.

What caught my interest, however, was some of the responses to the publication in blogs and blog comments. For example, one of the comments on Ed Yong's piece stated:
Another argument for parallel recording. Traditional, one-neuron-at-a-time neurophysiological papers study 10s of neurons. Multi-electrode studies have 100s or 1000s of neurons. Enough power? Maybe not, but way more power than single neuron recording.
A similar sentiment arises in Matt Wall's piece:
MRI scanners have significantly improved in the last ten years, with 32 or even 64-channel head-coils becoming common, faster gradient switching, shorter TRs, higher field strength, and better field/data stability all meaning that the signal-to-noise has improved considerably. This serves to cut down one source of noise in fMRI data – intra-subject variance. The inter-subject variance of course remains the same as it always was, but that’s something that can’t really be mitigated against, and may even be of interest in some (between-group) studies. On the analysis side, new multivariate methods are much more sensitive to detecting differences than the standard mass-univariate approach.

Matt's piece is thoughtful and I would agree with much of what he writes, but the idea that increasing observations within a person will do much to resolve the problem is probably not correct (and for reasons that Matt mentions). To understand why, consider the typical nature of the experimental designs being used. As I understand it there are essentially two main types of design: a nested repeated measures design or a factorial design with fully crossed random effects. There are many variants (e.g., additional layers of nesting, additional fully crossed random factors), but these two types capture the characteristics of most of the designs I'm familiar with in cognitive neuroscience (and possibly in many other areas of neuroscience).

In a nested repeated measures design there are m measurements within each of n persons. The multiple measurements are correlated in some way so - in general - the design has an effective sample size that is less than N (where N = n * m). It turns out that for most such designs the limiting factor in power and precision is n and not m or N.
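To get a feel for why n rather than m is the limiting factor, here is a minimal sketch in Python. It assumes the simplest possible case (my assumption, not something taken from the Button et al. paper): all measurements within a person share a common intraclass correlation, icc, and the quantity of interest is an overall mean. Under that assumption the standard design-effect formula gives the effective sample size as N / (1 + (m - 1) * icc). The function name and the example values are purely illustrative.

```python
def effective_sample_size(n_people, m_measures, icc):
    """Effective sample size under a common intraclass correlation.

    Assumes every pair of measurements within a person correlates at `icc`
    (an illustrative simplification of a nested repeated measures design).
    """
    total = n_people * m_measures                # N = n * m
    design_effect = 1 + (m_measures - 1) * icc   # inflation due to correlation
    return total / design_effect


# Hold n fixed at 20 people and keep adding measurements per person:
# the effective sample size creeps towards n / icc and then stalls.
for m in (1, 10, 100, 1000):
    print(m, round(effective_sample_size(20, m, 0.3), 1))

# Doubling the number of people with only 10 measurements each does better
# than 1000 measurements on the original 20 people.
print(round(effective_sample_size(40, 10, 0.3), 1))
```

However large m becomes, the effective sample size cannot exceed n / icc in this simplified model, which is one way of seeing why adding people usually buys more than adding measurements.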

This isn't always true, but generally experimental designs get refined quite quickly to reduce the impact of sources of error in the repeated measurements. This could be by increasing the number of trials, by tightening up the experimental procedures (e.g., instructions, quality of materials) or by technical advances that reduce measurement error for each measurement occasion. Once you get measurement error per trial moderately low, reducing it further has very little impact on power. That's because the error at each measurement occasion includes transient error that can't really be eliminated (many behaviours are just inherently variable from occasion to occasion) and because as you reduce these errors the other sources of error in the study become the main limiting factors on power.

For example, when I was a PhD student many reaction time experiments used computers with dodgy clocks that couldn't time more accurately than 1/60th of a second or around 17 ms (and perhaps many still do). If you are looking for a priming effect of say 30 milliseconds this would seem like a major problem. However, you can get pretty accurate inferences without much bias or loss of power as long as the variability of the RTs is fairly large - which it generally is (Ulrich & Giray, 1989). For most neuroscience work involving humans the limiting factors in power (once you are dealing with a reasonably refined experimental set-up) are therefore related to n.

A further consideration is that top level n generally needs to be in the 30-50 range or (preferably) greater just to get vaguely reasonable estimates of the variances and covariances if you are dealing with data sampled from approximately normal distributions. Smaller samples also make the study more vulnerable to an atypical 'outlier' at the person level (e.g., a participant using a weird strategy or responding randomly) or to selective bias by the experimenters (dropping a 'noisy' participant because they go against the hypothesis). Having small n at the top level may also make a focus on statistical significance rather than interval estimates of effects more attractive (because it reduces precision of measurement). In other words it encourages studies that find 'evidence' of an effect and discourages focus on accurate estimates of the size of an effect.
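The point about dodgy clocks is easy to check with a small simulation. The sketch below is not the analysis in Ulrich and Giray (1989); it is just an illustration under assumptions I have made up for the purpose: normally distributed RTs with a standard deviation of 100 ms, a true priming effect of 30 ms, and a clock that rounds every RT up to the next 1/60 s tick.

```python
import math
import random
import statistics

TICK_MS = 1000 / 60  # clock resolution of roughly 17 ms


def quantize(rt_ms, tick_ms=TICK_MS):
    """Round an RT up to the next clock tick (a hypothetical 'bad clock')."""
    return math.ceil(rt_ms / tick_ms) * tick_ms


def simulate(n_trials=2000, effect_ms=30.0, mean_ms=500.0, sd_ms=100.0, seed=1):
    """Return the priming effect estimated from exact and from quantized RTs."""
    rng = random.Random(seed)
    primed = [rng.gauss(mean_ms - effect_ms, sd_ms) for _ in range(n_trials)]
    unprimed = [rng.gauss(mean_ms, sd_ms) for _ in range(n_trials)]

    exact_effect = statistics.fmean(unprimed) - statistics.fmean(primed)
    coarse_effect = (statistics.fmean([quantize(rt) for rt in unprimed])
                     - statistics.fmean([quantize(rt) for rt in primed]))
    return exact_effect, coarse_effect


exact, coarse = simulate()
print(f"effect with exact timing: {exact:.2f} ms")
print(f"effect with 1/60 s clock: {coarse:.2f} ms")
```

Because the trial-to-trial variability is large relative to the clock tick, the rounding error behaves like a small amount of extra noise that is about the same in both conditions, so it largely cancels out of the difference between the means.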

For fully crossed random factor designs the situation is worse. In these designs you sample both people and stimuli (e.g., faces, words, etc.) from a large (conservatively assumed to be infinite) population. The limiting factor on power now probably depends not on n1 (the number of people) or n2 (the number of stimuli) individually but on the smaller of n1 and n2 (assuming you want to make inferences that generalise to people and stimuli not in your experiment). Thus having 1000 people has little effect on power if your study uses only two faces (and you want to make general inferences about face perception rather than perception of those two faces). This is a slight oversimplification - as it assumes that the stimuli and people are equally variable in terms of what you measure - however it is a good rule of thumb unless variability in either people or stimuli is large enough to swamp the other source.
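A rough way to see the rule of thumb is to look at the standard error of an overall mean in the simplest fully crossed model, where each observation is a grand mean plus a person effect, a stimulus effect and residual error, and every person responds to every stimulus. The sketch below assumes exactly that model (real designs are messier) and the function name and standard deviations are placeholders, not estimates from any study.

```python
import math


def se_grand_mean(n_people, n_stimuli, sd_people, sd_stimuli, sd_residual):
    """Standard error of the grand mean with crossed random effects.

    Assumes y_ij = mu + person_i + stimulus_j + error_ij with independent
    random effects and every person responding to every stimulus.
    """
    variance = (sd_people ** 2 / n_people
                + sd_stimuli ** 2 / n_stimuli
                + sd_residual ** 2 / (n_people * n_stimuli))
    return math.sqrt(variance)


# Equal variability in people and stimuli (illustrative values only).
many_people_few_faces = se_grand_mean(1000, 2, 1.0, 1.0, 1.0)
balanced_design = se_grand_mean(30, 30, 1.0, 1.0, 1.0)

print(f"1000 people, 2 stimuli: SE = {many_people_few_faces:.3f}")
print(f"30 people, 30 stimuli:  SE = {balanced_design:.3f}")
```

With only two stimuli the stimulus term dominates the variance, so the 1000 participants barely help; the much smaller balanced design is far more precise.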

There is also an important caveat here - I'm assuming that you do the statistics correctly. Many, many studies still analyse fully crossed random factor designs as if they are nested, resulting in spuriously high power (see here for an earlier blog post on this).

This analysis should hold whenever: i) the basic experimental procedure is fairly well-refined, ii) variability between people (or stimuli in appropriate designs) on the measures of interest is non-negligible. Thus it should hold more often than not in psychology and related areas of neuroscience. There are undoubtedly subfields in which it won't hold (e.g., some areas of vision research where n = 2 studies are common because individual differences on the crucial effects are low).

Postscript

One objection to my conclusion is that if neuroscience power is limited by number of participants and number of stimuli, why do small samples persist? This is a good question. I offer three main answers: i) As with psychology (where power is also generally low, remember) you can have low power for each test if you have multiple tests. Maxwell (2004) pointed out that a typical 2 x 2 factorial design (with three tests: two main effects and an interaction) might only have 50% power per test but that means an 87.5% chance of at least one significant result. Thus low power generally produces something statistically significant (though it also predicts that replications will generally fail to show consistent patterns of statistical significance), ii) researcher degrees of freedom (see Simmons et al., 2011), and iii) many research teams run many small studies (e.g., undergraduate and masters projects) so (in some cases) there are many unreported studies with null results.
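The 87.5% figure follows from simple arithmetic if you treat the three tests in a 2 x 2 design (two main effects and the interaction) as independent, each with 50% power. A minimal sketch of the calculation, with the independence of the tests as a stated assumption:

```python
def prob_at_least_one_significant(power_per_test, n_tests):
    """Chance that at least one of n_tests independent tests is significant.

    Assumes a true effect for every test and independence between tests.
    """
    return 1 - (1 - power_per_test) ** n_tests


# Two main effects plus one interaction in a 2 x 2 design, 50% power each.
print(prob_at_least_one_significant(0.5, 3))  # 0.875
```

The same function shows how quickly this grows: with more tests per study, even quite low power per test makes at least one significant result very likely.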

References

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: causes, consequences, and remedies. Psychological Methods, 9, 147–63.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-66.

Ulrich, R., & Giray, M. (1989). Time resolution of clocks: Effects on reaction time measurement - Good news for bad clocks. British Journal of Mathematical & Statistical Psychology, 42, 1-12.

Wednesday, April 10, 2013

Reflecting on the end of history illusion illusion

A while back Jon Sutton at The Psychologist asked my opinion on the end of history illusion. This was sparked by an interesting Science paper by Quoidbach, Gilbert and Wilson. Blogger and mathematician Jordan Ellenberg had written a blog post arguing that the paper makes a mistake: "a somewhat subtle mistake, but a bad mistake, and one which kills a big chunk of the paper".

Jon wanted a second opinion, and after a bit of reading I replied that Ellenberg's criticisms were valid. I meant to blog about it at the time but got caught up in other things. Consequently I missed the BPS research digest piece on it. 

I am writing this blog post because the flaw that Ellenberg spotted is quite interesting in its own right and because both the description by Ellenberg and the description in the Research Digest article probably don't explain it clearly enough for some readers to appreciate. Ellenberg's piece is (I hasten to add) crystal clear but relies on a reader being comfortable with the formal, mathematical approach he takes (which many psychologists won't be). The Research Digest description just gives the brief gist (with a link to Ellenberg for the full picture). Here is my belated attempt at a psychologist-friendly interpretation with no formal notation - and as little maths as possible.


According to the end of history illusion people underestimate how much they will change in the future. For example, someone asked to predict how their personality would change in the next ten years would come up with a prediction closer to their original position than their actual position. Quoidbach et al. tested this mainly by asking people to predict future values on some psychological variable (e.g., a personality test score) and then showing that actual change is much greater than the difference between the original and predicted scores. This seems highly plausible, but Ellenberg pointed out that the difference in the predicted and original scores is a different quantity from the expected (absolute) change in scores.

Why is this? Perhaps the easiest way to understand is to work through a simple example. Imagine that my extraversion score is 50 on a scale that goes from 0 (extremely introverted) to 100 (extremely extraverted). A researcher then asks me to predict my extraversion score in 10 years' time. I, being a keen observer of human nature (bear with me on this if you know me - it is just an example), am aware that personality is not fixed and judge that I am likely to change quite a bit - say 15 points - on the scale. However, I might get more extraverted or I might get more introverted (depending on how life treats me over the next ten years). Given that I'm in the middle of the scale, I could end with a score of 35 or a score of 65. Thus I predict that my extraversion score after 10 years will be (35 + 65)/2 = 50. It looks as though I've predicted zero change, when what I've done is give the best prediction I can (one that minimizes my prediction error). Had I instead been asked to give the absolute change I expected, my answer would have been different. It would have been (15 + 15)/2 = 15 (not zero).
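For anyone who prefers to see the two quantities side by side, the example above can be written out in a few lines of Python. The numbers are the ones from the example; treating the two outcomes (35 and 65) as equally likely is an assumption of the example, and the function names are just labels I have chosen.

```python
def predicted_score(outcomes):
    """Best single prediction when all outcomes are equally likely (the mean)."""
    return sum(outcomes) / len(outcomes)


def expected_absolute_change(current, outcomes):
    """Average size of the change, ignoring its direction."""
    return sum(abs(score - current) for score in outcomes) / len(outcomes)


current = 50
outcomes = [35, 65]  # 15 points more introverted or 15 points more extraverted

prediction = predicted_score(outcomes)
print(abs(prediction - current))                    # 0.0  (looks like 'no change')
print(expected_absolute_change(current, outcomes))  # 15.0 (the change I expect)
```

The first number is what you get by comparing predicted and original scores; the second is the quantity the end of history illusion is really about.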

Although the example is simple it captures the essence of the problem. Commenters on Ellenberg's blog looked again at the raw data that Quoidbach et al. provided. According to their analyses the end of history illusion largely disappears when analyzed correctly (though only some of the data sets support such a reanalysis). Thus if the end of history illusion effect exists (and the basic premise seems highly plausible) it is quite probably a much smaller and more fragile effect than originally thought. That makes sense to me - because I'm not sure that such a bias could be both pervasive and large in the face of the counter-evidence available to people about past change in themselves and change in others.

My continued interest in the effect is slightly different. There seems to be a cognitive illusion at work here - one that makes the difference between the original score and predicted score appear to be a good measure of an entirely different quantity - the expected absolute change in score ...






Monday, January 28, 2013

The growth of Bayesian methods in psychology

The British Journal of Mathematical and Statistical Psychology has published a target article (with commentaries and reply) by Andrew Gelman and Cosma Shalizi on philosophy and the practice of Bayesian statistics.

Mark Andrews and I introduce the target article with an editorial aimed at providing some background to psychologists who are interested in Bayesian statistics but need a little back story. Our main aim was to try and indicate that the debate about Bayesian statistics has moved on from the frequentist vs. Bayesian argument and on to more interesting territory - illustrated both by the target article and the commentaries.

Also I believe that as of writing access is free to the target article and commentary ...


Andrews, M., & Baguley, T. (2013). Prior approval: The growth of Bayesian methods in psychology. British Journal of Mathematical and Statistical Psychology, 66, 1–7. doi:10.1111/bmsp.12004


Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66, 8–38. doi:10.1111/j.2044-8317.2011.02037.x




