Tuesday, August 21, 2012

Yet more on p values ...

I wasn't going to post on this ... but couldn't resist. A recent QJEP paper reports suspicious patterns in p values across three psychology journals.

This has been blogged elsewhere (see here and here), so I haven't got too much to add. Although I generally like the paper and am glad it got published in a decent journal (I'm an EPS member and subscriber so I'm glad they published it), I can't say I find the main finding surprising. Everything we know about how significance testing is used in practice would predict the basic pattern of a bump just below p = .05.

My main criticism of the paper would be that they didn't use a multilevel model to account for dependencies in p values from the same paper - while I agree with the authors that the outcome wouldn't change, I think it could be informative to model the dependency (and a simple multilevel model shouldn't take more than a few minutes to set up and run if the data are coded by study). An even better approach for future work might be to sample papers from multiple journals and thus estimate the stability of the pattern across journals, disciplines etc.

The main reason I'm blogging is because of a few points raised at other blogs. The generally excellent BPS research digest reported:

Masicampo and Lalande said their findings pointed to the need to educate researchers about the proper interpretation of null hypothesis significance testing and the value of alternative approaches, such as reporting effect sizes and confidence intervals. " ... [T]he field may benefit from practices aimed at counteracting the single-minded drive toward achieving statistical significance," they said.

I couldn't find the effect sizes comment in the paper (but it may have been from conversations with the authors). I'm pretty sure all the journals surveyed reported effect sizes - point estimates of the magnitude of the effect. Perhaps they meant standardized effect size, but if so the advice is in my view doubly wrong. Standardized effect size has all sorts of problems (see here and here for my take).

I won't repeat myself too much over this, but point out that standardized effect size generally obscures the thing you want to measure (as well as distorting it in several ways). That is the first reason why it isn't a solution to dodgy analyses - it will tend to make it harder for people to spot the problems in a study when there are meaningful, well-understood units (such as percentage recall, response times and so forth). It can maybe, sometimes, help you interpret arbitrary measures - but generally only if you don't know the range of the scale. For instance, a shift of 3 points on a 7 point scale is huge and might make people suspicious whereas an effect size estimate of d = .6 or r = .4 might not.

The second reason is that effect size in general (and standardized effect size in particular) is at least as vulnerable as p values to publication bias, optional stopping and all the other known problems with statistical significance as routinely practiced. Combine small studies with a threshold for publication (e.g., p < .05 or d > 0.5) and effect size is substantially biased upwards (the bias is lower for large studies). Standardized effect size makes this worse because small samples tend to underestimate population variance and thus inflate the effect size estimate yet further. The final straw is that standardized effect size is easy to game. For instance, Cohen's d type metrics for a paired design come in at least four flavours (removing between-subject variance or not, crossed with a pooled or non-pooled error standard deviation). Removing the between-subject effect  increases the reported d value artificially relative to an independent subjects design (even though the magnitude of the effect is generally assumed not to change). A similar problem occurs with variance explained measures such as eta-squared. People generally report partial eta-squared inappropriately in preference to classical eta-squared (which is generally smaller except in very simple designs).*
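To make the point about paired designs concrete, here is a minimal sketch using made-up recall scores (the data and variable names are purely illustrative, not taken from any study). It computes the same raw mean difference and then standardizes it two ways: by the pooled standard deviation of the raw scores (between-subject variance left in) and by the standard deviation of the difference scores (between-subject variance removed). These are only two of the possible flavours.

```python
import statistics

# Hypothetical paired data: percentage recall for the same 8 people
# in two conditions (invented for illustration).
before = [52, 61, 45, 70, 58, 66, 49, 73]
after = [56, 64, 50, 73, 61, 71, 52, 78]

diffs = [a - b for a, b in zip(after, before)]
mean_diff = statistics.mean(diffs)  # raw effect in percentage points

# Flavour A: standardize by the pooled SD of the raw scores
# (between-subject variance left in, as for independent groups).
sd_pooled = ((statistics.variance(before) + statistics.variance(after)) / 2) ** 0.5
d_raw = mean_diff / sd_pooled

# Flavour B: standardize by the SD of the difference scores
# (between-subject variance removed).
d_diff = mean_diff / statistics.stdev(diffs)

print(f"raw difference = {mean_diff:.3f} percentage points")
print(f"d (pooled raw SD) = {d_raw:.2f}")
print(f"d (SD of differences) = {d_diff:.2f}")
```

With these invented numbers the raw difference is under four percentage points in both cases, yet the second d is much larger than the first simply because people's scores are highly correlated across conditions. The unstandardized difference is the same whichever flavour is reported.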

So, effect size is a red herring. The point on confidence intervals (and interval estimates in general) is a good one.

The piece at The Scholarly Kitchen (TSK) annoyed me a little. The tone implied that the problem with p values is a problem in psychology. It isn't. It is a problem in any field that obsesses about p values (including most quantitative work in social sciences, medicine and science). I'm prepared to bet you'd get the same patterns in medical research (and they could well be much worse given some of the pressures to publish positive results in that area). More interestingly, TSK notes that JEP: General fared better than the other two journals looked at. TSK speculates that this is a result of better editing and reviewing. As it is the only one of the three journals I have reviewed for, I am disposed to agree with this. However, I suspect it may simply be that JEP: General publishes papers with methods that dilute the bias. In particular, I think it publishes more papers that focus on model fit and hence look for p > .05 more often.

The QJEP paper also mentions replicability statistics in passing. Another red herring, I think. Replicability can't be reliably estimated from small samples (see here).


* Actually, if you have to report eta-squared measures you should use generalized eta-squared (see Olejnik & Algina, 2003; Baguley, 2012).

References


Baguley, T. (2012, in press). Serious stats: A guide to advanced statistics for the behavioral sciences. Basingstoke: Palgrave.
Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8, 434-447.





Saturday, August 04, 2012

What's up with social psychology?

... or to be more precise, what's up with experimental social psychology?

A number of high profile cases of suspected (in some cases admitted) fraud have been highlighted in psychology recently - my own discipline - but they've (nearly) all arisen in experimental social psychology. If you are unaware, the best known cases seem to be Diederik Stapel, Dirk Smeesters and now Lawrence Sanna. Another high profile case, Marc Hauser, is in a somewhat related field (but it is a stretch to call it experimental social psychology). The not so recent case of Karen Ruggiero could also be included.

A separate problem - setting aside deliberate fraud - is the controversy over specific studies whose effects have apparently been hard to replicate. The discussion here is around Bargh's priming study and Bem's ESP study. There is ample discussion of this elsewhere, but the main point is that the standard practices in experimental social psychology may encourage publication of spurious effects.

Fraud and other kinds of academic misconduct are rare and far from confined to psychology - see retraction watch (though the scale of Stapel's fraud may have raised psychology's profile on its own). However, the spotlight is focusing heavily on social psychology at the moment. My initial view was that experimental social psychology was coming up purely by coincidence, but the recent cases have made me wonder. In the rest of this post I'm going to sketch out some thoughts on what might be going on.

(1) Coincidence. There remains quite good evidence for the whole thing being coincidence. Psychology and social psychology are popular fields with lots of researchers so there will (sadly) be a few frauds. Deliberate fraud is a rare event and (to quote the late, great Robert Abelson) "probability is lumpy". Discrete random events only appear to be evenly or smoothly distributed in the long run (averaging over many events), so rare events are usually clustered in a given sample of small, fixed n. So if you look at 10 or 100 fraud cases there are bound to be clusters among certain disciplines.
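The lumpiness argument is easy to check with a rough simulation. The sketch below assumes, purely for illustration, that 10 cases fall independently and with equal probability on 10 equally sized fields (real disciplines obviously differ in size, and the function name and defaults are my own). It estimates how often at least one field ends up with three or more cases by chance alone.

```python
import random
from collections import Counter

def prob_cluster(n_cases=10, n_fields=10, min_cluster=3, n_sims=20000, seed=1):
    """Estimate P(some field gets >= min_cluster cases) when cases are
    assigned to fields independently and uniformly at random."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Assign each case to a random field and count cases per field.
        counts = Counter(rng.randrange(n_fields) for _ in range(n_cases))
        if max(counts.values()) >= min_cluster:
            hits += 1
    return hits / n_sims

p = prob_cluster()
print(f"P(at least one field with 3+ cases) is about {p:.2f}")
```

Under these assumptions the estimate comes out at around 0.6, so a cluster of three or more cases in one field is more likely than not even when no field is any more fraud-prone than another.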

(2) Deep discipline-specific flaws. Is there something fundamentally wrong with experimental social psychology research itself? Perhaps. The Bem and Bargh cases point to problems such as lack of replication, pressure to publish, over-emphasis on p values, intolerance of messy data and desire for surprising or counter-intuitive effects. The problem with most of these arguments is that they are not discipline-specific and they are often cited as factors leading to fraud in other disciplines. On the other hand it may be that one or more of these factors are particularly pronounced in experimental social psychology (and I'll come back to this point later).

(3) Enhanced scrutiny. There are three strong reasons to suspect enhanced scrutiny contributes to the recent cases. First, the reports of fraud or other problems are not independent. A case - particularly a big one such as the Diederik Stapel case - necessarily draws further scrutiny to particular journals, groups of researchers and perhaps the whole of a field or discipline. Second, several of the cases were uncovered by the same person: Uri Simonsohn. As Simonsohn works broadly in the area of experimental social psychology, it isn't that surprising that he applied his fraud detection tools to suspicious studies in his own field. Third, findings in experimental social psychology compete for explanatory power more closely with folk psychology explanations than most other fields. To put it another way, just about everyone can assess the plausibility of a study that looks at whether exposing people to old age related material in the lab makes them walk more slowly when they leave the lab or whether eating meat makes you more aggressive. Moreover, they often have strong opinions on these kinds of findings.

At the moment, I think the enhanced scrutiny explanation looks like a strong contender to me. I wouldn't rule out coincidence, but I think we can expect to see a few more dodgy studies unearthed. I think we can also expect to see the label of social psychology expanded to include suspect research in related areas (such as Marc Hauser's work).

Nevertheless, I do think there is one area in which experimental social psychology may be particularly vulnerable to fraud or questionable research practices. High status journals often seek interesting (aka surprising) effects and large effect sizes in the papers they publish. Such findings are more likely to be false (e.g., see here). This is part of a general problem with statistical significance, which acts as a filter (see Andrew Gelman's blog for lots of discussion on this). A single small experiment can usually only detect relatively 'big' effects - hence it overestimates the size of effects. When you add an implicit requirement for 'big' effects you are biasing your journal or discipline towards spurious and fraudulent results. Thus far experimental social psychology isn't so different from other fields where small studies are common (e.g., much of medicine, health, neuroscience, biology, and education). The problem may be that effects are inherently smaller in experimental social psychology than other areas of psychology.

I've put the label 'big' in quotes because what we're really talking about is the detectability or discriminability of an effect (standardized effect size) - which is its size relative to the noise or error in the data. Experiments with social stimuli are inherently noisy because there are so many variables to control for and because it is often difficult to use big manipulations (as they tend to be pretty obvious to participants). Of course many of the effects may truly be tiny. For example the age priming effect seems plausible to me but I can't believe it would be a large absolute effect in terms of walking speed (easily swamped by other factors or exaggerated by them) - thus my guess is that the original Bargh study over-estimated the effect size (as most early studies tend to).
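The filtering argument can be sketched with a short simulation. The assumptions here are mine and deliberately simple: two independent groups, normally distributed data with a known SD of 1 (so a z-test applies and the mean difference is already in SD units), a true effect of 0.2, and 'publication' only when p < .05 two-sided. The function name and parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean

def published_effect(true_d, n_per_group, n_sims=20000, alpha=0.05, seed=1):
    """Mean estimated effect among simulated studies reaching p < alpha.

    Simplifying assumptions: two independent groups, normal data with a
    known SD of 1, so the mean difference is already in SD units.
    """
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5              # standard error of the difference
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(n_sims):
        # Draw the observed mean difference from its sampling distribution.
        estimate = rng.gauss(true_d, se)
        # Keep ('publish') the study only if the two-sided z-test is significant.
        if abs(estimate / se) > z_crit:
            significant.append(estimate)
    return mean(significant)

small = published_effect(true_d=0.2, n_per_group=20)
large = published_effect(true_d=0.2, n_per_group=200)
print(f"true effect = 0.2; small studies report {small:.2f}, large studies {large:.2f}")
```

Both averages exceed the true value of 0.2, and the small-study average exceeds it by far more: with 20 per group an estimate has to be several times the true effect just to pass the filter. A real analysis would use t-tests and estimated variances, but the direction of the bias is the same.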

I think that social psychology and psychology will learn from these cases and the increased scrutiny that seems to be around. I hope we will improve our statistical work, place greater value on replication and reduce the ridiculous pressure to publish ground-breaking, surprising, counter-intuitive work with high frequency. Ground-breaking work will get published, but you can't really tell what research will have real scientific impact until years later (at least two or three years and often much longer, in my view). I hope that psychologists (particularly editors and reviewers) will be more tolerant of messy data (see here) and not quite perfectly watertight conclusions. Many fraudulent studies are detected because of data that are far too clean (real data tend to be messy).
