I wasn't going to post on this ... but couldn't resist. A recent QJEP paper reports suspicious patterns in p values across three psychology journals.
This has been blogged elsewhere (see here and here), so I haven't got too much to add. Although I generally like the paper and am glad it got published in a decent journal (as an EPS member and subscriber I'm pleased to see it in QJEP), I can't say I find the main finding surprising. Everything we know about how significance testing is used in practice would predict the basic pattern of a bump just below p = .05.
My main criticism of the paper would be that they didn't use a multilevel model to account for dependencies among p values from the same paper - while I agree with the authors that the outcome wouldn't change, I think it could be potentially informative to model the dependency (and a simple multilevel model shouldn't take more than a few minutes to set up and run if the data are coded by study). An even better approach for future work might be to sample papers from multiple journals and thus estimate the stability of the pattern across journals, disciplines, and so on.
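For what it's worth, here is a minimal sketch of the kind of random-intercept model I have in mind, using statsmodels in Python. The data are simulated and all variable names are mine (hypothetical), not from the QJEP paper - the point is only that once p values are coded by paper, fitting the model takes a few lines:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Hypothetical data: 50 papers reporting 3 p values each, with a small
# paper-level shift to mimic dependency among p values from one study
papers = np.repeat(np.arange(50), 3)
paper_shift = rng.normal(0, 0.05, 50)[papers]
p = np.clip(rng.beta(1, 4, size=papers.size) + paper_shift, 0.001, 0.999)
d = pd.DataFrame({"paper": papers, "p": p})

# Random intercept per paper absorbs the within-paper dependency
fit = smf.mixedlm("p ~ 1", d, groups=d["paper"]).fit()
print(fit.summary())
```

If the estimated paper-level variance turns out to be negligible, that would support the authors' view that modelling the dependency wouldn't change the outcome.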
The main reason I'm blogging is because of a few points raised at other blogs. The generally excellent BPS research digest reported:
Masicampo and Lalande said their findings pointed to the need to educate researchers about the proper interpretation of null hypothesis significance testing and the value of alternative approaches, such as reporting effect sizes and confidence intervals. " ... [T]he field may benefit from practices aimed at counteracting the single-minded drive toward achieving statistical significance," they said.
I couldn't find the effect sizes comment in the paper (but it may have come from conversations with the authors). I'm pretty sure all the journals surveyed reported effect sizes - point estimates of the magnitude of the effect. Perhaps they meant standardized effect size, but if so the advice is in my view doubly wrong. Standardized effect size has all sorts of problems (see here and here for my take).
I won't repeat myself too much over this, but will point out that standardized effect sizes generally obscure the thing you want to measure (as well as distorting it in several ways). That is the first reason why they aren't a solution to dodgy analyses - they will tend to make it harder for people to spot the problems in a study when there are meaningful, well-understood units (such as percentage recall, response times, and so forth). They can sometimes help you interpret arbitrary measures - but generally only if you don't know the range of the scale. For instance, a shift of 3 points on a 7-point scale is huge and might make people suspicious, whereas an effect size estimate of d = .6 or r = .4 might not.
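To make that concrete, a toy calculation (the SD of 1.2 is an assumed, illustrative value for a 7-point rating scale, not taken from any of the papers):

```python
def cohens_d(mean_diff, sd):
    """Standardized mean difference: raw difference divided by the SD."""
    return mean_diff / sd

# Assume an SD of 1.2 on a 7-point rating scale (illustrative value)
# A 3-point shift gives d = 3 / 1.2 = 2.5 - obviously suspicious in raw units
print(round(cohens_d(3.0, 1.2), 2))
# A reported d = .6 corresponds to a raw shift of only 0.72 scale points,
# but the reader of the standardized value alone cannot tell
print(round(cohens_d(0.72, 1.2), 2))
```

The standardized value strips out exactly the scale information that would let a reader judge whether the effect is plausible.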
The second reason is that effect size in general (and standardized effect size in particular) is at least as vulnerable as p values to publication bias, optional stopping, and all the other known problems with statistical significance as routinely practiced. Combine small studies with a threshold for publication (e.g., p < .05 or d > 0.5) and effect size estimates are substantially biased upwards (the bias is lower for large studies). Standardized effect size makes this worse because small samples tend to underestimate population variance and thus inflate the effect size estimate yet further.

The final straw is that standardized effect size is easy to game. For instance, Cohen's d type metrics for a paired design come in at least four flavours (removing between-subject variance or not, crossed with a pooled or non-pooled error standard deviation). Removing the between-subject effect artificially increases the reported d value relative to an independent subjects design (even though the magnitude of the effect is generally assumed not to change). A similar problem occurs with variance-explained measures such as eta-squared: people generally report partial eta-squared inappropriately in preference to classical eta-squared (which is generally smaller except in very simple designs).*
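A quick simulation illustrates the upward bias. This is my own sketch, not anything from the QJEP paper: it assumes a true standardized effect of 0.3, small two-group studies with n = 10 per group, and a crude publication filter of observed d > 0.5:

```python
import random
import statistics

random.seed(1)
TRUE_D, N, STUDIES = 0.3, 10, 20000  # assumed values for illustration

published = []
for _ in range(STUDIES):
    control = [random.gauss(0.0, 1.0) for _ in range(N)]
    treated = [random.gauss(TRUE_D, 1.0) for _ in range(N)]
    # Pooled SD across the two equal-sized groups
    pooled_sd = ((statistics.variance(control) + statistics.variance(treated)) / 2) ** 0.5
    d_obs = (statistics.mean(treated) - statistics.mean(control)) / pooled_sd
    if d_obs > 0.5:            # crude publication filter on effect size
        published.append(d_obs)

print(f"true d = {TRUE_D}, mean published d = {statistics.mean(published):.2f}")
```

The surviving studies report a mean d well above the true 0.3. With a p < .05 filter instead, the inflation also shrinks as n grows, because larger studies can detect the true effect without needing an overestimated one.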
So, effect size is a red herring. The point on confidence intervals (and interval estimates in general) is a good one.
The piece at The Scholarly Kitchen (TSK) annoyed me a little. The tone implied that the problem with p values is a psychology problem. It isn't. It is a problem in any field that obsesses about p values (including most quantitative work in the social sciences, medicine, and science generally). I'm prepared to bet you'd get the same patterns in medical research (and they could well be much worse given some of the pressures to publish positive results in that area). More interestingly, TSK notes that JEP: General fared better than the other two journals looked at. TSK speculates that this is a result of better editing and reviewing. As it is the only one of the three journals I have reviewed for, I am disposed to agree with this. However, I suspect it may simply be that JEP: General publishes papers with methods that dilute the bias. In particular, I think it publishes more papers that focus on model fit and hence look for p > .05 more often.
The QJEP paper also mentions replicability statistics in passing. Another red herring, I think. Replicability can't be reliably estimated from small samples (see here).
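A toy simulation of why (my own sketch, using an assumed "naive replicability" estimate: the power of an exact same-size replication, taking the observed standardized effect at face value):

```python
import math
import random
import statistics

random.seed(2)
TRUE_D, N = 0.4, 15   # assumed true effect and per-group n, for illustration

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

estimates = []
for _ in range(2000):
    control = [random.gauss(0.0, 1.0) for _ in range(N)]
    treated = [random.gauss(TRUE_D, 1.0) for _ in range(N)]
    sd = ((statistics.variance(control) + statistics.variance(treated)) / 2) ** 0.5
    d_obs = (statistics.mean(treated) - statistics.mean(control)) / sd
    # Naive replicability: power of an identical replication at the observed d
    estimates.append(norm_cdf(d_obs * math.sqrt(N / 2) - 1.96))

print(f"replicability estimates range from {min(estimates):.3f} to {max(estimates):.3f}")
```

Every simulated study here has the same true effect and the same n, yet the implied replicability swings between near 0 and near 1 - which is why such estimates from a single small study are close to worthless.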
* Actually, if you have to report eta-squared measures you should use generalized eta-squared (see Olejnik & Algina, 2003; Baguley, 2012).