I quickly threw together a simulation to address this in R. It is pretty limited (as I don't have much time right now), but potentially interesting. It simulates independent t test p values where the samples are drawn from independent, normal distributions with equal variances but different means (and n = 25 per group). The population standardized effect size is fixed at d = 0.5 (as psychology research generally reports median effect sizes around this value). Fixing the parameters is unrealistic, but is perhaps OK for a quick simulation.
I ran this several times and plotted p curves (really just histograms with bins collecting p values at relevant intervals). First I plotted for an early career researcher with just a few publications reporting 50 p values. I then repeated for more experienced researchers with n = 100 or n = 500 published p values.
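A minimal sketch of the kind of simulation described above, written in Python with only the standard library. It is not the original R code: the function names (`two_sided_p`, `simulate_p`, `p_curve`), the seed, and the choice of .01-wide bins up to p = .06 are assumptions made for illustration. The two-sided p value is obtained by numerically integrating the t density with Simpson's rule.

```python
import math
import random
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=400):
    """Two-sided p value: 1 - 2 * integral of the t density over [0, |t|]."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)

def simulate_p(rng, n=25, d=0.5):
    """One independent-samples t test: equal variances, means 0 and d."""
    g1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    g2 = [rng.gauss(d, 1.0) for _ in range(n)]
    # With equal group sizes the pooled variance is the mean of the two.
    pooled_var = (statistics.variance(g1) + statistics.variance(g2)) / 2
    se = math.sqrt(pooled_var * 2 / n)
    t = (statistics.mean(g2) - statistics.mean(g1)) / se
    return two_sided_p(t, 2 * n - 2)

def p_curve(p_values, width=0.01, upper=0.06):
    """Count p values in bins of the given width from 0 up to `upper`."""
    n_bins = round(upper / width)
    counts = [0] * n_bins
    for p in p_values:
        idx = int(p / width)
        if idx < n_bins:
            counts[idx] += 1
    return counts

rng = random.Random(2012)  # arbitrary seed, fixed for reproducibility
for n_p in (50, 100, 500):
    p_values = [simulate_p(rng) for _ in range(n_p)]
    print(n_p, p_curve(p_values))
```

Rerunning the loop with different seeds gives a feel for how much the binned counts vary from one simulated researcher to the next.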
Here are the 15 random plots for 50 p values:
At least one of the plots has a suspicious spike between p = .04 and .05 (exactly where dodgy practices would tend to push the p values).
What about 100 p values?
Here the plots are still variable (but closer to the theoretical ideal plotted on Kraus' blog).
You can see this pattern even more clearly with 500 p values:
Some quick conclusions ... The method is too unreliable for use with early career researchers. You need a few hundred p values to be pretty confident of a nice flat pattern between p = .01 and p = .06. Varying the effect size and other parameters might well inject further noise (as would adding in null effects which have a uniform distribution of p values and are thus probably rather noisy).
I'm also skeptical that this is useful for detecting fraud (as presumably deliberate fraud will tend to go for 'impressive' p values such as p < .0001). Also (going forward) fraudsters will be able to generate results to circumvent tools such as p curves (if they are known to be in use).



Cool!
But I am not sure that the fact that 1 in 50 p-curves obtained from legitimate research practice look a bit dodgy means that the measure is too noisy to give us a decent clue about whether someone is doing underpowered studies (which flattens the curve) or p-hacking around (which can produce the bunch-up around ~.04).
After all, what other indicators of an individual's hirability would we imagine produce less than 1 in 50 false alarms (whether we are talking about positive indicators or negative indicators)? All of the ones I can think of--eg glowing referee letters, pubs in top journals, reputation as a difficult personality--probably false alarm at rates more like 1 in 3 or 1 in 5. If they produced just 1 in 50 false alarms that would be great.
Sorry if it wasn't clear, but 50 is the number of p values in the sample (not the number of 'researchers'). The selection of plots was semi-random (and there are 15 per plot) so without further work it is 1 in 15 (but you'd need to check what triggers concern etc.). Also - as I noted - I'm almost certainly underestimating the noise. More so if dodgy practice is conditional on not getting good results (i.e., researchers won't manipulate clear, favourable results so they won't be doing it all the time).
I'm also not convinced about underpowering. Most psychology research is underpowered. This results in a massive 'p value filter' which would distort the ideal curve. It also suggests early career researchers will be advantaged by getting lucky and finding large effects (or researching the 'obvious').
Lastly, a false alarm rate needs to be compared to the number of people being tested. So in a job interview situation with 10 people shortlisted (and p-curved) a 1/50 false alarm rate means that at least one candidate triggers a false alarm nearly 20% of the time. That might be acceptable for some criteria but not for something that is extremely prejudicial (i.e. a potential sign of fraud). It would also be illegal in the UK as the false alarm rate would be correlated with age of candidate (and hence indirect discrimination). You may be right about US 'hirability' as the law presumably varies from state to state. Our interview panels aren't (legally) allowed to entertain reputation as a difficult personality (or any other rep!), and we don't use referee letters (except to confirm appointments). Also a cynic might argue that use of 'top' journals is a legitimate discriminator (because research assessment is mainly about the journal and only tangentially the research*).
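The shortlist arithmetic can be checked in a couple of lines. This is a sketch in Python, assuming that false alarms are independent across candidates; the function name `prob_any_false_alarm` is made up for illustration.

```python
def prob_any_false_alarm(rate, n_candidates):
    """Chance that at least one of n independent candidates false alarms."""
    return 1 - (1 - rate) ** n_candidates

# 10 shortlisted candidates at the two rates discussed in this thread.
print(round(prob_any_false_alarm(1 / 50, 10), 3))  # about 0.183
print(round(prob_any_false_alarm(1 / 15, 10), 3))  # about 0.498
```

At the 1 in 15 rate, a shortlist of 10 would contain at least one false alarm about half the time.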
A better idea would be to explore p curves for journals (lots of p values and lots of control conditions). For instance - I edit a mainly non-empirical journal where there is relatively little incentive to manipulate p values (as p values are mainly for examples and publication won't depend on p < .05).
Sorry for grumpy reply - rather late ... nevertheless basing major decisions on noisy and biased data can't be good. Combining lots of noisy unbiased data sources would be better.
Echoing these concerns but in a somewhat different way, whether one in fifteen false alarms is a problem or not depends entirely on the base rate of p hacking within the population.
For example, if one researcher in ten (3/30) is p hacking (and all researchers who p hack show this pattern), and you get a false alarm one time out of fifteen (2/30), then there is only a 60% chance (3/5) that someone with a suspicious pattern of p values is p hacking. Better than chance, but I'm not sure it is enough evidence to surpass a reasonable doubt standard.
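The same base-rate reasoning can be written out with Bayes' rule. This is a sketch in Python; the function name `prob_hacker_given_flag` and the default hit rate of 1 (every p hacker shows the pattern) are assumptions taken from the example. Here the 1 in 15 false alarm rate is applied only to the nine in ten researchers who are not p hacking, which gives a slightly higher figure than the 3/5 shortcut.

```python
def prob_hacker_given_flag(base_rate, false_alarm_rate, hit_rate=1.0):
    """P(p hacking | suspicious p-curve) by Bayes' rule."""
    true_pos = base_rate * hit_rate
    false_pos = (1 - base_rate) * false_alarm_rate
    return true_pos / (true_pos + false_pos)

print(round(prob_hacker_given_flag(0.10, 1 / 15), 3))  # 0.625
```

Lowering the base rate pulls this probability down quickly, which is the point of the example.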
However, extending this logic... it should be possible to get an estimate of the number of apparent phackers that should exist due to chance, and then compare this to the actual distribution within a field... this might be informative of how serious a problem this is for the field as a whole.
Yes - I think it is more promising as a tool for a field or journal where there are sufficient data to overcome noise and maybe covary out any confounding factors.
Great post about p-curves, and very interesting simulation results. Actually, I'm currently learning R, and was wondering if you could share with me the R code you used for your quick simulations to help me with my learning? If so, you can send it at elebel at uwo dot ca
Thanks so much!
Regards, Etienne.
Good idea - I had meant to share this. I'll dig out the R code and post it later today if I have time.
Sorry this took so long:
http://psychologicalstatistics.blogspot.com/2012/03/r-code-for-p-curves.html
I suspect that in the paper about p-curves, the authors will not present p-curves as a method to detect fraud. It would be more accurate to present it as a method to detect unreliable effects (many reported studies with p's between .01 and .05, and hardly any with other p's) that were probably the result of p-hacking.
For instance, it would be very interesting to create a p-curve for replications of automatic social behavior effects that are now at the center of another quarrel about replicability in social psychology (http://blogs.discovermagazine.com/notrocketscience/2012/03/10/failed-replication-bargh-psychology-study-doyen/).
It is very important to understand that it is most likely that psychologists (and other scientists) who engage in p-hacking have no idea that their findings are unreliable and they are definitely not performing intentional fraud.
I didn't see the original talk - and I look forward to finding out more about the method. I also agree that it is generally the stats that are dodgy and not the researchers!
In my view, there may be several problems with the method.
First, while your simulation dealt with a single effect size, it's much more probable that one psychologist (or journal, or whatever) would study several effects, and thus several effect sizes. Imagine that someone does studies on two effects. The first has a large true effect size and would therefore lead to small p values (as is the case in your simulation). The other has a smaller effect size, so its studies are often underpowered. It may be that those studies have just enough power to yield p values near 0.05 fairly often. When you combine the two p-curves, the result may have a high number of p values in the left tail and a second peak near p = 0.05. It may look like this (counts per 0.01 interval): 9,5,3,5,6,4,2,1... That may look very suspicious (6 studies well below 0.05), but may easily be the result of well-performed data analysis.
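The expected shape of such a mixture can be explored with a rough calculation. The sketch below is in Python and approximates each t test by a two-sided z test whose statistic is normal with mean d * sqrt(n / 2) and unit variance. The effect sizes d = 0.8 and d = 0.3, the group size n = 25, and the function names are hypothetical choices for illustration, not values from the comment.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def prob_p_below(alpha, delta):
    """P(two-sided p < alpha) when the z statistic is N(delta, 1)."""
    if alpha <= 0:
        return 0.0
    z = STD.inv_cdf(1 - alpha / 2)
    return (1 - STD.cdf(z - delta)) + STD.cdf(-z - delta)

def expected_bins(delta, edges):
    """Expected proportion of p values falling between consecutive edges."""
    return [prob_p_below(hi, delta) - prob_p_below(lo, delta)
            for lo, hi in zip(edges, edges[1:])]

n = 25
edges = [i / 100 for i in range(7)]  # 0.00, 0.01, ..., 0.06
large = expected_bins(0.8 * math.sqrt(n / 2), edges)  # hypothetical d = 0.8
small = expected_bins(0.3 * math.sqrt(n / 2), edges)  # hypothetical d = 0.3
mixture = [(a + b) / 2 for a, b in zip(large, small)]  # equal share of each
for lo, m in zip(edges, mixture):
    print(f"{lo:.2f}-{lo + 0.01:.2f}: {m:.3f}")
```

Under this approximation the expected proportion falls from one bin to the next for each effect, and therefore for the mixture too, so a second bump near .05 in a small set of p values would more plausibly reflect sampling variation or selection than the mixing of effect sizes alone.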
Second, I wonder what effect publication bias has. Again, a scientist's data analysis may be fine, but if he or she publishes in a field with publication bias (i.e. almost any), the resulting p-curve would look suspicious because studies with p values higher than 0.05 would remain unreported. That would most probably occur if he or she often conducts underpowered studies (which is not good in itself, but that is another story).
Finally, there are situations when results from different analyses are dependent - i.e. variables are correlated. Suppose that someone measures performance by one task and there is a difference between groups - p=0.049. If he or she measures performance by more than one task and those tasks are correlated the other tests would result in p close to 0.049 as well. Therefore, it would be necessary to involve just one p-value from a dependent set of analyses in a p-curve.
Other possible problems were already mentioned (large data sets necessary, possibility of type I error). All in all, I am curious what solutions the authors of the paper will come up with. I am sceptical that p-curves will be very useful, but I may be mistaken; after all, I haven't seen the paper.
The simulation was meant as a back of the envelope check - there are certainly other factors at play (e.g., variability of effects and publication bias).
I'm doubtful that dependencies between p values are that much of an issue. There is simply too much noise - you can see that in the plots. These are all (in effect) identical replications of the same effect (with d = 0.5) but p is all over the place. p values are simply too unstable for correlated effect sizes to have much impact. In fact correlated p values might be a sign of p hacking. You'd probably have to weed out repeated tests of the same hypothesis with essentially the same variables - though depending on the reason for these one could argue that is a mild form of p hacking (something that is sensible in some circumstances but that can easily be abused in others).
I'd like to see more of the procedure and what useful information can be extracted from it.
Well that depends on the field. It's not unusual to see several tasks/measures assessing the same construct on the same subjects in behavioral neuroscience. Those measures are often correlated and these correlations are often not reported. And the correlations may be very high as well.
Nevertheless, I think that the other two problems I mentioned are more serious. It's easy to deal with this particular problem by including only one p value in a p-curve as I wrote before (it limits the usability of p-curves, though).
Btw: I would be grateful if you posted the R code as well (for similar reason as the other commenter).
There are some interesting properties of p-curves (see Cumming, 2008 Psychological Science for details), but as the initial simulation shows, I'm not sure this approach is going to give clear answers because I do not know by what criteria we decide that a p-curve is unhealthy.
I've recently advocated a different approach. Although it has some connections to p-curves (which have a shape defined by experimental power), the idea more directly analyzes the believability of a set of experiments:
Francis, G. (2012). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19, 151-156. DOI 10.3758/s13423-012-0227-9
This approach does not usually work well for comparing experimental findings across different investigations, which (based on the limited discussions of p-curves I've seen) seems to be what people are trying to do. My intuition is that it is going to be difficult to make that work.
Thanks - I'll take a look. I have somewhat revised my opinion of p curves since learning more about the approach:
http://psychologicalstatistics.blogspot.co.uk/2012/03/p-curves-revisited.html
Thom