Friday, June 21, 2013

Why faking data is bad ...

It never occurred to me until today to write a post about why faking data is bad. However, I noticed an interesting exchange on Andrew Gelman's blog (see the comments on this post about Marc Hauser). One commenter argued that it was not clear that Hauser had faked his data (though I don't think that is plausible, given the results of the investigations and Hauser's dismissal from Harvard) and, more interestingly, that any data fraud was not serious because his supposedly fraudulent work has since been replicated. This argument is, in my opinion, deeply flawed.

Andrew Gelman's response was:
To a statistician, the data are substance, not form.
I would generalize that to all of science. We'd certainly be better off thinking of data collection and analysis as integral to doing science, rather than merely a necessary step in publishing papers, getting tenure, or generating outputs for government research assessments.

Replication in this context just means getting the direction of the effect correct. When you fake data, you mess up the scientific record in multiple ways, and a replication doesn't solve this or remove the distortion. For instance, a common problem in meta-analysis is that people republish the same data two or more times (e.g., by writing it up for different journals, publishing interim analyses, or salami slicing). This can be very hard to spot because the data sources are obscured, accidentally or deliberately. The upshot is that any biases or quirks of the data are magnified in the meta-analysis. Publishing fake data is worse than this, because the biases, quirks, effect sizes, and moderator variables are all made up. Even publishing an incorrect effect size could be hugely damaging; in fact, most problems with medical and other applied research concern the size of an effect rather than its presence.
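To make the magnification point concrete, here is a minimal Python sketch (my own illustration, not from the original argument), assuming a simple inverse-variance fixed-effect pooling. The study numbers are invented purely for demonstration: a small set of honest estimates of a modest true effect, plus one fabricated study with an inflated effect that also gets republished as a duplicate.

```python
# Toy illustration: how a fabricated study, republished as a duplicate,
# distorts a simple inverse-variance fixed-effect meta-analysis.
# All numbers here are hypothetical and chosen only for demonstration.
import numpy as np

rng = np.random.default_rng(42)

def pool(effects, ses):
    """Inverse-variance fixed-effect pooled estimate and its standard error."""
    w = 1.0 / np.asarray(ses) ** 2
    est = np.sum(w * np.asarray(effects)) / np.sum(w)
    return est, np.sqrt(1.0 / np.sum(w))

# Hypothetical honest literature: true effect d = 0.2, eight noisy estimates.
true_d = 0.2
ses = np.full(8, 0.15)
effects = rng.normal(true_d, ses)

print("honest pooled estimate, SE:", pool(effects, ses))

# Same literature plus one fabricated study reporting an inflated effect,
# counted twice because it was republished under a different guise.
fake_d, fake_se = 0.8, 0.10
effects_fake = np.append(effects, [fake_d, fake_d])
ses_fake = np.append(ses, [fake_se, fake_se])

print("with fake + duplicate:     ", pool(effects_fake, ses_fake))
# The pooled effect is pulled towards the fabricated value, and the pooled
# standard error shrinks - so the distorted answer looks *more* precise.
```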

Furthermore, the replication defense (flawed as it is) has additional problems. One is that the replication probably isn't independent of the fake result. It is hard to publish failed replications, and researchers will be more lenient in the criteria they use to decide that they have replicated an established effect (e.g., using a one-tailed test, or re-running a failed replication on the assumption that they were unlucky). The most obvious problem is that you can't be sure in advance that the effect is real unless you run the experiment in the first place. I have run several experiments that have failed to show an effect or have gone in the opposite direction to what I expected.
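A toy simulation makes the leniency problem concrete. This is my own sketch, not anything from the exchange: it assumes a true effect of zero, a one-tailed test in the "expected" direction, and one quiet re-run whenever the first attempt fails. Under those lenient criteria the chance of declaring a non-existent effect "replicated" roughly doubles relative to the nominal 5% level.

```python
# Toy simulation: lenient replication criteria (one-tailed test plus one
# re-run of a failed attempt) inflate the rate of spurious "replications"
# when the true effect is zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, alpha = 30, 20_000, 0.05

def significant_one_tailed(x, alpha=alpha):
    """One-sample t-test against zero, one-tailed in the 'expected' direction."""
    return stats.ttest_1samp(x, 0.0, alternative="greater").pvalue < alpha

hits = 0
for _ in range(reps):
    # True effect is zero, so any declared "replication" is a false positive.
    if significant_one_tailed(rng.normal(0.0, 1.0, n)):
        hits += 1
    elif significant_one_tailed(rng.normal(0.0, 1.0, n)):  # second attempt
        hits += 1

print(f"false 'replication' rate: {hits / reps:.3f}")  # about 0.10, not 0.05
```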

Faking data is a bad idea. Even if you are remarkably insightful (and Hauser was undoubtedly clever), the real data are a necessary part of the scientific process. Making up data distorts the scientific record.


