It never occurred to me until today to write a post about why faking data is bad. However, I noticed an interesting exchange on Andrew Gelman's blog (see the comments on this post about Marc Hauser). One commenter argued that it was not clear that Hauser had faked his data (though I don't find that plausible given the results of the investigations and Hauser's dismissal from Harvard), and - more interestingly - that any data fraud was not serious because his supposedly fraudulent work has since been replicated. This argument is, in my opinion, deeply flawed.

Andrew Gelman's response was:
To a statistician, the data are substance, not form.
I would generalize that to all of science. We'd certainly be better off thinking of data collection and analysis as integral to doing science rather than merely a necessary step in publishing papers, getting tenure or generating outputs for government research assessments.

Replication in this context just means getting the direction of the effect correct. When you fake data you mess up the scientific record in multiple ways, and a replication doesn't solve this or remove the distortion. For instance, a common problem in meta-analysis is that people republish the same data two or more times (e.g., by writing it up for different journals, publishing interim analyses or salami slicing). This can be very hard to spot because data sources get obscured, whether accidentally or deliberately. The upshot is that any biases or quirks of the data are magnified in the meta-analysis. Publishing fake data is worse than this because the biases, quirks, effect sizes and moderator variables are made up. Even publishing an incorrect effect size could be hugely damaging; in fact, most problems with medical and other applied research relate to the size of an effect rather than its presence.

Furthermore, the replication defense (albeit flawed in practice) has additional problems. One is that the replication probably isn't independent of the fake result. It is hard to publish failed replications, and researchers will be more lenient in the criteria they use to decide that they have replicated an established effect (e.g., using a one-tailed test or re-running a failed replication on the assumption that they were unlucky). The most obvious problem is that you can't be sure in advance that the effect is real unless you run the experiment in the first place. I have run several experiments that failed to show an effect or went in the opposite direction from what I expected.

Faking data is a bad idea - even if you are remarkably insightful (and undoubtedly Hauser was clever) - because the real data are a necessary part of the scientific process. Making up data distorts the scientific record.




I have been meaning to write a paper about MANOVA (and in particular why it should be avoided) for some time, but never got round to it. However, I recently discovered an excellent article by Francis Huang that pretty much sums up most of what I'd cover. In this blog post I'll just run through the main issues and refer you to Francis' paper for a more in-depth critique, or to the section on MANOVA in Serious Stats (Baguley, 2012).

I wrote a brief introduction to logistic regression aimed at psychology students. You can take a look at the pdf here:  

A more comprehensive introduction in terms of the generalised linear model can be found in my book:

Baguley, T. (2012). Serious stats: a guide to advanced statistics for the behavioral sciences. Palgrave Macmillan.
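To give a flavour of the approach, here is a minimal sketch of fitting a logistic regression as a generalised linear model in R. The simulated data and the variable names (pass, study_hours) are purely illustrative and are not taken from the PDF or the book.

```r
# Hypothetical example: probability of passing as a function of study hours
set.seed(42)
dat <- data.frame(study_hours = runif(100, 0, 10))
dat$pass <- rbinom(100, 1, plogis(-2 + 0.5 * dat$study_hours))

# Logistic regression is a GLM with a binomial family (logit link by default)
fit <- glm(pass ~ study_hours, data = dat, family = binomial)
summary(fit)       # coefficients are on the log-odds (logit) scale
exp(coef(fit))     # exponentiate slopes to interpret them as odds ratios
```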

I wrote a short blog post (with R code) on how to calculate corrected CIs for rho and tau using the Fisher z transformation.
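As a rough illustration of the idea, here is a minimal sketch of such an interval. The adjusted standard errors used here (sqrt(1.06/(n - 3)) for Spearman's rho and sqrt(0.437/(n - 4)) for Kendall's tau, following Fieller and colleagues) are my assumption about the correction involved; see the post and its R code for the actual details.

```r
# Approximate CI for a rank correlation via the Fisher z transformation,
# using adjusted standard errors (an assumption, not the post's own code)
ci_rank_cor <- function(r, n, type = c("rho", "tau"), conf = 0.95) {
  type <- match.arg(type)
  z <- atanh(r)                               # Fisher z transformation
  se <- switch(type,
               rho = sqrt(1.06 / (n - 3)),    # adjusted SE for Spearman's rho
               tau = sqrt(0.437 / (n - 4)))   # adjusted SE for Kendall's tau
  crit <- qnorm(1 - (1 - conf) / 2)
  tanh(c(lower = z - crit * se, upper = z + crit * se))  # back-transform
}

ci_rank_cor(0.45, n = 30, type = "rho")
```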

I have written a short article on Type II versus Type III SS in ANOVA-like models on my Serious Stats blog:

https://seriousstats.wordpress.com/2020/05/13/type-ii-and-type-iii-sums-of-squares-what-should-i-choose/
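For readers who want to experiment, a minimal sketch using the car package is below. The built-in warpbreaks data set is just a convenient example; the post itself discusses when each type of test is appropriate.

```r
library(car)

# Sum-to-zero contrasts so that Type III tests are interpretable
op <- options(contrasts = c("contr.sum", "contr.poly"))

mod <- lm(breaks ~ wool * tension, data = warpbreaks)
Anova(mod, type = 2)  # Type II: each main effect adjusted for the other, not the interaction
Anova(mod, type = 3)  # Type III: each term adjusted for all others, including the interaction

options(op)  # restore the previous contrast settings
```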

I have just published a short blog on the Egon Pearson correction for the chi-square test. This includes links to an R function to run the corrected test (and also provides residual analyses for contingency tables).

The blog is here and the R function here.
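The correction itself is simple: the ordinary Pearson X² statistic is multiplied by (N - 1)/N. Below is a minimal sketch of that calculation; it is my own illustration rather than the linked R function, which also provides residual analyses for contingency tables.

```r
# Egon Pearson ('N - 1') corrected chi-square test for a contingency table
pearson_n1_test <- function(tab) {
  res <- chisq.test(tab, correct = FALSE)     # uncorrected Pearson X^2
  n <- sum(tab)
  df <- unname(res$parameter)
  x2 <- unname(res$statistic) * (n - 1) / n   # multiply by (N - 1)/N
  p <- pchisq(x2, df = df, lower.tail = FALSE)
  list(X2 = x2, df = df, p.value = p)
}

# Hypothetical 2 x 2 table
pearson_n1_test(matrix(c(12, 5, 7, 15), nrow = 2))
```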

Bayesian Data Analysis in the Social Sciences Curriculum

Supported by the ESRC’s Advanced Training Initiative

Venue: Bowden Room, Nottingham Conference Centre

Burton Street, Nottingham, NG1 4BU

Booking information online

Provisional schedule:

Organizers:

Thom Baguley (Twitter: @seriousstats)

Mark Andrews (Twitter: @xmjandrews)

The third and (possibly) final round of our introductory workshops was overbooked in April, but we have managed to arrange some additional dates in June.

There are still places left on these. More details at: http://www.priorexposure.org.uk/

As with the last round we are planning a free R workshop beforehand (recommended if you need a refresher or have never used R before).

On my Serious Stats blog I have a new post on providing CIs for the difference between independent R squared coefficients.

You can find the post there or go directly to the function hosted on RPubs. I have been experimenting with knitr but can't yet get the HTML from R Markdown to work with my Blogger or WordPress blogs.
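For illustration only, here is a minimal sketch of one way such an interval could be obtained: a percentile bootstrap of the difference in R squared between two independent samples. The function on RPubs may well use a different, analytic approach, and the data below are simulated purely as an example.

```r
# Percentile bootstrap CI for the difference in R^2 between two
# independent samples fitted with the same model formula (illustrative only)
boot_r2_diff <- function(formula, data1, data2, B = 2000, conf = 0.95) {
  r2 <- function(d) summary(lm(formula, data = d))$r.squared
  diffs <- replicate(B, {
    b1 <- data1[sample(nrow(data1), replace = TRUE), , drop = FALSE]
    b2 <- data2[sample(nrow(data2), replace = TRUE), , drop = FALSE]
    r2(b1) - r2(b2)
  })
  quantile(diffs, probs = c((1 - conf) / 2, 1 - (1 - conf) / 2))
}

# Hypothetical example: the same model fitted in two independent samples
set.seed(1)
d1 <- data.frame(x = rnorm(80)); d1$y <- 0.6 * d1$x + rnorm(80)
d2 <- data.frame(x = rnorm(90)); d2$y <- 0.3 * d2$x + rnorm(90)
boot_r2_diff(y ~ x, d1, d2)
```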