Thursday, February 16, 2012

Simulating p curves and detecting dodgy stats

Psych your mind has an interesting blog post on using p curves to detect dodgy stats in a volume of published work (e.g., for a researcher or journal). The idea apparently comes from Uri Simonsohn (one of the authors of a recent paper on dodgy stats). The author (Michael W. Kraus) bravely plotted and published his own p curve - which looks reasonably 'healthy'. However, he makes an interesting point: we don't know how useful these curves are in practice, which depends among other things on the variability inherent in the profile of p values.

I quickly threw together a simulation to address this in R. It is pretty limited (as I don't have much time right now), but potentially interesting. It simulates independent t test p values where the samples are drawn from independent, normal distributions with equal variances but different means (and n = 25 per group). The population standardized effect size is fixed at d = 0.5 (as psychology research generally reports median effect sizes around this value). Fixing the parameters is unrealistic, but is perhaps OK for a quick simulation.
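A minimal sketch of a simulation along these lines is given here in Python rather than R, using only the standard library. It is not the original code. The function names, the bin width of .01, the upper limit of .06 and the decision to keep every simulated p value (rather than only 'published' ones) are all assumptions made for illustration. The two-sided p value is obtained by numerically integrating the t density, which is adequate for binning purposes.

```python
import math
import random
import statistics

def t_two_sided_p(t, df, steps=400):
    """Two-sided p value for Student's t, via Simpson's rule on the density."""
    t = abs(t)
    if t == 0.0:
        return 1.0
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(u * u / df))

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # area under the density between 0 and |t|
    return max(0.0, 1.0 - 2.0 * central)

def pooled_t_test_p(x, y):
    """Independent t test (equal variances assumed), two-sided p value."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    t = (statistics.fmean(x) - statistics.fmean(y)) / se
    return t_two_sided_p(t, nx + ny - 2)

def simulate_p_values(k, n=25, d=0.5, rng=None):
    """Simulate k t tests: normal samples, SD = 1, means d apart, n per group."""
    rng = rng or random.Random()
    p_values = []
    for _ in range(k):
        x = [rng.gauss(d, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        p_values.append(pooled_t_test_p(x, y))
    return p_values

def p_curve(p_values, width=0.01, upper=0.06):
    """Count p values in bins [0, .01), [.01, .02), ... below `upper`."""
    n_bins = round(upper / width)
    counts = [0] * n_bins
    for p in p_values:
        i = int(p / width)
        if i < n_bins:
            counts[i] += 1
    return counts

if __name__ == "__main__":
    rng = random.Random(2012)  # arbitrary seed, for reproducibility only
    for run in range(15):
        print(run + 1, p_curve(simulate_p_values(50, rng=rng)))
```

Each printed row is one simulated 'researcher' with 50 p values. Changing 50 to 100 or 500 gives the larger profiles.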

I ran this several times and plotted p curves (really just histograms with bins collecting p values at relevant intervals). First I plotted for an early career researcher with just a few publications reporting 50 p values. I then repeated for more experienced researchers with 100 or 500 published p values.

Here are the 15 random plots for 50 p values:


At least one of the plots has a suspicious spike between p = .04 and .05 (exactly where dodgy practices would tend to push the p values).

What about 100 p values?


Here the plots are still variable (but closer to the theoretical ideal plotted on Kraus' blog).

You can see this pattern even more clearly with 500 p values:


Some quick conclusions ... The method is too unreliable for use with early career researchers. You need a few hundred p values to be pretty confident of a nice flat pattern between p = .01 and p = .06. Varying the effect size and other parameters might well inject further noise (as would adding in null effects which have a uniform distribution of p values and are thus probably rather noisy).
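The dependence on the number of p values can be made roughly concrete. If the p values are independent, the count landing in any one bin is binomial, so its relative variability shrinks with the square root of the number of p values. The sketch below is an illustration under stated assumptions, not part of the original simulation. It treats the test statistic as normal with mean d * sqrt(n / 2) and SD 1 (an approximation to the t test with d = 0.5 and n = 25 per group), and the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def bin_probability(p_lo, p_hi, d=0.5, n=25):
    """Approximate P(p_lo <= p < p_hi) for a two-sided test (needs p_lo > 0).

    Assumption: the test statistic is normal with mean d * sqrt(n / 2), SD 1.
    """
    shift = d * math.sqrt(n / 2)
    z_far = STD_NORMAL.inv_cdf(1 - p_lo / 2)   # |z| that gives p = p_lo
    z_near = STD_NORMAL.inv_cdf(1 - p_hi / 2)  # |z| that gives p = p_hi
    stat = NormalDist(mu=shift, sigma=1.0)
    upper_tail = stat.cdf(z_far) - stat.cdf(z_near)
    lower_tail = stat.cdf(-z_near) - stat.cdf(-z_far)
    return upper_tail + lower_tail

def bin_count_summary(k, q):
    """Expected count and SD of a Binomial(k, q) bin count."""
    return k * q, math.sqrt(k * q * (1 - q))

if __name__ == "__main__":
    q = bin_probability(0.04, 0.05)
    for k in (50, 100, 500):
        expected, sd = bin_count_summary(k, q)
        print(k, round(expected, 1), round(sd, 1), round(sd / expected, 2))
```

The last printed column (SD divided by expected count) is the relative noise in the .04 to .05 bin. It shows how much a single bin can bounce around by chance when only 50 p values are available.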

I'm also skeptical that this is useful for detecting fraud (as presumably deliberate fraud will tend to go for 'impressive' p values such as p < .0001). Also (going forward) fraudsters will be able to generate results to circumvent tools such as p curves (if they are known to be in use).




Tuesday, February 14, 2012

On nonparametric statistics ...

I'm not a big fan of the term "nonparametric statistics", or at least how it is used in psychology and related fields (e.g., education and health research). This is one reason why I don't make a big deal of the parametric/non-parametric distinction in my Serious stats book and probably partly why a recent article in APS Observer annoyed me so much.


Why did it annoy me so much? It says, in essence, what many standard psychology textbooks say (and also makes several good points - which I shall largely ignore for rhetorical purposes). I like in particular the points about statistical tests used to check statistical assumptions lacking power and making statistical assumptions of their own (which typically go unchecked and to which the tests are not robust). Indeed, that's another thing that annoys me and that I have blogged/bored people about in the past (e.g., see here). I also think the article might have been chopped about a bit by editors (as it contains at least one important reference not cited and some of my gripes are also contained in references cited by the article I'm annoyed with). So, on balance, it is (by psychology mainstream standards) not at all a bad piece. However, by buying into the standard nonparametric stats presentation it perpetuates a few myths or errors and may inadvertently give some pretty poor advice mixed in with the good.


Here is a quick list of my criticisms:

(1) The definition of nonparametric statistics is deeply confused. The author starts by writing: "Nonparametric statistical analyses are used to investigate research questions in which the dependent variable is ranked or categorical rather than quantified in a true numeric sense". This seems to suggest that non-parametric procedures are defined by having a discrete or bounded DV. Later it adds that "Traditional parametric statistics require a number of assumptions about the characteristics (i.e., parameters) of the data." This is an appeal to the idea that parametric statistics assume a particular probability distribution (the parameters of which are estimated from the data). This seems like a better definition to me, although like many psychologists, the author appears to assume that the probability distribution assumed is always a normal distribution. Mixing the two aspects of the definition is confusing. It is easy to find statistical procedures that are parametric in the second sense, but involve ranked or categorical DVs. I would argue that a chi-square test of independence or a sign test is parametric by the second definition. Not only is the definition problematic, but it could lead to poor analytic decisions. For instance, dichotomous outcomes are often best analyzed using parametric techniques such as logistic regression (a generalized linear model with a logit link function and a binomial random component) rather than the methods surveyed in the article.
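To illustrate the sign test point: the exact sign test gets its p value from a binomial distribution with the probability parameter fixed at 0.5 under the null hypothesis, which is why it can be regarded as parametric in the second sense. A minimal sketch follows. The function name is hypothetical, and ties are assumed to have been dropped beforehand.

```python
from math import comb

def sign_test_p(n_positive, n_negative):
    """Exact two-sided sign test.

    Under H0 the number of positive differences is Binomial(n, 0.5),
    so the p value is a sum of binomial probabilities.
    """
    n = n_positive + n_negative
    k = min(n_positive, n_negative)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

if __name__ == "__main__":
    # Hypothetical data: 8 positive and 2 negative differences.
    print(sign_test_p(8, 2))
```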

(2) The article also reads as if nonparametric tests are the same thing as rank randomization tests. Rank randomization tests are examples of nonparametric tests in the sense they are distribution free (making no assumption about a particular probability distribution for the data). However, there are many other nonparametric methods that don't involve ranks. Rank randomization tests are useful but limited in scope. In addition, if the raw data are in the form of ranks then a rank transformation is pretty pointless (you might as well just run a regular test).

The main limitation of the rank approaches is that the rank transformation is irreversible; it destroys the link between the analysis and the raw scores. This is a serious problem if you care about the raw scores - perhaps because you want to test whether effects are non-additive or because you want to get an interval estimate or an effect size on the original scale. There are ways to help you do this with rank transformation procedures, but they are generally pretty fiddly. There are good reasons why people don't tend to use rank transformations in multiple regression, use them to test interaction effects or use them to construct confidence intervals.

(3) Rank randomization (and other nonparametric) procedures are bound by pesky assumptions! They tend to make weaker assumptions than parametric procedures, but all statistical procedures make assumptions. The assertion that the "main assumptions of nonparametric tests are that the dependent variable should be continuous and have independent random sampling, which means that nonparametric statistics do not require assumptions of homogeneity of variance and normality" is misleading. The precise assumptions vary from test to test and with the hypothesis being tested. As a rule, rank randomization and related rank transformation tests assume that samples with similar shapes of distribution are being compared. They can therefore be undermined by heterogeneity of variance or varying degrees of skew. If the distributions have very dissimilar shapes then the tests can sometimes behave very strangely (e.g., it is possible to get outcomes where A > B, B > C and yet C > A). Continuity is also not necessarily a requirement of the DV or of the underlying construct being measured (though it may be for some hypotheses).
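The strange A > B, B > C, C > A behaviour is easy to reproduce with three discrete distributions of different shapes. The sketch below uses hypothetical values (a variant of the well-known non-transitive dice) and computes the probability that a random score from one distribution exceeds a random score from another, which is the quantity that Mann-Whitney-type tests respond to.

```python
from fractions import Fraction
from itertools import product

# Hypothetical, equally likely scores for three populations.
A = [2, 2, 4, 4, 9, 9]
B = [1, 1, 6, 6, 8, 8]
C = [3, 3, 5, 5, 7, 7]

def prob_superiority(x, y):
    """P(X > Y) for independent draws from x and y (no ties in this example)."""
    wins = sum(1 for a, b in product(x, y) if a > b)
    return Fraction(wins, len(x) * len(y))

if __name__ == "__main__":
    print("P(A > B) =", prob_superiority(A, B))
    print("P(B > C) =", prob_superiority(B, C))
    print("P(C > A) =", prob_superiority(C, A))
```

Each probability exceeds one half, so every pairwise comparison favours the first-named distribution even though the three comparisons form a cycle.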

(4) Many generalizations are made that simply don't hold up: "When the data violates the assumptions of a parametric test, nonparametric tests are again the more powerful analytic technique (Siegel, 1957)." This doesn't follow. Often this is true, but not always. There has been a lot of research on this topic since 1957 and parametric tests are not always inferior. This is particularly true if you don't restrict parametric techniques to t tests, linear regression and ANOVA. Some nonparametric techniques are known to have very poor power. I also got irritated by the critique of log and similar transformations that states: "while these transformations can make variables more normally distributed, they can also diminish or alter experimental effects, which can reduce power." To the extent this is true, it is also true of the rank transformation (more so in some cases). Furthermore, the real problem is with arbitrary transformations. Log transformations - where appropriate - tend to aid interpretation of effects (e.g., by quantifying them as proportionate rather than additive effects). I also dislike the implication that "experimental effects" exist in a pristine form prior to transformation. This is simply not the case - how to quantify and scale measurement of an effect is a tricky business (e.g., memory researchers can use percentage correct, hits minus false alarms, A prime, d prime etc.) and many effects come "pre-transformed" (e.g., measurement of loudness in decibels). Transformations (including the rank transformation) are useful tools that can increase power and aid interpretation if used carefully and appropriately.
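The remark about log transformations and proportionate effects can be shown with a tiny worked example. A difference in means on the log scale, once exponentiated, is the ratio of the geometric means on the original scale. The data below are hypothetical (scores in one group are exactly 1.5 times those in the other) and the function name is an assumption for illustration.

```python
import math
import statistics

# Hypothetical scores; every value in group_b is 1.5 times its group_a partner.
group_a = [200.0, 400.0, 800.0]
group_b = [300.0, 600.0, 1200.0]

def log_scale_effect(x, y):
    """Return (difference in mean log scores, the same effect as a ratio)."""
    diff = (statistics.fmean([math.log(v) for v in y])
            - statistics.fmean([math.log(v) for v in x]))
    return diff, math.exp(diff)

if __name__ == "__main__":
    raw_diff = statistics.fmean(group_b) - statistics.fmean(group_a)
    log_diff, ratio = log_scale_effect(group_a, group_b)
    print("additive effect on the raw scale:", raw_diff)
    print("effect on the log scale:", log_diff)
    print("as a proportionate effect (ratio):", ratio)
```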

(5) The use of parametric statistics to analyze Likert-style rating scales may be one of the "seven deadly sins of statistical analysis", but it is rarely a big problem in practice. Least squares methods are most messed up by heavy tailed distributions, severe skew or outliers. If anything, Likert-style rating scales tend not to have these problems (or to manifest them relatively mildly). Furthermore, where there are problems with Likert-style measures, rank randomization or transformation tests are probably not the solution. A number of parametric procedures for ordinal outcomes exist - notably ordered logistic regression (though least squares methods such as t tests, ANOVA or regression should work well when their assumptions are not badly violated).

(6) The variety of nonparametric tests referred to is slightly artificial. As a general rule there are advantages to sticking to standard 'parametric' tests such as the Welch-Satterthwaite t test or one-way ANOVA rather than using named rank-transformation tests such as the Mann-Whitney U test. In some cases there may be advantages with specialized rank randomization tests where sample sizes are small (e.g., because software such as R implements exact versions). However, there are a few cases where the rank randomization tests are not robust (e.g., the Mann-Whitney U test is not robust to heterogeneity of variance) or lack power (e.g., the Friedman test, Page's L test and most multiple comparison procedures available for ranks). Rank transforming the data and then running a t test with Welch-Satterthwaite correction is superior to running the Mann-Whitney directly (Zimmerman & Zumbo, 1993b). For more on the low power of the Friedman test and better alternatives see here.
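The rank-then-Welch approach can be sketched in a few lines: pool the two samples, replace the scores with midranks (ties share the average rank), then compute the Welch t statistic and Welch-Satterthwaite degrees of freedom on the ranks. This is an illustrative sketch with hypothetical function names, not code from the cited paper. The p value would then come from a t distribution with the (usually non-integer) df returned.

```python
import math
import statistics

def midranks(values):
    """Ranks from 1..N, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def welch_t(x, y):
    """Welch t statistic and Welch-Satterthwaite df (no equal-variance assumption)."""
    nx, ny = len(x), len(y)
    vx = statistics.variance(x) / nx
    vy = statistics.variance(y) / ny
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

def rank_welch(x, y):
    """Rank the pooled scores, then apply the Welch t test to the ranks."""
    ranks = midranks(list(x) + list(y))
    return welch_t(ranks[:len(x)], ranks[len(x):])

if __name__ == "__main__":
    # Hypothetical scores with unequal spread in the two groups.
    print(rank_welch([3, 5, 6, 8, 9], [1, 2, 2, 30, 45, 60]))
```

The sketch does not guard against the degenerate case where the ranks within both groups have zero variance.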

Rather than think in terms of nonparametric statistics, it is better to focus on checking assumptions (using graphical methods and simple descriptives) and checking our models against more robust procedures. If more robust methods show different results, the next step is to find out why (and definitely not report just the outcome you prefer). This should lead you to a superior model (using robust methods or perhaps a more appropriate parametric model). The consideration of robust methods is particularly important. This includes some rank transformation tests, but also includes robust regression, bootstrapping and other tools (e.g., see Wilcox & Keselman, 2003; 2004; Baguley, 2012).
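As one hedged illustration of checking a result against a more robust procedure, the sketch below computes a percentile bootstrap interval for the difference between two 20% trimmed means. The choice of trimmed means, the 20% trimming, the percentile method and the function names are assumptions made for this example rather than a specific recommendation from the references.

```python
import random
import statistics

def trimmed_mean(values, proportion=0.2):
    """Mean after dropping the given proportion of scores from each tail."""
    xs = sorted(values)
    g = int(proportion * len(xs))  # number of scores trimmed per tail
    return statistics.fmean(xs[g:len(xs) - g])

def bootstrap_ci_diff(x, y, stat=trimmed_mean, reps=2000, level=0.95, rng=None):
    """Percentile bootstrap interval for stat(x) - stat(y)."""
    rng = rng or random.Random()
    diffs = []
    for _ in range(reps):
        bx = rng.choices(x, k=len(x))  # resample each group with replacement
        by = rng.choices(y, k=len(y))
        diffs.append(stat(bx) - stat(by))
    diffs.sort()
    lower = diffs[int((1 - level) / 2 * reps)]
    upper = diffs[int((1 + level) / 2 * reps) - 1]
    return lower, upper

if __name__ == "__main__":
    # Hypothetical data with one wild score in the second group.
    x = [12, 14, 15, 15, 16, 17, 18, 19, 21, 22]
    y = [9, 10, 11, 12, 12, 13, 14, 15, 16, 80]
    print("difference in means:", statistics.fmean(x) - statistics.fmean(y))
    print("difference in trimmed means:", trimmed_mean(x) - trimmed_mean(y))
    print("bootstrap interval:", bootstrap_ci_diff(x, y, rng=random.Random(1)))
```

If the ordinary and robust summaries disagree (as they do here because of the single extreme score), that is the cue to find out why before settling on a model.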


References


Baguley, T. (2012, in press). Serious stats: A guide to advanced statistics for the behavioral sciences. Basingstoke: Palgrave.


Wilcox, R. R., & Keselman, H. J. (2003). Modern robust data analysis methods: Measures of central tendency. Psychological Methods, 8, 254-274.

Wilcox, R. R., & Keselman, H. J. (2004). Robust regression methods: Achieving small standard errors when there is heteroscedasticity. Understanding Statistics, 3, 349-364.


Zimmerman, D. W., & Zumbo, B. D. (1993b). Rank transformations and the power of the Student t test and Welch t' test for non-normal populations with unequal variances. Canadian Journal of Experimental Psychology, 47, 523-539.


Sunday, February 05, 2012

Comparing correlations update

I have just published R code for calculating CIs for differences between correlations on the Serious stats book blog. This covers independent correlations (taken from chapter 6 of the book) and dependent correlations (new R code written as a supplement to chapter 6).

UPDATE on the update ...

I have also added an Excel spreadsheet that should match the R output (though the latter is probably more accurate and reliable).

Thursday, February 02, 2012

Serious Stats book and blog update

This is a quick update to announce my new blog Serious Stats. This is a companion to my forthcoming book of the same name:


Baguley, T. (2012, in press). Serious stats: A guide to advanced statistics for the behavioral sciences. Basingstoke: Palgrave.


I think it is better to separate book-specific content out from my regular posts (though in some cases this will be a bit fuzzy). I will also try and post short updates here when something relevant gets published on the blog for the book.





Wednesday, February 01, 2012

More on "A problem of significance"

A longer version of my earlier post A problem of significance just appeared in The Psychologist.


Baguley, T. (2012). Can we be confident in our statistics? The Psychologist, 25, 128-9.


Shortly after publication I received an email asking about statistical analysis of differences in correlations. This is more tricky than you might think. I'm working on some R code to implement one of the better approaches and plan to blog on this shortly ...


(See update here.)
