Tuesday, February 14, 2012

On nonparametric statistics ...

I'm not a big fan of the term "nonparametric statistics", or at least of how it is used in psychology and related fields (e.g., education and health research). This is one reason why I don't make a big deal of the parametric/nonparametric distinction in my Serious stats book, and probably partly why a recent article in the APS Observer annoyed me so much.


Why did it annoy me so much? It says, in essence, what many standard psychology textbooks say (and also makes several good points, which I shall largely ignore for rhetorical purposes). I particularly like the points about statistical tests used to check statistical assumptions: they lack power and make statistical assumptions of their own (which typically go unchecked and to which they are not robust). Indeed, that's another thing that annoys me and that I have blogged/bored people about in the past (e.g., see here). I also think the article might have been chopped about a bit by editors (it contains at least one important reference that is not cited, and some of my gripes are also raised in references the article does cite). So, on balance, it is (by mainstream psychology standards) not at all a bad piece. However, by buying into the standard presentation of nonparametric stats it perpetuates a few myths or errors and may inadvertently give some pretty poor advice mixed in with the good.


Here is a quick list of my criticisms:

(1) The definition of nonparametric statistics is deeply confused. The author starts by writing: "Nonparametric statistical analyses are used to investigate research questions in which the dependent variable is ranked or categorical rather than quantified in a true numeric sense". This seems to suggest that nonparametric procedures are defined by having a discrete or bounded DV. Later the article adds that "Traditional parametric statistics require a number of assumptions about the characteristics (i.e., parameters) of the data." This is an appeal to the idea that parametric statistics assume a particular probability distribution (the parameters of which are estimated from the data). The second definition seems better to me, although, like many psychologists, the author appears to assume that the probability distribution in question is always a normal distribution. Mixing the two aspects of the definition is confusing: it is easy to find statistical procedures that are parametric in the second sense but involve ranked or categorical DVs. I would argue that a chi-square test of independence or a sign test is parametric by the second definition. Not only is the definition problematic, but it could lead to poor analytic decisions. For instance, dichotomous outcomes are often best analyzed using parametric techniques such as logistic regression (a generalized linear model with a logit link function and a binomial random component) rather than the methods surveyed in the article.
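The sign test illustrates this nicely: under the null hypothesis the number of positive differences follows a fully specified binomial distribution with p = 0.5, which is about as parametric (in the second sense) as it gets. A minimal sketch in Python (the function name and the example counts are mine):

```python
from math import comb

def sign_test_p(n_pos, n_neg):
    """Two-sided exact sign test. Under H0 each non-tied difference is
    positive with probability 0.5, so the count of positives follows
    Binomial(n, 0.5) -- a fully specified parametric model."""
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    # P(X <= k) under Binomial(n, 0.5), doubled for two sides, capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 9 paired differences: 8 positive, 1 negative
p = sign_test_p(8, 1)  # 20/512, approximately .039
```

The "distribution-free" part of such tests refers to the raw scores, not to the test statistic, whose null distribution is completely specified.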

(2) The article also reads as if nonparametric tests are the same thing as rank randomization tests. Rank randomization tests are examples of nonparametric tests in the sense that they are distribution free (making no assumption about a particular probability distribution for the data). However, there are many other nonparametric methods that don't involve ranks. Rank randomization tests are useful but limited in scope. In addition, if the raw data are in the form of ranks then a rank transformation is pretty pointless (you might as well just run a regular test).
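For readers who haven't seen it spelled out, the rank transformation itself is trivial: replace each score with its position in the sorted sample, averaging over ties (midranks). A stdlib-only sketch (the function name is mine):

```python
def midranks(xs):
    """Assign ranks 1..n to the values in xs, averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        # find the run of tied values starting at sorted position i
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

midranks([7, 3, 3, 10])  # -> [3.0, 1.5, 1.5, 4.0]
```

Note that ranking already-ranked data returns the same ranks, which is why rank-transforming data that arrive as ranks achieves nothing.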

The main limitation of the rank approaches is that the rank transformation is irreversible; it destroys the link between the analysis and the raw scores. This is a serious problem if you care about the raw scores - perhaps because you want to test whether effects are non-additive, or because you want to get an interval estimate or an effect size on the original scale. There are ways to do this with rank transformation procedures, but they are generally pretty fiddly. There are good reasons why people don't tend to use rank transformations in multiple regression, use them to test interaction effects or use them to construct confidence intervals.

(3) Rank randomization (and other nonparametric) procedures are bound by pesky assumptions! They tend to make weaker assumptions than parametric procedures, but all statistical procedures make assumptions. The assertion that the "main assumptions of nonparametric tests are that the dependent variable should be continuous and have independent random sampling, which means that nonparametric statistics do not require assumptions of homogeneity of variance and normality" is misleading. The precise assumptions vary from test to test and with the hypothesis being tested. As a rule, rank randomization and related rank transformation tests assume that the samples being compared have similarly shaped distributions. They can therefore be undermined by heterogeneity of variance or varying degrees of skew. If the distributions have very dissimilar shapes then the tests can sometimes behave very strangely (e.g., it is possible to get outcomes where A > B, B > C and yet C > A). Continuity is also not necessarily a requirement of the DV or of the underlying construct being measured (though it may be for some hypotheses).
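The A > B, B > C, yet C > A pathology is easy to demonstrate with stochastic dominance. Here is a toy illustration using three discrete distributions (the values are mine, borrowed from the classic intransitive-dice construction; each has mean 5, yet every pairwise comparison favors the "earlier" one with probability 5/9):

```python
from itertools import product

def p_greater(a, b):
    """P(A > B) for independent draws from two equally likely value lists."""
    wins = sum(x > y for x, y in product(a, b))
    return wins / (len(a) * len(b))

# Three distributions with identical means but different shapes
A = [2, 4, 9]
B = [1, 6, 8]
C = [3, 5, 7]

p_greater(A, B), p_greater(B, C), p_greater(C, A)  # each 5/9
```

A test that targets P(X > Y) (as rank-based tests effectively do when shapes differ) can therefore declare A "bigger" than B, B "bigger" than C, and C "bigger" than A, all at once.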

(4) Many generalizations are made that simply don't hold up: "When the data violates the assumptions of a parametric test, nonparametric tests are again the more powerful analytic technique (Siegel, 1957)." This doesn't follow. It is often true, but not always. There has been a lot of research on this topic since 1957 and parametric tests are not always inferior. This is particularly true if you don't restrict parametric techniques to t tests, linear regression and ANOVA. Some nonparametric techniques are known to have very poor power. I also got irritated by the critique of log and similar transformations, which states: "while these transformations can make variables more normally distributed, they can also diminish or alter experimental effects, which can reduce power." To the extent this is true, it is also true of the rank transformation (more so in some cases). Furthermore, the real problem is with arbitrary transformations. Log transformations - where appropriate - tend to aid interpretation of effects (e.g., by quantifying them as proportionate rather than additive effects). I also dislike the implication that "experimental effects" exist in a pristine form prior to transformation. This is simply not the case - how to quantify and scale measurement of an effect is a tricky business (e.g., memory researchers can use percentage correct, hits minus false alarms, A prime, d prime, etc.) and many effects come "pre-transformed" (e.g., measurement of loudness in decibels). Transformations (including the rank transformation) are useful tools that can increase power and aid interpretation if used carefully and appropriately.
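To make the "proportionate effects" point concrete, here is a toy example (the data are made up) in which a condition slows every response by 20%. On the raw scale the effect varies with the baseline; on the log scale it is a constant shift whose antilog recovers the ratio directly:

```python
from math import log, exp
from statistics import mean

# Hypothetical reaction times where the effect is multiplicative:
# condition b is 20% slower than condition a, observation by observation
a = [300, 450, 620, 810]
b = [x * 1.2 for x in a]

# On the log scale the multiplicative effect becomes a constant difference
diff_log = mean(log(x) for x in b) - mean(log(x) for x in a)
ratio = exp(diff_log)  # back-transforms to the ratio, 1.2
```

Far from "distorting" the effect, the log transformation here expresses it on the scale where it is simplest; the rank transformation, by contrast, offers no such back-transformation.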

(5) The use of parametric statistics to analyze Likert-style rating scales may be one of the "seven deadly sins of statistical analysis", but it is rarely a big problem in practice. Least squares methods are most messed up by heavy-tailed distributions, severe skew or outliers. If anything, Likert-style rating scales tend not to have these problems (or to manifest them relatively mildly). Furthermore, where there are problems with Likert-style measures, rank randomization or transformation tests are probably not the solution. A number of parametric procedures for ordinal outcomes exist - notably ordered logistic regression (though least squares methods such as t tests, ANOVA or regression should work well when their assumptions are not badly violated).

(6) The variety of nonparametric tests referred to is slightly artificial. As a general rule there are advantages to sticking to standard 'parametric' tests such as the Welch-Satterthwaite t test or one-way ANOVA rather than using named rank-transformation tests such as the Mann-Whitney U test. In some cases there may be advantages to specialized rank randomization tests where sample sizes are small (e.g., because software such as R implements exact versions). However, there are a few cases where the rank randomization tests are not robust (e.g., the Mann-Whitney U test is not robust to heterogeneity of variance) or lack power (e.g., the Friedman test, Page's L test and most multiple comparison procedures available for ranks). Rank transforming the data and then running a t test with the Welch-Satterthwaite correction is superior to running the Mann-Whitney directly (Zimmerman & Zumbo, 1993b). For more on the low power of the Friedman test and better alternatives see here.
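The rank-then-Welch recipe is simple enough to sketch in a few lines of stdlib Python (function names are mine; for brevity the sketch assumes no tied values, and it returns the statistic and Welch-Satterthwaite df rather than a p value):

```python
from statistics import mean, variance

def welch_t(x, y):
    """Welch t statistic and Welch-Satterthwaite df
    (no equal-variance assumption)."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

def rank_welch(x, y):
    """Rank the pooled samples jointly, then run Welch's t on the ranks."""
    pooled = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # assumes no ties
    return welch_t([rank[v] for v in x], [rank[v] for v in y])
```

Unlike the Mann-Whitney U test, this version inherits Welch's protection against unequal variances, which is the point of the Zimmerman and Zumbo result.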
Rather than thinking in terms of nonparametric statistics, it is better to focus on checking assumptions (using graphical methods and simple descriptives) and on checking our models against more robust procedures. If more robust methods show different results, the next step is to find out why (and definitely not to report just the outcome you prefer). This should lead you to a superior model (using robust methods or perhaps a more appropriate parametric model). The consideration of robust methods is particularly important: this includes some rank transformation tests, but also robust regression, bootstrapping and other tools (e.g., see Wilcox & Keselman, 2003, 2004; Baguley, 2012).


References


Baguley, T. (2012, in press). Serious stats: A guide to advanced statistics for the behavioral sciences. Basingstoke: Palgrave.


Wilcox, R. R., & Keselman, H. J. (2003). Modern robust data analysis methods: Measures of central tendency. Psychological Methods, 8, 254-274.

Wilcox, R. R., & Keselman, H. J. (2004). Robust regression methods: Achieving small standard errors when there is heteroscedasticity. Understanding Statistics, 3, 349-364.


Zimmerman, D. W., & Zumbo, B. D. (1993b). Rank transformations and the power of the Student t test and Welch t' test for non-normal populations with unequal variances. Canadian Journal of Experimental Psychology, 47, 523-539.

