Thursday, June 28, 2012

p values, the cliff effect and the nature of evidence

I've been thinking about p values quite a bit recently - prompted by other bloggers and by some journal work. One interesting phenomenon in this area is the cliff effect: a supposedly abrupt shift in researchers' confidence in an effect when moving from p > .05 to p < .05 (assuming that alpha is set at .05, as it usually is).


The p value cliff effect is interesting for a number of reasons (e.g., as a possible cause of dichotomous thinking about effects - see Rosenthal & Gaito, 1963). Personally, I think (e.g., Baguley, 2012a) that the tendency towards dichotomous thinking is more subtle and complex - and not just a consequence of the cliff effect. Furthermore, there is some doubt about the prevalence of the cliff effect itself. Poitevineau and Lecoutre (2001) argued that there was a clear cliff effect in only a minority of participants (4 out of 18 in their sample of researchers with PhDs). The appearance of a cliff in the full sample came from averaging over a mix of patterns among the participants - with, for instance, some showing linear increases and some exponential increases in confidence as p decreased.


My own musings on this are about whether there is a 'correct' pattern of confidence in an effect as p decreases. I'm not sure a sensible answer is possible, but I think it is possible to have a stab at answering if you make a few assumptions. First, assume that confidence is driven by the evidential value of a p value. This allows us to draw on the likelihood principle (and hence have some mathematical and philosophical basis for the analysis). I'm not going to explain likelihood or the likelihood principle here - though there is a brief introduction in Chapter 11 of Serious Stats (Baguley, 2012b).


Proceeding on this basis, you hit a major stumbling block. The evidential value in the data is contained in the likelihood function - but the p value doesn't give us the likelihood function (at least, not straightforwardly). Furthermore, the law of likelihood implies that you can't get the evidence for an effect from p (or from the likelihood of a single hypothesis on its own). Evidence for a specific hypothesis (e.g., a null hypothesis such as mu = 0) can only be assessed relative to a specific alternative hypothesis (e.g., mu = 10). So even if you could get at the likelihood function you'd need to know what alternative hypothesis each researcher had in mind when they evaluated p. To make the problem tractable you could place the alternative hypothesis at the most likely value (the maximum likelihood estimate). For simple situations this boils down to the observed parameter estimate of interest (e.g., a mean or difference in means). This is helpful because you can go from a p value derived from a z statistic to an approximate maximum likelihood ratio fairly easily (with additional assumptions about the asymptotic distribution of the test statistic).* The maximum likelihood ratio is the largest possible ratio of evidential support in favour of the alternative hypothesis relative to the null hypothesis (of zero effect) from the data at hand (ignoring support from other sources). It is therefore the most favourable assessment of the evidence for the alternative hypothesis and against the null on the basis of the data alone (and the true evidential support is almost certain to be lower).
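
For illustration, the conversion can be sketched in a few lines of R (this assumes a two-sided p from a z test, and the helper name below is just for exposition):

## rough sketch: convert a two-sided p value from a z test into an approximate
## maximum likelihood ratio, exp(z^2/2) - the likelihood at the maximum likelihood
## estimate relative to the likelihood under the null (zero effect)
p_to_max_lr <- function(p) {
  z <- qnorm(1 - p / 2)  # z statistic implied by a two-sided p value
  exp(z^2 / 2)
}
round(p_to_max_lr(c(.05, .01, .001)), 1)  # roughly 6.8, 27.6 and 224.5

So a 'just significant' p of .05 corresponds, at best, to a maximum likelihood ratio of around 7 in favour of the alternative.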


Plotting the maximum likelihood ratio as a function of p gives the following curve:
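
(Something like the following R snippet, under the same assumptions as above, reproduces the curve:)

p <- seq(.001, .20, length.out = 500)
max_lr <- exp(qnorm(1 - p / 2)^2 / 2)  # approximate maximum likelihood ratio
plot(p, max_lr, type = "l", xlab = "p value", ylab = "maximum likelihood ratio")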


It flattens out when p > .20, so I've just shown the 'curvy' bit. Of course there is no particular reason why you'd use the likelihood ratio rather than, say, the probability of the null hypothesis being false:
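
(One crude way to get such a probability - and this is a strong assumption - is to treat the maximum likelihood ratio as if it were a Bayes factor and combine it with equal prior odds on the null and alternative:)

p <- seq(.001, .20, length.out = 500)
max_lr <- exp(qnorm(1 - p / 2)^2 / 2)  # approximate maximum likelihood ratio
pr_alt <- max_lr / (1 + max_lr)  # posterior probability of the alternative, given equal prior odds
plot(p, pr_alt, type = "l", xlab = "p value", ylab = "probability the null is false")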






This shows a more gradual curve. Yet another possibility is the logarithm of the maximum likelihood ratio (commonly used to index evidential support in statistics):
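
(For a z-based p value this is just z^2/2, so a couple of lines of R along these lines will generate the curve:)

p <- seq(.001, .20, length.out = 500)
log_max_lr <- qnorm(1 - p / 2)^2 / 2  # natural log of the maximum likelihood ratio
plot(p, log_max_lr, type = "l", xlab = "p value", ylab = "log maximum likelihood ratio")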




Each of these is a plausible quantity to drive intuitions of evidential support (though you could easily arrive at different curves by making different - and perhaps more plausible - assumptions).


The upshot is that neither a fairly abrupt cliff effect, nor a smooth (near-linear) function, nor a steep exponential-style curve would be unreasonable. It is also clear that the task itself - assessing confidence in an effect - is far from trivial. Almost any continuous monotonic function could be considered rational given the right assumptions, and there is no particular reason to expect researchers to be able to do the right computations in their heads ...


On the other hand, a step function cliff effect does seem unreasonable - and it is definitely still interesting to understand the psychology of expert and non-expert statistical reasoning. However, we should be wary of assuming that people are being irrational or are bad at statistical reasoning on the basis of this kind of task.


Postscript. This is all predicated on a few idle speculations, several highly debatable assumptions and a quick play with some R code. It is quite possible I've overlooked something major ...


* See Edwards et al. (1963) or Serious Stats, section 11.4.2.


References



Baguley, T. (2012a). Can we be confident in our statistics? The Psychologist, 25, 128-129.


Baguley, T. (2012b). Serious stats: A guide to advanced statistics for the behavioral sciences. Basingstoke: Palgrave.


Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193-242.


Poitevineau, J., & Lecoutre, B. (2001). Interpretation of significance levels by psychological researchers: The .05 cliff effect may be overstated. Psychonomic Bulletin & Review, 8, 846-850.


Rosenthal, R., & Gaito, J. (1963). The interpretation of levels of significance by psychological researchers. Journal of Psychology, 55, 33-38.
