I've been thinking about p values quite a bit recently - prompted by other bloggers and by some journal work. One interesting phenomenon in this area is the cliff effect: a supposed abrupt shift in researchers' confidence in an effect when moving from p > .05 to p < .05 (assuming that alpha is set at .05, as it usually is).
The p value cliff effect is interesting for a number of reasons (e.g., as a possible cause of dichotomous thinking about effects - see Rosenthal & Gaito, 1963). Personally (e.g., Baguley, 2012a), I think that the tendency to dichotomous thinking is more subtle and complex - and not just a consequence of the cliff effect. Furthermore, there is some doubt about the prevalence of the cliff effect. Poitevineau and Lecoutre (2001) argued that there was a clear cliff effect in only a minority of participants (4 out of 18 in their sample of researchers with PhDs). The appearance of a cliff in the full sample came from averaging over a mix of patterns among the participants - with, for instance, some showing linear increases and some exponential increases in confidence as p decreased.
My own musings on this are about whether there is a 'correct' pattern of confidence in an effect as p decreases. I'm not sure a sensible answer is possible, but I think it is possible to have a stab at answering if you make a few assumptions. First, assume that confidence is driven by the evidential value of a p value. This allows us to draw on the likelihood principle (and hence have some mathematical and philosophical basis for the analysis). I'm not going to explain likelihood or the likelihood principle here - though there is a brief introduction in Chapter 11 of Serious Stats (Baguley, 2012b).
Proceeding on this basis, you hit a major stumbling block. The evidential value in the data is contained in the likelihood function - but the p value doesn't give us the likelihood function (at least, not straightforwardly). Furthermore, the law of likelihood implies that you can't get the evidence for an effect from either p or the likelihood. Evidence for a specific hypothesis (e.g., a null hypothesis such as mu = 0) can only be assessed relative to a specific alternative hypothesis (e.g., mu = 10). So even if you could get at the likelihood function you'd need to know what alternative hypothesis each researcher had in mind when they evaluated p. To make the problem tractable you could assume that the alternative hypothesis is at the most likely value (the maximum likelihood estimate). For simple situations this boils down to the observed parameter estimate of interest (e.g., a mean or difference in means). This is helpful because you can go from a p value derived from a z statistic to an approximate maximum likelihood ratio fairly easily (with additional assumptions about the asymptotic distribution of the test statistic).* The maximum likelihood ratio is the largest possible ratio of the evidential support in favour of the alternative hypothesis relative to the null hypothesis (of zero effect) from the data at hand (ignoring support from other sources). This is the most favourable assessment of the evidence for the alternative hypothesis and against the null on the basis of the data alone (and true evidential support is almost certain to be lower).
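To make the step from p to the maximum likelihood ratio concrete, here is a sketch of the algebra. It rests on assumptions that are mine rather than stated in the post: a two-sided p value, a normally distributed estimate with known standard error, and a null hypothesis of zero effect.

```latex
% Illustrative assumptions: \hat{\mu} \sim N(\mu, SE^2), two-sided p, null at \mu = 0.
\begin{align*}
z &= \frac{\hat{\mu}}{SE}, \qquad |z| = \Phi^{-1}\!\left(1 - \tfrac{p}{2}\right) \\
\frac{L(\hat{\mu})}{L(0)} &= \frac{\exp(0)}{\exp\!\left(-\tfrac{z^{2}}{2}\right)} = \exp\!\left(\tfrac{z^{2}}{2}\right)
\end{align*}
```

Under these assumptions p = .05 corresponds to |z| of about 1.96 and a maximum likelihood ratio of about 6.8.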
Plotting the maximum likelihood ratio as a function of p gives the following curve:
It flattens out when p > .20, so I've just shown the 'curvy' bit. Of course there is no particular reason why you'd use the likelihood ratio rather than, say, the probability of the null hypothesis being false:
This shows a more gradual curve. Yet another possibility is the logarithm of the maximum likelihood ratio (commonly used to index evidential support in statistics):
Each of these is a plausible quantity to drive intuitions of evidential support (though you could easily arrive at different curves by making different - and perhaps more plausible - assumptions).
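The three quantities can be tabulated with a few lines of code. This is a minimal Python sketch, not the R code used for the original plots. It assumes a two-sided p value from a z test, and it assumes that the 'probability of the null being false' is obtained from the maximum likelihood ratio with equal prior odds (LR / (1 + LR)); the post does not say how that curve was computed, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def max_likelihood_ratio(p):
    """Maximum likelihood ratio for a two-sided p value from a z test (assumed)."""
    z = NormalDist().inv_cdf(1 - p / 2)  # |z| matching the two-sided p
    return math.exp(z ** 2 / 2)


def prob_null_false(p):
    """LR / (1 + LR): assumes equal prior odds (an illustrative assumption)."""
    lr = max_likelihood_ratio(p)
    return lr / (1 + lr)


def log_max_likelihood_ratio(p):
    """Natural log of the maximum likelihood ratio, equal to z^2 / 2."""
    return math.log(max_likelihood_ratio(p))


# Print the three candidate 'confidence' scales over a range of p values.
for p in (0.20, 0.10, 0.05, 0.01, 0.001):
    print(
        f"p = {p:<6} max LR = {max_likelihood_ratio(p):8.2f}  "
        f"P(null false) = {prob_null_false(p):.3f}  "
        f"log LR = {log_max_likelihood_ratio(p):.2f}"
    )
```

Printing the three columns side by side makes it easy to compare how quickly each scale changes as p decreases.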
The upshot is that neither a fairly abrupt cliff effect, nor a smooth (near linear) function nor a steep exponential-style curve would be unreasonable. It is also clear that the task itself - assessing confidence in an effect - is far from trivial. Almost any continuous monotonic function could be considered as rational given the right assumptions and there is no particular reason to expect researchers to be able to do the right computations in their heads ...
On the other hand, a step function cliff effect does seem unreasonable - and it is definitely still interesting to understand the psychology of expert and non-expert statistical reasoning. However, we should be wary of assuming that people are being irrational or are bad at statistical reasoning on the basis of this kind of task.
Post script. This is all predicated on a few idle speculations, several highly debatable assumptions and a quick play with some R code. It is quite possible I've overlooked something major ...
* See Edwards et al. (1963) or Serious Stats, section 11.4.2.
References
Baguley, T. (2012a). Can we be confident in our statistics? The Psychologist, 25, 128-9.
Baguley, T. (2012b). Serious stats: A guide to advanced statistics for the behavioral sciences. Basingstoke: Palgrave.
Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193-242.
Poitevineau, J., & Lecoutre, B. (2001). Interpretation of significance levels by psychological researchers: The .05 cliff effect may be overstated. Psychonomic Bulletin & Review, 8, 846-850.
Rosenthal, R. & Gaito, J. (1963). The interpretation of levels of significance by psychological researchers. Journal of Psychology, 55, 33-38.
Thursday, June 28, 2012
p values, the cliff effect and the nature of evidence
Thursday, June 21, 2012
The stimuli-as-fixed-effect fallacy
Neuroskeptic has just blogged on a new paper by Judd, Westfall and Kenny on Treating stimuli as a random factor in social psychology: A new and comprehensive solution to a pervasive but largely ignored problem. I can't access the original paper (which is supposed to be available via my University but hasn't appeared yet ...) but I know a little bit about the topic and thought I'd write a few words.
What stimulated me to write was a) a few of the comments on Neuroskeptic's blog, and b) that I've just written a book that covers the topic in some detail. (Yes - this book!).
The basic problem is that standard statistical analyses in psychology treat participants (subjects) as a random factor, but stimuli as a fixed factor. Thus our statistics assume that the goal of inference is to say something about some population that those participants are representative of (rather than just the particular people in our study). By treating stimuli as fixed it is assumed that we've exhaustively sampled the population of interest in our study. This limits statistical generalization to those particular stimuli. This is an unattractive property for psycholinguists (because they tend to be interested in, say, all concrete nouns rather than the 30 nouns used in the study). The same issue may apply to lots of other types of stimuli (faces, people, voices, pictures, logic problems and so forth).
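A small simulation can illustrate why this matters. The sketch below is my own illustration, not taken from Judd, Westfall and Kenny: the same items are shown to every subject, items are nested within two conditions, there is no true condition effect, and all standard deviations are set to 1 purely for convenience. It then runs the usual by-subject paired t test, which treats the items as fixed.

```python
import random
import statistics

T_CRIT_DF19 = 2.093  # two-sided .05 critical value of t with 19 df (20 subjects)


def by_subject_t(n_subj=20, n_items=10, rng=random):
    """Return the by-subject paired t statistic for one null experiment.

    Items are nested in condition and shared by all subjects. Item, subject
    and residual SDs are all 1 (arbitrary illustrative values).
    """
    items_a = [rng.gauss(0, 1) for _ in range(n_items)]
    items_b = [rng.gauss(0, 1) for _ in range(n_items)]
    diffs = []
    for _ in range(n_subj):
        subj = rng.gauss(0, 1)  # subject effect cancels in the difference
        mean_a = statistics.fmean([subj + i + rng.gauss(0, 1) for i in items_a])
        mean_b = statistics.fmean([subj + i + rng.gauss(0, 1) for i in items_b])
        diffs.append(mean_a - mean_b)
    se = statistics.stdev(diffs) / n_subj ** 0.5
    return statistics.fmean(diffs) / se


def false_positive_rate(n_sims=500, seed=1):
    """Proportion of null experiments declared significant at alpha = .05."""
    rng = random.Random(seed)
    hits = sum(abs(by_subject_t(rng=rng)) > T_CRIT_DF19 for _ in range(n_sims))
    return hits / n_sims


print(false_positive_rate())
```

With these settings the rejection rate should come out far above the nominal .05. The difference between the two sets of item means is common to every subject, so the by-subject error term never sees that source of variability. The hard-coded critical value only applies to the default of 20 subjects.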
The comments fell into several camps, but one response was that this was another case of researchers getting basic stats wrong. I consider this to be unfair because we're not talking basic stats here. The problem is quite subtle and the solutions are, in statistical terms, far from basic. Furthermore, it is not always an error. There are situations in which you don't need to worry about the problem and situations in which it is debatable what the correct approach is.
Another response was that psycholinguists have known about this problem for years (true!) and have analyzed their data correctly (false!). The problem came to prominence in a paper by Herb Clark (1973; The language-as-fixed-effect fallacy), but was originally raised by Coleman (1964). Clark noted that running separate ANOVAs treating subjects as the unit of analysis and items as the unit of analysis (by-subject and by-item analyses) did not solve the problem. If either analysis is statistically non-significant the effect fails to generalize, but even if both are statistically significant the correct analysis (that combines variability across subjects and items) might still be statistically non-significant. His solution was to estimate the correct ANOVA test statistic (quasi F or F') with a simple-to-calculate minimum value (min F'). This is known to be conservative (i.e., produces p values that are slightly too large) but not unreasonably so in practice (see Raaijmakers et al., 1999). Raaijmakers et al. (1999) show that until recently most psycholinguistic researchers still got it wrong (e.g., by reporting separate by-item and by-subject analyses).
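For readers who have only the by-subject and by-item F ratios to hand, min F' is easy to compute. The sketch below uses the formula as I understand it from Clark (1973): min F' = F1 F2 / (F1 + F2), with denominator df (F1 + F2)^2 / (F1^2 / n2 + F2^2 / n1), where n1 and n2 are the error df of the by-subject and by-item tests. The function and argument names are hypothetical.

```python
def min_f_prime(f1, df1_error, f2, df2_error):
    """Return (min F', denominator df) from by-subject F1 and by-item F2.

    df1_error and df2_error are the error (denominator) df of F1 and F2.
    The numerator df of min F' is the same as for F1 and F2.
    """
    min_f = (f1 * f2) / (f1 + f2)
    df_denom = (f1 + f2) ** 2 / (f1 ** 2 / df2_error + f2 ** 2 / df1_error)
    return min_f, df_denom


# Made-up example values: F1(1, 19) = 8.0 by subjects, F2(1, 18) = 5.0 by items.
min_f, df_denom = min_f_prime(8.0, 19, 5.0, 18)
print(f"min F'(1, {df_denom:.1f}) = {min_f:.2f}")
```

In this made-up example both separate tests would be comfortably significant, yet min F' is only about 3.08 on roughly 34.7 denominator df, which illustrates Clark's point that two significant analyses do not guarantee a significant combined test.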
What is the correct approach? Well, it depends. First, do you need to generalize beyond your stimuli set? This has to do with your research goals. In some applied research you might just need to understand how people respond to a particular set of stimuli. A single stimulus or stimulus set can offer a counterexample to a strong claim (e.g., that X is always the case). Alternatively, it might be reasonable to assume that the stimuli are - for the purposes of the study - very similar to others in the population (i.e., that population variability is negligible). This might be the case for certain mass-produced products (e.g., brands of chocolate bar) or precision-engineered equipment. However, a lot of the time you do want to generalize beyond your sample of stimuli ...
That leaves you with the option of altering the design of the study or incorporating the extra variability owing to stimuli into the analysis. The design option was considered by Clark (1973) and by Raaijmakers et al. (1999). Clark pointed out that if each person had a different (ideally random) sample of items from the stimulus population then the F ratio of a conventional ANOVA would be correct. The principle here is quite simple: all relevant sources of variability need to be represented in the analysis. By varying the stimuli between participants the variability is present and ends up being incorporated into the between-subjects error term.* This is quite a neat method and can be easy to set up in some studies (e.g., if you have a very large pool of words to sample from by computer). Raaijmakers et al. (1999) also note that you get the correct F ratios from certain other designs. This, in my view, is only partly true. Any design that restricts the population sampled from (of participants or stimuli) restricts its variability and therefore restricts its generalizability to the pool of participants or stimuli being sampled from.
Recent developments in statistics and software (or at least recent awareness of them in psychology) have brought the discussion of the language-as-fixed-effect fallacy, or more properly the stimuli-as-fixed-effect fallacy, back to prominence. In principle it is possible to use a multilevel (or linear mixed) model to deal with the problem of multiple random effects (and this has all sorts of other advantages). However, the usual default model is a nested model that implicitly assumes that the stimuli presented to each person are different.
A nice point here is that a nested multilevel repeated measures model fitted with RML (restricted maximum likelihood) and a certain covariance structure (compound symmetry) is pretty much equivalent to repeated measures ANOVA and can be used to derive standard F tests etc. Thus Clark's assertion about using a design with stimuli nested within participants producing the correct F ratios is confirmed.
Baayen et al. (2008) offered a critique of the standard approach and explained how to fit a multilevel model with crossed random factors (i.e., where stimuli are the same for all participants ... or equivalently participants are the same for all stimuli). These models can be fitted in software such as MLwiN or R (but not SPSS**) that allows for cross-classified multilevel models. The lme4 package in R is particularly useful because it fits these models fairly effortlessly.
This looks to be the solution described by Judd, Westfall and Kenny - as far as I can tell from their abstract - and it is the solution I cover in my book (Baguley, 2012).
* Note that a by-item analysis or by-subject analysis violates this principle because each analysis uses the average response (averaged over the levels of the other random factor) and the variability around this average is unavailable to the analysis.
** UPDATE: Jake Westfall kindly sent me a copy of the paper. I have not read it properly yet but it looks extremely good. He points out that recent versions of SPSS can run cross-classified models (I'm still on an older version). Their paper includes SPSS, R and SAS code. I would still recommend R over SPSS. One highlight is that they show how to compute the Kenward-Roger approximation in R. Complex multilevel models make it difficult to assess the correct df for effects and the Kenward-Roger approximation is one of the better solutions. In my book I used parametric bootstrapping or HPD intervals to get round this problem, but this is potentially a very useful addition.
References
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory & Language, 59, 390-412.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12, 335-359.
Coleman, E. B. (1964). Generalizing to a language population. Psychological Reports, 14, 219-226.
Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "The language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.


