Neuroskeptic has just blogged on a new paper by Judd, Westfall and Kenny, "Treating stimuli as a random factor in social psychology: A new and comprehensive solution to a pervasive but largely ignored problem". I can't access the original paper (which is supposed to be available via my University but hasn't appeared yet ...) but I know a little bit about the topic and thought I'd write a few words.

What stimulated me to write was a) a few of the comments on Neuroskeptic's blog, and b) that I've just written a book that covers the topic in some detail. (Yes - this book!).

The basic problem is that standard statistical analyses in psychology treat participants (subjects) as a random factor, but stimuli as a fixed factor. Thus our statistics assume that the goal of inference is to say something about some population that those participants are representative of (rather than just the particular people in our study). Treating stimuli as fixed, in contrast, assumes that we've exhaustively sampled the stimulus population of interest in our study. This limits statistical generalization to those particular stimuli. This is an unattractive property for psycholinguists (because they tend to be interested in, say, all concrete nouns rather than the 30 nouns used in the study). The same issue may apply to lots of other types of stimuli (faces, people, voices, pictures, logic problems and so forth).

The comments fell into several camps, but one response was that this was another case of researchers getting basic stats wrong. I consider this to be unfair because we're not talking basic stats here. The problem is quite subtle and the solutions are, in statistical terms, far from basic. Furthermore, it is not always an error. There are situations in which you don't need to worry about the problem and situations in which it is debatable what the correct approach is.

Another response was that psycholinguists have known about this problem for years (true!) and have analyzed their data correctly (false!). The problem came to prominence in a paper by Herb Clark (The language-as-fixed-effect fallacy), but was originally raised by Coleman (1964). Clark noted that running separate ANOVAs treating subjects as the unit of analysis and items as the unit of analysis (by-subject and by-item analyses) does not solve the problem. If either analysis is statistically non-significant the effect fails to generalize, but even if both are statistically significant the correct analysis (one that combines variability across subjects and items) might still be statistically non-significant. His solution was to estimate the correct ANOVA test statistic (quasi F or F') with a simple-to-calculate minimum value (min F'). This is known to be conservative (i.e., it produces p values that are slightly too large), but not unreasonably so in practice (see Raaijmakers et al., 1999). Raaijmakers et al. (1999) show that until recently most psycholinguistic researchers still got it wrong (e.g., by reporting separate by-item and by-subject analyses).
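For readers who want to try this, here is a minimal R sketch of the min F' calculation from the by-subjects (F1) and by-items (F2) ANOVAs; the example numbers at the end are made up:

```r
# min F' (Clark, 1973): a lower bound on the quasi-F statistic,
# computed from the by-subjects F (F1) and by-items F (F2).
# df1 is the (shared) numerator df; df2.s and df2.i are the error
# (denominator) dfs of the by-subjects and by-items analyses.
min.F <- function(F1, F2, df1, df2.s, df2.i) {
  minF <- (F1 * F2) / (F1 + F2)
  df2  <- (F1 + F2)^2 / (F1^2 / df2.i + F2^2 / df2.s)
  data.frame(minF = minF, df1 = df1, df2 = df2,
             p = pf(minF, df1, df2, lower.tail = FALSE))
}

# Hypothetical example: F1(1, 23) = 9.4 by subjects, F2(1, 29) = 6.1 by items
min.F(9.4, 6.1, df1 = 1, df2.s = 23, df2.i = 29)
```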

What is the correct approach? Well, it depends. First, do you need to generalize beyond your stimulus set? This has to do with your research goals. In some applied research you might just need to understand how people respond to a particular set of stimuli. A single stimulus or stimulus set can also offer a counterexample to a strong claim (e.g., that X is always the case). Alternatively, it might be reasonable to assume that the stimuli are - for the purposes of the study - very similar to others in the population (i.e., that population variability is negligible). This might be the case for certain mass-produced products (e.g., brands of chocolate bar) or precision-engineered equipment. However, a lot of the time you do want to generalize beyond your sample of stimuli ...

That leaves you with the option of altering the design of the study or incorporating the extra variability owing to stimuli into the analysis. The design option was considered by Clark (1973) and by Raaijmakers et al. (1999). Clark pointed out that if each person had a different (ideally random) sample of items from the stimulus population then the F ratio of a conventional ANOVA would be correct. The principle here is quite simple: all relevant sources of variability need to be represented in the analysis. By varying the stimuli between participants the variability is present and ends up being incorporated into the between-subjects error term.* This is quite a neat method and can be easy to set up in some studies (e.g., if you have a very large pool of words to sample from by computer). Raaijmakers et al. (1999) also note that you get the correct F ratios from certain other designs. This, in my view, is only partly true. Any design that restricts the population sampled from (of participants or stimuli) restricts its variability and therefore restricts generalization to the pool of participants or stimuli being sampled from.
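As an illustration (with hypothetical names and numbers), sampling a fresh random item list for each participant is a one-liner in R:

```r
# Each participant gets their own random sample of 30 items from a
# large pool, so item variability is carried into the between-subjects
# error term of a conventional ANOVA.
word.pool  <- paste0("word", 1:2000)  # pretend stimulus population
n.subjects <- 40
stim.lists <- lapply(seq_len(n.subjects),
                     function(s) sample(word.pool, 30))
str(stim.lists[1:2])  # inspect the first two participants' lists
```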

Recent developments in statistics and software (or at least recent awareness of them in psychology) have brought the discussion of the language-as-fixed-effect fallacy (or, more properly, the stimuli-as-fixed-effect fallacy) back to prominence. In principle it is possible to use a multilevel (or linear mixed) model to deal with the problem of multiple random effects (and this has all sorts of other advantages). However, the usual default model is a nested model that implicitly assumes that the stimuli presented to each person are different.

A nice point here is that a nested multilevel repeated measures model fitted with REML (restricted maximum likelihood) and a certain covariance structure (compound symmetry) is pretty much equivalent to repeated measures ANOVA and can be used to derive standard F tests etc. This confirms Clark's assertion that a design with stimuli nested within participants produces the correct F ratios.
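A minimal sketch of such a model in R's nlme package (the data frame and variable names are hypothetical): a random-intercept model fitted by REML implies a compound symmetry covariance structure, so its F tests match those of a repeated measures ANOVA when the ANOVA assumptions hold.

```r
library(nlme)

# Random intercept per subject + REML = compound symmetry structure
rm.mod <- lme(rt ~ condition, random = ~1 | subject,
              data = dat, method = "REML")
anova(rm.mod)  # F tests comparable to a repeated measures ANOVA
```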

Baayen et al. (2008) offered a critique of the standard approach and explained how to fit a multilevel model with crossed random factors (i.e., where stimuli are the same for all participants ... or, equivalently, participants are the same for all stimuli). These models can be fitted in software such as MLwiN or R (but not SPSS**) that allows for cross-classified multilevel models. The lme4 package in R is particularly useful because it fits these models fairly effortlessly.
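In lme4 the crossed structure falls out of the model formula directly (again, the data frame and variable names here are hypothetical):

```r
library(lme4)

# Subjects and items are crossed random factors: every subject
# responds to every item, and each gets its own random intercept.
xmod <- lmer(rt ~ condition + (1 | subject) + (1 | item), data = dat)
summary(xmod)
```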

This looks to be the solution described by Judd, Westfall and Kenny - as far as I can tell from their abstract - and it is the solution I cover in my book (Baguley, 2012).

* Note that a by-item analysis or by-subject analysis violates this principle because each analysis uses the average response (averaged over the levels of the other random factor) and the variability around this average is unavailable to the analysis.

** UPDATE: Jake Westfall kindly sent me a copy of the paper. I have not read it properly yet but it looks extremely good. He points out that recent versions of SPSS can run cross-classified models (I'm still on an older version). Their paper includes SPSS, R and SAS code. I would still recommend R over SPSS. One highlight is that they show how to compute the Kenward-Roger approximation in R. Complex multilevel models make it difficult to assess the correct df for effects and the Kenward-Roger approximation is one of the better solutions. In my book I used parametric bootstrapping or HPD intervals to get round this problem, but this is potentially a very useful addition.
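One way to get Kenward-Roger adjusted F tests in R (not necessarily the code in their paper) is the pbkrtest package, which compares the full model against one with the fixed effect of interest dropped:

```r
library(lme4)
library(pbkrtest)

# Kenward-Roger approximate F test for the effect of condition
# (hypothetical data frame and variables, as above)
full <- lmer(rt ~ condition + (1 | subject) + (1 | item), data = dat)
null <- lmer(rt ~ 1 + (1 | subject) + (1 | item), data = dat)
KRmodcomp(full, null)
```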

References


Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory & Language, 59, 390-412.

Baguley, T. (2012). Serious stats: A guide to advanced statistics for the behavioral sciences. Palgrave Macmillan.

Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12, 335-359.

Coleman, E. B. (1964). Generalizing to a language population. Psychological Reports, 14, 219-226.

Raaijmakers, J. G. W., Schrijnemakers, J. M. C., & Gremmen, F. (1999). How to deal with "The language-as-fixed-effect fallacy": Common misconceptions and alternative solutions. Journal of Memory & Language, 41, 416-426.



I have been thinking about writing a paper on MANOVA (and in particular why it should be avoided) for some time, but never got round to it. However, I recently discovered an excellent article by Francis Huang that pretty much sums up most of what I'd cover. In this blog post I'll just run through the main issues and refer you to Francis' paper for a more in-depth critique, or to the section on MANOVA in Serious Stats (Baguley, 2012).

I wrote a brief introduction to logistic regression aimed at psychology students. You can take a look at the pdf here:  

A more comprehensive introduction in terms of the generalised linear model can be found in my book:

Baguley, T. (2012). Serious stats: a guide to advanced statistics for the behavioral sciences. Palgrave Macmillan.
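For flavor, a minimal logistic regression in R (variable names hypothetical) is just a generalised linear model with a binomial error and logit link:

```r
# Model a binary outcome (correct/incorrect) from a predictor
fit <- glm(correct ~ ability, data = dat,
           family = binomial(link = "logit"))
summary(fit)
exp(coef(fit))  # coefficients expressed as odds ratios
```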

I wrote a short blog post (with R code) on how to calculate corrected CIs for rho and tau using the Fisher z transformation.
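An illustrative version for Spearman's rho (not necessarily identical to the function in the post), using the Bonett-Wright standard error for the Fisher z transformed statistic:

```r
# 95% CI for Spearman's rho via the Fisher z transformation with the
# corrected standard error sqrt((1 + rho^2/2)/(n - 3))
rho.ci <- function(rho, n, conf.level = 0.95) {
  z    <- atanh(rho)                      # Fisher z transform
  se   <- sqrt((1 + rho^2 / 2) / (n - 3)) # corrected SE for rho
  crit <- qnorm(1 - (1 - conf.level) / 2)
  tanh(c(lower = z - crit * se, upper = z + crit * se))
}

rho.ci(0.46, n = 50)  # made-up values
```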

I have written a short article on Type II versus Type III SS in ANOVA-like models on my Serious Stats blog:

https://seriousstats.wordpress.com/2020/05/13/type-ii-and-type-iii-sums-of-squares-what-should-i-choose/
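In R, the distinction matters as soon as a factorial design is unbalanced; a hedged sketch using the car package (data frame and factors hypothetical):

```r
library(car)

mod <- lm(y ~ a * b, data = dat)
Anova(mod, type = 2)  # Type II SS; use type = 3 for Type III
# (for Type III, first set sum-to-zero contrasts on the factors)
```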

I have just published a short blog on the Egon Pearson correction for the chi-square test. This includes links to an R function to run the corrected test (and also provides residual analyses for contingency tables).

The blog is here and the R function here.
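The correction itself is simple: the usual Pearson chi-square statistic is multiplied by (N - 1)/N. A bare-bones sketch (not the linked function, which does more):

```r
# Egon Pearson's N - 1 corrected chi-square for a contingency table
pearson.n1 <- function(tab) {
  X2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  N  <- sum(tab)
  df <- (nrow(tab) - 1) * (ncol(tab) - 1)
  X2.adj <- X2 * (N - 1) / N
  c(X2 = X2.adj, df = df,
    p = pchisq(X2.adj, df, lower.tail = FALSE))
}

pearson.n1(matrix(c(12, 5, 7, 16), nrow = 2))  # made-up 2 x 2 table
```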

Bayesian Data Analysis in the Social Sciences Curriculum

Supported by the ESRC’s Advanced Training Initiative

Venue: Bowden Room, Nottingham Conference Centre, Burton Street, Nottingham, NG1 4BU

Booking information online

Provisional schedule:

Organizers:

Thom Baguley   twitter: @seriousstats

Mark Andrews  twitter: @xmjandrews

The third and (possibly) final round of our introductory workshops was overbooked in April, but we have managed to arrange some additional dates in June.

There are still places left on these. More details at: http://www.priorexposure.org.uk/

As with the last round we are planning a free R workshop beforehand (recommended if you need a refresher or have never used R before).

In my Serious Stats blog I have a new post on providing CIs for a difference between independent R-squared coefficients.

You can find the post there or go direct to the function hosted on RPubs. I have been experimenting with knitr but can't yet get the HTML from R Markdown to work with my Blogger or WordPress blogs.
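For illustration only (this is not necessarily the method in the post or the RPubs function), a percentile bootstrap of the difference between two independent R-squared values might look like this:

```r
# Percentile bootstrap CI for R^2(group 1) - R^2(group 2), where the
# groups are independent samples fitted with the same model formula
boot.r2.diff <- function(dat1, dat2, formula, B = 2000) {
  r2 <- function(d) summary(lm(formula, data = d))$r.squared
  diffs <- replicate(B, {
    r2(dat1[sample(nrow(dat1), replace = TRUE), , drop = FALSE]) -
      r2(dat2[sample(nrow(dat2), replace = TRUE), , drop = FALSE])
  })
  quantile(diffs, c(0.025, 0.975))
}
```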