Monday, May 22, 2006

What is all this stuff about sphericity in my repeated measures ANOVA output?

Until relatively recently most psychologists had never heard of sphericity. Most text books didn't mention it (and even now some of them confuse it with compound symmetry). The big change happened a few years ago when SPSS started including information about sphericity in its repeated measures ANOVA output. Of course, because most undergraduate text books didn't mention it, people started looking for information on it. Around that time I taught an advanced course on psychological statistics and produced this resource for my tutees. It turned out that they passed the web link on to other students on the course. Once I realized it was widely used I tried to keep it fairly up-to-date until I moved jobs and my web pages became homeless. I keep meaning to improve the layout and add a little more detail in places, but in the meantime I hope a few people find it useful ...


An Introduction to Sphericity in Repeated Measures ANOVA

The article is written for a general audience of post-graduate and graduate researchers. The technical material goes slightly beyond what is covered in most text books, although there is still some simplification (which is usually indicated in the text). The aim is to give advice about best practice for checking and dealing with sphericity in repeated measures ANOVA. Some of the content is personal opinion (which I have tried to indicate in the text). I include a short bibliography of my sources at the end for readers who want to explore the topic in more detail.

Some background

Sphericity is a mathematical assumption in repeated measures ANOVA designs. Let's start by considering a simpler ANOVA design (e.g., one-way independent measures ANOVA).

In independent measures ANOVA one of the mathematical assumptions is that the variances of the populations that groups are sampled from are equal. This homogeneity of variance assumption more-or-less follows from the null hypothesis being tested in ANOVA - if the treatment has no effect on the thing being measured (the DV) then we can consider all the groups to be sampled from the same population.1 However, because we're taking samples we'd be very lucky to observe exactly equal variances (even if the assumption were perfectly met). Real data is rarely that neat! What we'd expect to get (most of the time) is groups with similar variances.2

The sphericity assumption can be thought of as an extension of the homogeneity of variance assumption in independent measures ANOVA. Why does the assumption need to be extended? To understand this we need to introduce the ANOVA covariance matrix.

The covariance matrix

What is a covariance matrix? In a nutshell, it is a matrix that contains the covariances between levels of a factor in an ANOVA design.3 A covariance is the shared or overlapping variance between two things (sometimes called variance in common). Let's look at an example of the layout for a one-factor ANOVA design with four levels and therefore four samples (called A1, A2, A3 and A4):


Samples:    A1     A2     A3     A4

A1          S1²    S12    S13    S14
A2          S21    S2²    S23    S24
A3          S31    S32    S3²    S34
A4          S41    S42    S43    S4²


The first thing to notice is that the main diagonal cells in the matrix (running top left to bottom right) contain the variances of the four levels (e.g., S1² is the variance of A1).4 The second thing to notice is that the covariances are therefore in the cells off the main diagonal (called the off-diagonal cells). The third thing to notice is that the covariances are mirrored above and below the main diagonal. (The term S14 is the covariance between samples A1 and A4, while S41 is the covariance between samples A4 and A1. As this is the variance they have in common, S14 = S41.)
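If you want to see such a matrix for your own data, here is a minimal sketch (assuming Python with numpy; the few rows of data are the illustrative scores that reappear in the raw-data table later in the article):

    import numpy as np

    # Each row is one participant, each column one level of factor A (A1..A4).
    # These are the illustrative scores used in the raw-data table later on.
    scores = np.array([
        [8.0,  9.0, 12.0, 4.0],
        [6.0, 11.0, 16.0, 3.0],
        [9.0,  8.0, 12.0, 5.0],
    ])

    # rowvar=False tells numpy that the variables (levels) are in columns;
    # the result is a 4 x 4 matrix with variances on the main diagonal and
    # covariances in the off-diagonal cells, mirrored about the diagonal.
    print(np.cov(scores, rowvar=False))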

What does the covariance matrix look like for independent measures ANOVA? Here is an example for a one-way independent measures ANOVA design with four levels (and hence four groups):


Samples:    A1     A2     A3     A4

A1          S1²    0      0      0
A2          0      S2²    0      0
A3          0      0      S3²    0
A4          0      0      0      S4²


The most striking observation is that all the covariances are zero. Why? The answer is fairly straightforward. In an independent measures design the observations should be independent and therefore uncorrelated with each other.5 Two samples that are uncorrelated will share no variance (and the covariance will be zero). So in this relatively simple case we only have to worry about homogeneity of variance - which would lead us to expect that the observed variances on the main diagonal should be similar.

Reminder: Assumptions such as homogeneity of variance, sphericity and so forth are assumptions about the populations we are sampling from. I'll try and indicate this as I go through, but it sometimes gets clumsy to keep repeating "in the population being sampled" all the time! We expect samples to have similar characteristics to the populations being sampled, but only in rare cases will the samples show exactly the same pattern of variance (or whatever) as the population. It is also worth adding that large samples are more similar to the populations they are sampled from than small samples.

Finally, please note that the statistical term "population" is an abstract one. We are referring to a population of data points that we might potentially be sampling, not a fixed entity such as the population of a country. (In other contexts, such as market research, people sometimes deal with such fixed populations, but this requires slightly different methods from those in most scientific research.)


What is the sphericity assumption?

Compound symmetry

The sphericity assumption is an assumption about the structure of the covariance matrix in a repeated measures design. Before we describe it in detail, let's consider a simpler (but stricter) condition. This one is called compound symmetry. Compound symmetry is met if all the covariances (the off-diagonal elements of the covariance matrix) are equal and all the variances are equal in the populations being sampled. (Note that the variances don't have to equal the covariances.) Just as with the homogeneity of variance assumption we'd only rarely expect a real data set to meet compound symmetry exactly, but provided the observed covariances are roughly equal in our samples (and the variances are OK too) we can be pretty confident that compound symmetry is not violated.
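If you like to check this sort of thing in code, here is a minimal sketch (assuming Python with numpy; the helper name is my own) that simply lists the variances and covariances so they can be eyeballed for rough equality:

    import numpy as np

    def describe_covariance_matrix(cov):
        """List the variances and covariances so they can be eyeballed for
        rough equality (a crude screen for compound symmetry)."""
        cov = np.asarray(cov, dtype=float)
        # Variances sit on the main diagonal; covariances above the diagonal
        # mirror those below it, so one triangle is enough to inspect.
        print("variances:  ", np.diag(cov))
        print("covariances:", cov[np.triu_indices_from(cov, k=1)])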

The good news about compound symmetry

If compound symmetry is met then sphericity is also met. So if you take a look at the covariance matrix and the covariances are similar and the variances are similar then we know that sphericity is not going to be a problem.6

The bad news about compound symmetry

As compound symmetry is a stricter requirement than sphericity we still need to check sphericity if compound symmetry isn't met. This is where it gets technical (or should that be ... even more technical).

The sphericity assumption

Let's take a look at some raw data. Imagine that the first few observations of A1, A2, A3 and A4 are as follows:

                 A1     A2     A3     A4

Participant 1     8      9     12      4
Participant 2     6     11     16      3
Participant 3     9      8     12      5
etc.            ...    ...    ...    ...

For each possible pair of levels of factor A (e.g., A1 and A2 or A2 and A3) we can calculate the difference between the observations. For example:

                 A1-A2   A1-A3   A1-A4   etc.

Participant 1      -1      -4      +4    ...
Participant 2      -5     -10      +3    ...
Participant 3      +1      -3      +4    ...
etc.              ...     ...     ...    ...

We could then calculate variances for each of these differences (e.g., S1-4²). The sphericity assumption is that all the variances of the differences are equal (in the population sampled). In practice, we'd expect the observed sample variances of the differences to be similar if the sphericity assumption were met.

Using the covariance matrix to check the sphericity assumption

We can check the sphericity assumption using the covariance matrix, but it turns out to be fairly laborious. (Later on I'll discuss some simpler ways to check sphericity using output from SPSS and similar statistics packages.) The variance of a difference can be computed using a version of the variance sum law:

Sx-y² = Sx² + Sy² - 2(Sxy)


In other words, the variance of a difference is the sum of the two variances minus twice their covariance. A simple arithmetic check will show that this works out as zero if the two samples have equal variances and share all their variance (so that the covariance equals the variance).

(Note that we could also calculate the variances of the differences directly from the raw data. We'd simply calculate the differences between all the possible pairs of levels of a factor. For example, using Excel or SPSS we could define a new column as one level minus another level and then calculate the variance of each column using the program's built-in descriptive statistics. This would get very tedious if we had lots of levels, so I'd recommend the above method if you really want to calculate the variances of the differences and you already have a covariance matrix. Fortunately this isn't necessary in most cases - as I'll discuss later.)
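For readers who prefer code to spreadsheet columns, here is a minimal sketch (assuming Python with numpy, and using the first few illustrative rows from the table above) that computes the variances of the differences both ways and confirms they agree:

    import numpy as np
    from itertools import combinations

    # First few rows of the illustrative data set above (columns A1..A4).
    scores = np.array([
        [8.0,  9.0, 12.0, 4.0],
        [6.0, 11.0, 16.0, 3.0],
        [9.0,  8.0, 12.0, 5.0],
    ])

    cov = np.cov(scores, rowvar=False)   # sample covariance matrix

    for x, y in combinations(range(scores.shape[1]), 2):
        # Directly from the raw data: variance of the column of differences.
        direct = np.var(scores[:, x] - scores[:, y], ddof=1)
        # Via the variance sum law: s_x^2 + s_y^2 - 2 * s_xy.
        from_cov = cov[x, x] + cov[y, y] - 2 * cov[x, y]
        print(f"A{x + 1}-A{y + 1}: direct = {direct:.2f}, via covariances = {from_cov:.2f}")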


An example

This example is adapted from Kirk (1995). Imagine the observed covariance matrix for a one-way repeated measures ANOVA design is this:

Samples:    A1     A2     A3     A4

A1          10      5     10     15
A2           5     20     15     20
A3          10     15     30     25
A4          15     20     25     40

The variances of the differences are:

Sx-y² = Sx² + Sy² - 2(Sxy)

So ...

S1-2² = S1² + S2² - 2(S12) = 10 + 20 - 2(5) = 30 - 10 = 20

S1-3² = 10 + 30 - 2(10) = 20

S1-4² = 10 + 40 - 2(15) = 20

S2-3² = 20 + 30 - 2(15) = 20

S2-4² = 20 + 40 - 2(20) = 20

S3-4² = 30 + 40 - 2(25) = 20

This example has been contrived so that the variances of the differences are exactly equal (which would be extremely unlikely in real data), but it does demonstrate that lack of compound symmetry does not necessarily mean that sphericity is violated. (Compound symmetry is a sufficient, but not necessary, requirement for sphericity to be met.)7 In this example, compound symmetry is clearly not met (the largest variances and covariances are 4 or 5 times bigger than the smallest), but sphericity holds.
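A quick way to check arithmetic like this (a minimal sketch assuming Python with numpy) is to loop over every pair of levels and apply the variance sum law to the matrix above:

    import numpy as np
    from itertools import combinations

    # The contrived covariance matrix from the example above.
    cov = np.array([
        [10.0,  5.0, 10.0, 15.0],
        [ 5.0, 20.0, 15.0, 20.0],
        [10.0, 15.0, 30.0, 25.0],
        [15.0, 20.0, 25.0, 40.0],
    ])

    # Variance of each pairwise difference: s_x^2 + s_y^2 - 2 * s_xy.
    for x, y in combinations(range(4), 2):
        print(f"S{x + 1}-{y + 1}^2 =", cov[x, x] + cov[y, y] - 2 * cov[x, y])
    # All six lines print 20.0, so sphericity holds despite the unequal
    # variances and covariances.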


What to do if sphericity is violated in repeated measures ANOVA

There are two broad approaches to dealing with violations of sphericity. The first is to use a correction to the standard ANOVA tests. The second is to use a different test (i.e., one that doesn't assume sphericity).

In the following sub-sections I give general advice on what to do if sphericity is violated. This advice tends to hold well in most cases for factorial repeated measures designs, but may be problematic for mixed ANOVA designs (discussed later under Complications).

Correcting for violations of sphericity

The best known corrections are those developed by Greenhouse and Geisser (the Greenhouse-Geisser correction) and Huynh and Feldt (the Huynh-Feldt correction).8 Each of these corrections works roughly in the same way. They all attempt to adjust the degrees of freedom in the ANOVA test in order to produce a more accurate significance (p) value. If sphericity is violated the p values need to be adjusted upwards (and this can be accomplished by adjusting the degrees of freedom downwards).

The first step in each test is to estimate something called epsilon.9 For our purposes we can consider epsilon to be a descriptive statistic indicating the degree to which sphericity has been violated. If sphericity is met perfectly then epsilon will be exactly 1. If epsilon is below 1 then sphericity is violated. The further epsilon gets away from 1 the worse the violation.

How bad can epsilon get? Well, it depends on the number of levels (k) on the repeated measures factor:

Lower bound of epsilon = 1/(k-1)


Therefore, epsilon can go as low as 1/(3-1) = 0.5 for three levels, as low as 0.2 for six levels and so forth. The more levels on the repeated measures factor the worse the potential for violations of sphericity.10
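The formula for estimating epsilon isn't reproduced here, but for the curious the following is a minimal sketch of one standard estimator (the Greenhouse-Geisser, or Box, epsilon computed from an orthonormal set of contrasts applied to the covariance matrix). It assumes Python with numpy and scipy and is meant as an illustration, not a replacement for your statistics package:

    import numpy as np
    from scipy.linalg import null_space

    def box_epsilon(cov):
        """Greenhouse-Geisser (Box) estimate of epsilon from a k x k
        covariance matrix of the repeated measures levels."""
        cov = np.asarray(cov, dtype=float)
        k = cov.shape[0]
        # Rows of c are k-1 orthonormal contrasts (orthogonal to the unit vector).
        c = null_space(np.ones((1, k))).T
        m = c @ cov @ c.T                    # covariance matrix of the contrasts
        return np.trace(m) ** 2 / ((k - 1) * np.sum(m * m))

    # The contrived matrix from the earlier example: sphericity holds exactly,
    # so the estimate comes out as 1.
    example = [[10, 5, 10, 15], [5, 20, 15, 20], [10, 15, 30, 25], [15, 20, 25, 40]]
    print(box_epsilon(example))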

The three common corrections fall into a range from most to least strict. First consider the most strict. We could use the lower bound value of epsilon and correct for the worst possible case. Fortunately, there is a much better option. The Greenhouse-Geisser correction is a conservative correction (it tends to underestimate epsilon when epsilon is close to 1 and therefore tends to over-correct). Huynh and Feldt produced a modified version for use when the true value of epsilon is thought to be near or above 0.75.

The Huynh-Feldt correction tends to overestimate epsilon (and so under-correct), so some statisticians have suggested using the average of the Greenhouse-Geisser and Huynh-Feldt corrections. My advice would be to consider the aims of the research and the relative costs of Type I and Type II errors. If Type I errors are considered more costly (especially if the estimates of epsilon fall below 0.75) then stick to the more conservative Greenhouse-Geisser correction.

Using the correction is fairly simple. Replace the treatment and error d.f. by (epsilon * d.f.). So an epsilon value of 0.6 would turn an F(2,50) test into an F(1.2,30) test.11 Using these corrections seems to work well for relatively modest departures of epsilon from 1 or when sample sizes are small.
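To see what this amounts to numerically, here is a minimal sketch (assuming Python with scipy, and purely illustrative numbers); as footnote 11 notes, fractional d.f. are no problem when the p value is computed directly from the F distribution:

    from scipy.stats import f

    def corrected_p(f_value, df_effect, df_error, epsilon):
        """p value for an F test with both d.f. multiplied by epsilon."""
        return f.sf(f_value, epsilon * df_effect, epsilon * df_error)

    # Illustrative example: F = 4.2 on (2, 50) d.f., epsilon estimated at 0.6.
    print(corrected_p(4.2, 2, 50, 1.0))   # uncorrected (sphericity assumed) p
    print(corrected_p(4.2, 2, 50, 0.6))   # corrected p, based on F(1.2, 30)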


Using MANOVA

An alternative approach is to use a test that doesn't assume sphericity. In the case of repeated measures ANOVA this usually means switching to multivariate ANOVA (MANOVA for short). Some computer programs print out MANOVA results automatically alongside repeated measures ANOVA (SPSS is one of these). While this can be confusing, it does make it easy to compare results for different tests and corrections. If sphericity is met (i.e., epsilon = 1) all the p values for a given test should be identical. The degree to which they differ can be informative. If there is a wide discrepancy between the different tests or corrections then this suggests that the sphericity assumption may be severely violated and that one of the more conservative tests should be reported (e.g., Greenhouse-Geisser corrected F or MANOVA).

In general MANOVA is probably less powerful than repeated measures ANOVA and therefore should probably be avoided. However, some experts suggest that when sample sizes are reasonably large (n exceeds 10 + k) and epsilon is low (less than 0.7), MANOVA will typically be more powerful and should probably be preferred. The most up-to-date evidence suggests that the precise circumstances under which MANOVA is more powerful depend on the properties of the covariance matrix and on k in ways that are hard to predict.

My view is that most psychological research should avoid MANOVA and prefer the sphericity-assumed tests or corrections based on Huynh-Feldt or Greenhouse-Geisser epsilon estimates (especially when multisample sphericity - see below - is an issue). This assumes two things. First, that the important tests cannot be reduced to contrasts with 1 effect d.f. and a non-pooled error term (see Multiple comparisons and contrasts below). Second, that psychologists tend to deal with relatively small k and small N situations. I don't think there is much evidence that, where MANOVA is more powerful, it is much more powerful than ANOVA for the modest values of n or k that appear in most of the psychological literature. I suspect that greater gains in power would be achieved by looking at other aspects of the study in most cases (e.g., see Baguley, 2004). If high quality pilot data were available it would be possible to estimate the power of MANOVA or ANOVA in this situation (e.g., see Miles, 2003). However, I suspect that a small pilot sample would probably not estimate the population covariance with sufficient precision to be confident in this approach (this is not a flaw in the approach, but merely the observation that there is error in the estimates of all parameters in a sample - not just the mean).

How to check sphericity

In this section I will focus on information readily available in SPSS (and most good statistics packages).

Factorial repeated measures ANOVA

If there is more than one repeated measures factor consider each factor separately. (See also Special cases: factors with 2 levels.)


A warning about Mauchly's sphericity test

Many text books recommend using significance tests such as Mauchly's to test sphericity. In general this is a very bad idea.

Why? First, tests of statistical assumptions - and Mauchly's is no exception - tend to lack statistical power (they tend to be bad at spotting violations of assumptions when N is small). Second, tests of statistical assumptions - and, again, Mauchly's is no exception - tend not to be very robust (unlike ANOVA and MANOVA they are poor at coping with violations of assumptions such as normality). Third, significance tests don't reveal the degree of violation (e.g., with very large N even a poor test like Mauchly's will show significance if there are very minor violations of sphericity; with low N the poor power means that even severe violations may not be detected). Fourth, significance tests of assumptions tend to be used as substitutes for looking at the data - if you followed the advice of many popular introductory texts you'd never look at the descriptive statistics at all (e.g., the variances, the covariance matrix, estimates of epsilon and so forth). Fifth, I just don't like them.12


Using Mauchly's sphericity test

The principle of the test is fairly simple. The null hypothesis is that sphericity holds (I like to think of it as a test that the true value of epsilon = 1). A significant result indicates evidence that sphericity is violated (i.e., evidence that the true value of epsilon is below 1).

Epsilon

I would recommend using estimates of epsilon to decide whether sphericity is violated. If epsilon is close to 1 then it is likely that sphericity is intact (or that any violation is very minor). If epsilon is close to the lower bound (see Correcting for violations of sphericity above) then a correction or alternative procedure is likely to be necessary. Exactly where to draw the line is a matter of personal judgement, but it is often instructive to compare p values for the corrected and uncorrected tests. If they are fairly similar then there is little indication that sphericity is violated. If the discrepancy is large then one of the corrections (or possibly MANOVA) should probably be used.

The covariance matrix

If estimates of epsilon are not readily available then lower-bound procedures can be used (see above) or the covariance matrix can be consulted. If compound symmetry holds then it is safe to proceed with repeated measures ANOVA. If compound symmetry does not hold it is relatively simple (if time-consuming) to calculate the variances of the differences for each factor from the covariance matrix.


Complications

Special cases: factors with 2 levels (and the paired t test)

If k = 2 (a repeated measures factor with only two levels) then the sphericity assumption is always met. Using the lower-bound formula one can see that when k = 2 epsilon can't be lower than 1/(k-1) = 1/(2-1) = 1. This is also true for the paired t test (in effect a one-way repeated measures ANOVA where k = 2).

Why isn't sphericity a problem when there are only two levels? Well, think about the covariance matrix:

Samples:    A1     A2

A1          S1²    S12
A2          S21    S2²

There are two covariances, S21 and S12. The covariances above and below the main diagonal are constrained to be equal (because the shared variance between level 1 and level 2 is the same thing as the shared variance between level 2 and level 1). In effect there is only one covariance. Similarly, if we calculated S1-2² (the variance of the difference) we should realize there is, in effect, only one such variance. Sphericity is met if all the variances of the differences are equal. As there is only one, it can't fail to be equal to itself. For this reason Mauchly's sphericity test can't be computed if d.f. = 1 (i.e., if k = 2) and some computer programs give confusing messages or printouts if you try. This sometimes leads people to conclude that sphericity is violated when such a violation would be impossible.

Note that sphericity subsumes the standard homogeneity of variance assumption. In effect, we are only interested in the variances of the differences. When k = 2 there is only one variance of the difference between levels and we can ignore differences in the 'raw' level variances themselves.


Multiple comparisons and contrasts


Bonferroni t tests are frequently recommended for repeated measures ANOVA (whether or not sphericity is violated). The Bonferroni correction relies on a general probability inequality and therefore isn't dependent on specific ANOVA assumptions. As Bonferroni corrections tend to be conservative, a number of modified Bonferroni procedures have been proposed. Some are specific to certain patterns of hypothesis testing, but others such as Holm's test (or the equivalent Larzelere and Mulaik test) are more powerful than standard Bonferroni corrections and should be used more widely (and not just for ANOVA). The superficially similar Hochberg procedure is usually much more powerful than Bonferroni and modified Bonferroni procedures and might also be considered. The Hochberg procedure is radically different in philosophy from standard multiple comparison procedures (though similar to the Holm procedure in its computation) and should be considered carefully before being adopted.
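To make the step-down logic of Holm's test concrete, here is a minimal sketch (assuming Python with numpy, and purely illustrative p values); packages such as statsmodels provide ready-made versions of these adjustments:

    import numpy as np

    def holm_adjust(p_values):
        """Holm-Bonferroni step-down adjustment of a set of p values."""
        p = np.asarray(p_values, dtype=float)
        m = len(p)
        order = np.argsort(p)                  # smallest p value first
        adjusted = np.empty(m)
        running_max = 0.0
        for rank, idx in enumerate(order):
            # Multiply the (rank+1)-th smallest p by (m - rank), then enforce
            # monotonicity so later p values can't drop below earlier ones.
            running_max = max(running_max, (m - rank) * p[idx])
            adjusted[idx] = min(1.0, running_max)
        return adjusted

    # Purely illustrative p values from, say, six pairwise comparisons.
    print(holm_adjust([0.004, 0.020, 0.030, 0.250, 0.008, 0.041]))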

Experts recommend specific (rather than pooled) error terms for repeated measures factors (i.e., calculate the SE for t using only the conditions being compared, rather than using the square root of the MSE term from the ANOVA table to derive the pooled SD and hence compute a pooled SE).

This advice also extends to contrasts, which can easily be calculated by taking a weighted combination of the appropriate condition scores for each participant and testing it with a t test. Using a specific error term should avoid problems with sphericity (e.g., see Judd et al., 1995) for the same reason that sphericity is not a problem for factors with only 2 levels.
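As an illustration (a minimal sketch with made-up data and an arbitrary set of contrast weights), a within-subject contrast with its own specific error term reduces to a one-sample t test on per-participant contrast scores:

    import numpy as np
    from scipy.stats import ttest_1samp

    # Each row is one participant, each column one condition (A1..A4).
    # The data are purely illustrative.
    scores = np.array([
        [8.0,  9.0, 12.0, 4.0],
        [6.0, 11.0, 16.0, 3.0],
        [9.0,  8.0, 12.0, 5.0],
    ])

    # Hypothetical contrast: the average of A1 and A2 versus the average of A3 and A4.
    weights = np.array([0.5, 0.5, -0.5, -0.5])

    # One contrast score per participant; its variance supplies the specific
    # (non-pooled) error term for the t test.
    contrast_scores = scores @ weights
    t, p = ttest_1samp(contrast_scores, 0.0)
    print(t, p)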

In complex ANOVA or MANOVA designs an approach involving a priori contrasts is also preferable for other reasons. Such designs lead to multiple tests of theoretically uninteresting hypotheses, whereas selecting a small number of contrasts on the basis of prior theory is preferable for clarity, statistical robustness, control of Type I error and statistical power.


Mixed designs

Mixed designs (combining independent and repeated measures factors) muddy the waters somewhat. Mixed ANOVA requires that multisample sphericity holds. This more-or-less means that the covariance matrices should be similar between groups (i.e., across the levels of the independent measures factors). Provided group sizes are equal (or at least roughly equal) the Greenhouse-Geisser and Huynh-Feldt corrections perform well when multisample sphericity doesn't hold and can therefore still be used. If these corrections are inappropriate, or if group sizes are markedly unequal, then more sophisticated methods are required (Keselman, Algina & Kowalchuk, 2001). A description of these methods is beyond the scope of this summary (possible solutions include multilevel methods found in SAS PROC MIXED, MLwiN and HLM, though Keselman et al. also discuss a number of other options). For most psychological research the best advice is, if at all possible, to keep group sizes in mixed ANOVA equal (or as close to equal as feasible).


Bibliography

Baguley, T. (2004). Understanding statistical power in the context of applied research. Applied Ergonomics, 35, 73-80.

Field, A. (1998). A bluffer's guide to ... sphericity. The British Psychological Society: Mathematical, Statistical & Computing Section Newsletter, 6, 13-22.

Howell, D. C. (2002). Statistical Methods for Psychology (5th ed.). Belmont, CA: Duxbury Press.

[David Howell has just released a sixth edition which looks excellent. At first glance the main changes are in layout and use of software examples. Once I'm sure of the content I'll update this reference.]

Judd, C. M., McClelland, G. H., & Culhane, S. E. (1995). Data analysis: continuing issues in everyday analysis of psychological data. Annual Review of Psychology, 46, 433-465.

Keselman, H. J., Algina, J., & Kowalchuk, R. K. (2001). The analysis of repeated measures designs: a review. British Journal of Mathematical and Statistical Psychology, 54, 1-20.

Kirk, R. E. (1995). Experimental Design: Procedures for the Behavioral Sciences. (3rd ed.). Pacific Grove: Brooks/Cole.

Miles, J. (2003). A framework for power analysis using a structural equation modelling procedure. BMC Medical Research Methodology, 3: 27. Published online 2003 December 11. doi: 10.1186/1471-2288-3-27.


Footnotes:

1 As long as the treatment only has the effect of adding to or subtracting from the group means (and doesn't influence their variances) the homogeneity of variance assumption isn't a problem. This special case is known as unit-treatment additivity. Unfortunately life isn't always that simple: there are good reasons why treatments might be expected to influence both means and variances. For this reason it is always sensible to check the group variances in independent measures designs.

2 As a rule of thumb the largest group variance should be no more than three or four times as large as the smallest group variance.

3 Covariance matrices also crop up in all sorts of other statistics, but we can forget about that for now.

4 The diagonals contain the variances because samples share all of their variance with themselves. I've used s rather than the Greek sigma symbol because it turns out better when browsers with different fonts are used. s is normally used for samples and sigma for populations, but I'm using s throughout for convenience (sorry!). You can generate Greek letters by using the "symbol" font in many word processors (e.g., 's' for sigma, m for mu and so forth).

5 The covariance between groups will rarely be exactly zero in the samples. However, provided people are randomly assigned to groups, and each person contributes only one data point, then we can be pretty certain that the covariances in the populations being sampled are zero (and therefore the independence assumption is met). Even if random assignment to groups doesn't occur the independence assumption is often reasonable. Any time we know or believe that the measures will be correlated (e.g., in matched designs) a repeated measures analysis should be used. (I use repeated measures in a non-technical sense here, rather than just to mean a type of ANOVA or general linear model.)

6 I probably should have mentioned this earlier, but covariances (like correlations) can be both negative and positive (unlike variances which are always positive). Positive covariances occur between samples when two samples are positively correlated. Negative covariances occur between samples when two samples are negatively correlated. The idea of a negative covariance is often tricky to grasp - but it just means that as one group tends to vary upwards in value the other tends to vary downwards. So when checking covariances to see if they are similar bear in mind the sign of the covariance as well as its magnitude (e.g., a covariance of -124.3 is very different from +124.3).

7 By now you can probably appreciate why many text books focus on compound symmetry and don't cover sphericity in detail.

8 One of the nice things about this topic is that the tests have nice, impressive-sounding names.

9 The Greek letter epsilon is usually used. Greenhouse-Geisser estimates of epsilon have a little hat on top (^). Huynh-Feldt estimates have a little squiggle on top (~). You can generate Greek letters by using the "symbol" font in many word processors (e.g., 'e' for epsilon).

10 Later on we discuss the special case of k = 2 and the analogous case of paired t tests. Feel free to jump ahead if you wish.

11 You won't find tables for fractional d.f., but exact p values can be calculated if d.f. are fractional (most good computer packages do this automatically these days).

12 Why don't I like them? Apart from all the above reasons, I don't like the idea of using a significance test to test the assumptions of a significance test. If that was a good idea, why don't we use significance tests to test the assumptions of Mauchly's sphericity test or Levene's test of homogeneity of variances? At some point you've got to look at the data (using graphical methods, descriptive statistics and so forth) and make a considered judgement about what procedures to use. My view is that the earlier this is done in the process the better ...

