Saturday, November 09, 2013

Multicollinearity and collinearity (in multiple regression) - a tutorial

This blog post was written for undergraduate research methods teaching. I have therefore tried to keep everything relatively simple and equation-free. The content is based loosely on more detailed material in my book Serious stats.


What are collinearity and multicollinearity?

Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression have a non-zero correlation. Multicollinearity occurs when more than two predictor variables (e.g., x1, x2 and x3) are inter-correlated.

How common is collinearity or multicollinearity?

If you collect observational data, or data from a non-experimental or quasi-experimental study, collinearity or multicollinearity will nearly always be present. The only studies where it won’t tend to occur (unless you are very, very lucky) are certain designed experiments – notably fully balanced designs such as a factorial ANOVA with equal n per cell. Thus the most important issue is not whether multicollinearity or collinearity is present but what impact it has on your analysis.

Do collinearity or multicollinearity matter?

I find it helpful to break down collinearity and multicollinearity into three situations, of which only the third is common.

[Note: From here on I’ll just use the terms collinearity and multicollinearity more or less interchangeably for convenience. This is also common practice in the literature]

1. Perfect collinearity. If two or more predictors are perfectly collinear (you can perfectly predict one from some combination of the others) then your multiple regression software will either not run (e.g., return an error) or it will drop one or more predictors (and possibly also return an error). Perfect collinearity happens a lot by accident (e.g., if you enter two versions of an identical variable such as the mean score and total score of a scale, or dummy codes for every category of a categorical predictor).
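
For example, this is roughly what happens in R (a minimal sketch with simulated data, not taken from the post): lm() still runs, but the redundant predictor is dropped and its coefficient is reported as NA.

    # A hypothetical illustration of perfect collinearity (simulated data)
    set.seed(1)
    x1 <- rnorm(100)
    x2 <- 2 * x1                     # x2 is perfectly predictable from x1
    y  <- 0.5 * x1 + rnorm(100)
    fit <- lm(y ~ x1 + x2)
    summary(fit)                     # the coefficient for x2 is reported as NA
                                     # ("1 not defined because of singularities")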

2. Almost perfect collinearity. If the correlation between predictors isn’t quite perfect (but is very close to r = 1) then this can sometimes cause “estimation problems” (meaning that the software you are using might be able to run the analysis or might generate incomplete output). Most modern software can cope with this situation just fine – but even then estimates will be hard to interpret (e.g., be implausibly high or low and have very large standard errors). If your software can cope with this situation then technically the estimates will be correct and you just have an extreme form of situation 3 below.

3. Multicollinearity. As already noted, most situations in which you would use regression (apart from certain designed experiments) involve a degree of multicollinearity. So if a method section ever claims that “multicollinearity is not present”, generally this will be untrue. A better statement to make is something along the lines of “there were no problems with multicollinearity”. However, generally this will also be untrue. For this to be true, the degree of multicollinearity needs to be very small or the sample size very large (or both). Neither is common in psychological research.

To understand why, it is necessary to consider what impact multicollinearity has on:

i) the overall regression model,
ii) estimates of the effects of individual predictors.

The good news

Multicollinearity has no impact on the overall regression model and associated statistics such as R², F ratios and p values. It also should not generally have an impact on predictions made using the overall model. (The latter might not be true if the predictor correlations in the sample don’t reflect the correlations in the situation you are making predictions for – but that isn’t really a multicollinearity issue, rather a consequence of having an unrepresentative sample).
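
One way to convince yourself of this (a sketch with simulated data; none of the variable names come from the post) is to refit the same model using re-expressed predictors that carry exactly the same information but are far less correlated: the R² and overall F statistic are unchanged.

    # Sketch: the overall fit does not depend on how correlated the predictors are
    set.seed(2)
    x1 <- rnorm(200)
    x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(200)   # correlated roughly .9 with x1
    y  <- x1 + x2 + rnorm(200)

    fit_raw <- lm(y ~ x1 + x2)                      # highly correlated predictors
    fit_alt <- lm(y ~ I(x1 + x2) + I(x1 - x2))      # same information, much lower correlation

    summary(fit_raw)$r.squared                      # identical R squared
    summary(fit_alt)$r.squared
    summary(fit_raw)$fstatistic                     # identical overall F
    summary(fit_alt)$fstatistic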

The bad news

Multicollinearity is a problem if you are interested in the effects of individual predictors. This turns out to be a major issue in psychology because this is probably the main reason that psychologists use multiple regression: to tease apart the effects of different predictors. There are two main (albeit related) issues here: the first is a philosophical problem and the second is a statistical one.

The philosophical issue. If two or more predictors are correlated then it is inherently difficult to tease apart their effects. For instance, imagine a study that looks at the effect of happiness and depression on alcohol consumption. If happiness is highly correlated with depression (e.g., r = -.90) then regression commands in packages such as SPSS or R will come up with estimates of the unique effect of happiness on alcohol consumption (by holding depression constant). This is an estimate of the effect of happiness that ignores the variance it shares with depression. However, depression isn’t generally constant when happiness varies; they tend to vary together.

The philosophical issue is this: is it meaningful to interpret the unique impact of happiness if happiness and depression are intimately related? Although this philosophical issue is potentially important, researchers often tend to ignore it. The main advice here is to think carefully before trying to interpret individual effects if there is a high level of multicollinearity in your model.

The statistical issue. The underlying statistical issue with multicollinearity is fairly simple. The unique effects of individual predictors are estimated by holding all other predictors constant and thus ignoring any shared variance between predictors. A regression model uses information about the variation in the predictors and the associated variation in the outcome (the y variable) to calculate estimates. As n (the number of participants or cases – the sample size) increases, you have more information and the analysis has greater statistical power. You also get more information from cases or participants that are more variable relative to each other. So a participant who is more extreme on a predictor has a bigger impact on the analysis than one who is less extreme. If multicollinearity is present then each data point tends to contribute less information to the estimate of individual effects than it does to the overall analysis. (Holding the effects of other predictors constant effectively reduces the variability of a predictor and thus reduces its influence.)

Multicollinearity therefore reduces the effective amount of information available to assess the unique effects of a predictor. You can also think of it as reducing the effective sample size of the analysis. For instance, in the example above, happiness and depression (where r = -.90) share (-.90)² = .81 or 81% of their variance. Thus the tests of their unique effects use only (100 – 81) = 19% of the information (about a fifth) available to the overall model, and thus the effective sample size is over 5 times smaller.
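
As a quick check of that arithmetic (using the figures from the example above):

    r <- -0.90
    shared    <- r^2            # .81, i.e., 81% of the variance is shared
    tolerance <- 1 - shared     # .19, i.e., 19% of the information is unique
    1 / tolerance               # about 5.3: the effective sample size is over 5 times smaller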

Thus the fundamental statistical impact of multicollinearity is to reduce effective sample size and thus statistical power for estimates of individual predictors. It is worth looking at each of the main statistics in turn:
b (the unstandardized slope) – this parameter estimate remains unbiased, but is estimated less accurately when multicollinearity is present (i.e., its standard error is larger)
β (the standardized slope) – this parameter estimate remains unbiased, but is estimated less accurately when multicollinearity is present (i.e., its standard error is larger)
t (the t test statistic) – this is the ratio of the estimate to its standard error and thus will be smaller (and further from statistical significance)
95% CI (the 95% confidence interval) – this is the estimate plus or minus approximately two standard errors, thus the CI will be wider (reflecting greater uncertainty in the estimate)
Stability of estimates. Many textbooks refer to problems with the stability of estimates when multicollinearity is present. What this means is that estimates will jump around a lot if you add or drop predictors, or if you fit the same model in different data sets. This isn’t really a separate issue – just a logical consequence of having a smaller effective sample size. Any estimate based on a small effective sample size will be unstable in this sense. Statistics from small (effective) samples tend to be further from their population values than those from large samples. The simulation sketch below illustrates these effects.
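
Here is a minimal simulation sketch (made-up data and effect sizes) of the same regression fitted with and without collinearity between the predictors. The slopes are unbiased in both cases, but with collinearity the standard errors are larger and the confidence intervals wider.

    # Sketch: identical underlying slopes, with and without collinearity
    set.seed(3)
    n  <- 100
    x1 <- rnorm(n)
    x2_corr   <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)   # correlated roughly .9 with x1
    x2_uncorr <- rnorm(n)                                 # essentially uncorrelated with x1

    y_corr   <- x1 + x2_corr   + rnorm(n)
    y_uncorr <- x1 + x2_uncorr + rnorm(n)

    fit_corr   <- lm(y_corr   ~ x1 + x2_corr)
    fit_uncorr <- lm(y_uncorr ~ x1 + x2_uncorr)

    summary(fit_corr)$coefficients     # larger standard errors, smaller t values
    summary(fit_uncorr)$coefficients
    confint(fit_corr)                  # wider 95% confidence intervals
    confint(fit_uncorr)

Re-running this with different seeds also shows the collinear model’s estimates jumping around more from sample to sample – the instability described above.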

Detecting problems with multicollinearity

Two predictors. A natural starting point is to look at the simple correlations between predictors. If you have only two predictors this is sufficient to detect any problems with collinearity: if the simple correlation between the two predictors is zero then there is no problem. If the correlation is low then collinearity is probably just a minor nuisance – but it will still reduce statistical power (meaning that you are less likely to detect an effect and the effect will be measured less accurately). A larger correlation indicates a more serious problem. Working out how severe the problem is isn’t that easy, and it is generally a good idea to use a collinearity diagnostic such as tolerance or VIF for this purpose.

More than two predictors. With more than two predictors the simple correlations between predictors can be misleading. Even if they are all very low (and unless they are exactly zero) they could conceal important multicollinearity problems. This will happen if the predictors’ correlations don’t overlap – and thus have a cumulative effect (e.g., if the correlation between x1 and x2 explains a different bit of the variance in the outcome y than the correlation between x1 and x3).

Fortunately there are a number of multicollinearity diagnostics that can help detect problems. I will focus on perhaps the simplest of these: tolerance and VIF.

Tolerance. One way to think of tolerance is that it is the proportion of unique information that a predictor provides in the regression analysis. To calculate the tolerance you first obtain the proportion of predictor variance that overlaps with the other predictors. You then subtract this number from 1. For example, if the other predictors explain 60% of the variance in x1 then the tolerance of x1 (in a model with those predictors) is 1 – .6 = .4. Tolerance of 1 indicates no multicollinearity (for that predictor) and tolerance values approaching 0 indicate a severe multicollinearity problem.
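
In R, the tolerance of a predictor can be obtained by regressing that predictor on the other predictors (a sketch with simulated data; the variable names are made up):

    # Sketch: tolerance of x1 = 1 - R^2 from regressing x1 on the other predictors
    set.seed(4)
    x2 <- rnorm(150)
    x3 <- rnorm(150)
    x1 <- 0.6 * x2 + 0.4 * x3 + rnorm(150)     # x1 overlaps with x2 and x3
    y  <- x1 + x2 + rnorm(150)

    r2_x1     <- summary(lm(x1 ~ x2 + x3))$r.squared
    tolerance <- 1 - r2_x1
    tolerance                                   # proportion of x1's information that is unique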

Tolerance indicates how much information multicollinearity has cost the analysis. Thus tolerance of .4 indicates that parameter estimates, confidence intervals and significance tests for a predictor are only using 40% of the available information.

VIF. The VIF statistic of a predictor in a model is merely the reciprocal of its tolerance (i.e., VIF = 1/tolerance). So if tolerance is .4 then the VIF is 1/.4 = 2.5. VIF stands for variance inflation factor. This number indicates how much larger the error variance for the unique effect of a predictor is (relative to a situation where there is no multicollinearity). The VIF can also be thought of as the factor by which your sample size would need to be increased to match the efficiency of an analysis with no multicollinearity. So a VIF of 2.5 implies that you’d need a sample size 2.5 times larger than the one you actually have to overcome the degree of multicollinearity in your analysis.
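
Continuing that sketch, the VIF is just the reciprocal of the tolerance. (If the car package happens to be installed – an assumption, not something this post relies on – its vif() function should give the same value for x1 from the full model.)

    vif_x1 <- 1 / tolerance                     # variance inflation factor for x1
    vif_x1

    # Optional cross-check, assuming the car package is installed:
    # car::vif(lm(y ~ x1 + x2 + x3))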

Remedies

The best remedy for multicollinearity is either: i) to design a study that avoids it (e.g., by using an appropriate experimental design), or ii) to increase your sample size so that your estimates are sufficiently accurate. If these are not feasible there are other options that may be helpful (but which can also be harmful).

Dropping a predictor. Generally this is a bad option (though many textbooks recommend it). The reason it is usually a bad idea is that it hides the problem rather than solving it. For instance, if x1 and x2 are moderately correlated it is quite possible that each of them significantly predicts y on its own but that neither unique effect is statistically significant when both are in the model. Dropping x1 will thus make it look as though x2 is predicting y on its own (and vice versa). The true state of affairs is that they are jointly predicting y and that their precise individual contributions to this joint prediction are unknown.
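
A simulated sketch of this pattern (made-up data, not from the original post): each predictor significantly predicts y on its own, but the unique effects in the joint model are much weaker.

    # Sketch: jointly predictive, individually ambiguous predictors
    set.seed(5)
    n  <- 60
    x1 <- rnorm(n)
    x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)
    y  <- x1 + x2 + 2 * rnorm(n)

    summary(lm(y ~ x1))         # x1 looks like a strong predictor on its own
    summary(lm(y ~ x2))         # so does x2
    summary(lm(y ~ x1 + x2))    # the overall model fits well, but the unique
                                # effects have much larger standard errors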

Worse still, dropping a predictor can be actively misleading (e.g., if you select the predictor you drop so that the final model supports your favoured hypothesis or theory). Sometimes dropping a predictor is a somewhat reasonable thing to do. One situation is when two variables are measuring more or less the same thing. For instance, if you have two measures of trait anxiety and they are highly correlated, it may well be reasonable to drop one of them (though even in this case there are usually better options). Another situation is when you believe one variable is just a proxy for the other. For instance, age and school year are highly correlated and both are predictors of arithmetic ability. In this case age may just be a proxy for school year (on the assumption that arithmetic is taught rather than acquired spontaneously as you age).

Combining or transforming predictors. If you have highly correlated predictors it is usually better to combine them in some way rather than drop them from the analysis. There are many ways that predictors could be combined (and statistical procedures such as factor analysis exist that are designed to do exactly this). However, even crude methods such as adding predictors together (or averaging them) can be surprisingly effective (though it may be necessary to rescale them if they are not on the same scale). Other options may also suggest themselves (e.g., using the difference between predictors or some weighted combination) depending on the theory motivating your model.
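
For instance, a crude composite of the happiness and depression measures might be formed by standardizing them, reversing one so that they point in the same direction, and averaging (a sketch with simulated data; the variable names are made up):

    # Sketch: replace two highly correlated predictors with a single composite
    set.seed(6)
    happiness  <- rnorm(120, mean = 50, sd = 10)
    depression <- 80 - 0.9 * happiness + rnorm(120, sd = 4)    # strongly negatively correlated
    alcohol    <- 20 - 0.1 * happiness + 0.1 * depression + rnorm(120, sd = 3)

    # Standardize, align direction (reverse-score happiness), then average
    distress <- (scale(depression)[, 1] - scale(happiness)[, 1]) / 2

    summary(lm(alcohol ~ distress))    # one composite predictor instead of two collinear ones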

Do nothing. Sometimes the best thing to do is nothing. You may just wish to honestly report that a set of predictors jointly predicts some outcome and that more data are required to tease their individual effects apart. Alternatively, you may not care that some of your predictors are highly correlated. For instance if you have some predictors of theoretical interest and some that are not (e.g., because they are potential confounding variables), as long as the predictors you are interested in have high tolerance it won’t matter if the other predictors have low tolerance. Such predictors are sometimes called nuisance variables – and what matters is that you have dealt with them in some way (not whether you have estimated them accurately).

Conclusions

There are four main conclusions to take from this tutorial:

1) Multicollinearity is nearly always a problem in multiple regression models
2) Even small degrees of multicollinearity can cause serious problems for an analysis if you are interested in the effects of individual predictors
3) Small samples are particularly vulnerable to multicollinearity problems because multicollinearity reduces your effective sample size for the effects of individual predictors
4) There are no ‘easy’ solutions (e.g., dropping predictors is generally a bad idea)

Further reading

More detail on multicollinearity can be found in: Baguley, T. (2012). Serious stats: A guide to advanced statistics for the behavioral sciences. Basingstoke: Palgrave.

Update

I have a short note on my book blog about getting multicollinearity diagnostics in R.


