Sunday, December 13, 2009

A statistical puzzle about averages II



1. Who is correct?  Professor B is correct. If the average family has 1.8 children then the average child would be expected to have more than 0.8 siblings.

2. Why?  The average child is not from the average family. The concepts of average child and average family are different. For this reason there should be no expectation that the average child comes from a family with an average number of children. Although there are restricted circumstances under which this can happen (e.g., if all families have exactly the same number of children), they are sufficiently implausible to be discounted in any real world application.


In one case the unit of analysis is the family and in the other it is the child. A concrete example may help. (I'll stick to defining average as arithmetic mean throughout, but the same logic extends to other averages such as the median - see the quotation from Kish below).


Assuming that the number of children varies between families (which it must do if the mean number of children per family is 1.8) then the average child will be from a family with a larger number of children than average. For example, imagine there are only four families:





                                 Number of siblings per child
Family   Number of children   1st child   2nd child   3rd child   4th child
A        1                    0           -           -           -
B        1                    0           -           -           -
C        2                    1           1           -           -
D        4                    3           3           3           3


There are (1+1+2+4) = 8 children in the four families, thus the mean number of children per family is: 8/4 = 2.
 
The mean number of siblings per child is therefore:


(0+0+1+1+3+3+3+3)/8 = 14/8 = 1.75
 
So although the families have only two children each (on average), the children have 1.75 siblings each on average (not 1 sibling).
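The arithmetic can be checked with a short script. This is only a sketch: the function names are made up for illustration, and exact fractions are used simply to avoid rounding.

```python
from fractions import Fraction

# Number of children in each of the four example families (A, B, C, D).
children_per_family = {"A": 1, "B": 1, "C": 2, "D": 4}

def mean_children_per_family(families):
    """Family-level mean: total children divided by number of families."""
    return Fraction(sum(families.values()), len(families))

def mean_siblings_per_child(families):
    """Child-level mean: each child in a family of size k has k - 1 siblings."""
    siblings = [k - 1 for k in families.values() for _ in range(k)]
    return Fraction(sum(siblings), len(siblings))

family_mean = mean_children_per_family(children_per_family)
child_mean = mean_siblings_per_child(children_per_family)

print(float(family_mean))  # 2.0
print(float(child_mean))   # 1.75
```

The first function divides by the number of families (4) and the second by the number of children (8), which is where the two averages part company.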
 
Note that there are N = 4 families and n = 8 children. So switching the unit of analysis changes the denominator. Also note that while the families can plausibly be considered independent of each other the children can't (all children in the same family have the same number of siblings in this example, and more generally the number of siblings will be correlated).
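One way to write out the change of denominator is sketched below. The symbols are notation introduced here for illustration only: x_i is the number of children in family i, \mu and \sigma^2 are the mean and the (population) variance of x_i across the N families, and \bar{s} is the mean number of siblings per child.

```latex
\bar{s} \;=\; \frac{\sum_{i=1}^{N} x_i (x_i - 1)}{\sum_{i=1}^{N} x_i}
        \;=\; \frac{\sum_{i=1}^{N} x_i^2}{\sum_{i=1}^{N} x_i} - 1
        \;=\; \mu + \frac{\sigma^2}{\mu} - 1,
\qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 .
```

For the four families above, \mu = 2 and \sigma^2 = 1.5, giving 2 + 0.75 - 1 = 1.75. The child-level average equals \mu - 1 only when \sigma^2 = 0, that is, when every family has the same number of children.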
 
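What about zeroes?  In the calculations above I did not count childless households as families. If you also count households without children as families the discrepancy would be larger, because the mean number of children per family falls while the mean number of siblings per child is unchanged.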
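A rough sketch of the effect of zeroes, using the four example families plus two childless households. The two extra households are hypothetical, added purely for illustration, and the function name is made up.

```python
from fractions import Fraction

def naive_and_actual(children_per_household):
    """Return (household mean - 1, mean siblings per child) as exact fractions."""
    n_households = len(children_per_household)
    n_children = sum(children_per_household)
    naive = Fraction(n_children, n_households) - 1
    # A household with k children contributes k children with k - 1 siblings each.
    actual = Fraction(sum(k * (k - 1) for k in children_per_household), n_children)
    return naive, actual

families_only = [1, 1, 2, 4]             # the four example families
with_childless = families_only + [0, 0]  # hypothetical: two childless households added

for label, data in [("families only", families_only), ("with childless", with_childless)]:
    naive, actual = naive_and_actual(data)
    print(label, float(naive), float(actual), float(actual - naive))
```

Childless households lower the household-level mean (from 2 to 8/6) but contribute no children, so the child-level mean stays at 1.75 and the gap between the two figures grows from 3/4 to 17/12.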
 
Does it matter?  Yes. Much real world data is clustered in this way. It is important to realize that the average individual is not going to be from the average group (e.g., the average worker isn't from the average firm).
 
In practical terms this means that careful attention needs to be paid to sampling from families, workplaces, schools and so forth. A random sample of children will disproportionately sample children from large families. Conversely, a random sample of schools will over-represent small schools relative to the schools that most pupils attend. There are also social policy implications (e.g., if you are interested in reducing child poverty).
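The sampling point can be illustrated with the four families from the example. The sketch below (variable names are illustrative only) compares the chance of reaching each family when families are sampled at random with the chance when children are sampled at random.

```python
from fractions import Fraction

# Children per family in the four-family example.
children_per_family = {"A": 1, "B": 1, "C": 2, "D": 4}
total_children = sum(children_per_family.values())
n_families = len(children_per_family)

# Sampling families at random: every family is equally likely.
p_family_sample = {f: Fraction(1, n_families) for f in children_per_family}

# Sampling children at random: a family is reached in proportion to its size.
p_child_sample = {f: Fraction(k, total_children) for f, k in children_per_family.items()}

for f in children_per_family:
    print(f, float(p_family_sample[f]), float(p_child_sample[f]))

# Expected family size under each sampling scheme.
mean_size_by_family = sum(p * children_per_family[f] for f, p in p_family_sample.items())
mean_size_by_child = sum(p * children_per_family[f] for f, p in p_child_sample.items())
print(float(mean_size_by_family), float(mean_size_by_child))  # 2.0 2.75
```

Family D makes up a quarter of the families but supplies half of the children, so a random sample of children reaches it twice as often as a random sample of families would.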
 
There are also other problems with the analysis of clustered data. For this reason anyone working with clustered data needs to seriously consider using multilevel modeling or other methods that i) take the clustering into account and ii) allow hypotheses about different levels of the model (e.g., children and families) to be explored.


Postscript
 
This is quite an old puzzle. I first came across it in the article by Jenkins and Tuten (1992). They include formulae for deriving one average from the other and cite Huntington (1924) and other later observations of the same phenomenon. I made the connection to multilevel models a little later. For a good (if slightly out of date) introduction to multilevel models see Snijders and Bosker (1999). Recently I noticed that Kish (1965) discusses the same phenomenon in passing.
 
This quote from Kish sets out the problem quite clearly:
Although the mean number of adults per household is only 2.02, the mean number of household members is 2.24 for the average adult. The greater size ranges of large organizations produce more striking effects. In 1960, 50 million people lived in 130 U. S. cities that had 100,000 or more population; in this population, the average city size was 0.39 million, but the size of the city in which the average person lived was 2.0 millions. Using medians does not help: the median city size was 0.19 million, but the median person lived in a city of 0.62 million.
Kish (1965, p. 571).
References


Jenkins, J. J., & Tuten, J. T. (1992). Why isn’t the average child from the average family? – and similar puzzles. American Journal of Psychology, 105, 517-526.


Kish, L. (1965). Sampling organizations and groups of unequal sizes. American Sociological Review, 30, 564-572.


Snijders, T., & Bosker, R. (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. London: Sage.

Wednesday, December 09, 2009

A statistical puzzle about averages I

I wrote this a few years ago for a departmental newsletter. For some reason the second part (with the answer) never got published. I stumbled across it almost by accident the other day and thought I'd share it. I'll publish the canonical answer in due course.

Professor Quack knows (from the UK census) that families have an arithmetic mean of 1.8 children. He also knows (due to a bizarre mix-up in enrolment) that his Psychology For Everyone class is attended by a random sample of 50 people from the UK population. As part of an in-class demonstration of sampling theory he records the number of siblings of each student and calculates the average (using the standard formula for the arithmetic mean).

To his dismay he discovers that the students in his class have an arithmetic mean of 1.2 siblings. He later repeats the demonstration with two more random samples of the UK population (again with n = 50) and obtains values of 1.1 and 1.3 siblings.

Professor Quack consults two of his colleagues: Professor A and Professor B.

Professor A replies thus:

“You are right to be dismayed. Bias has somehow entered either your sampling procedure or your calculation of the mean. The true mean number of siblings should be 1.8 - 1 = 0.8.”

Professor B interjects thus:

“Nonsense! All is as it should be. The expected number of siblings in a random sample is most certainly not 0.8. Rather, one would expect the average student to have more than 0.8 siblings, just as you have observed.”

1) Who is correct?  Professor A or Professor B?

2) Why?

Tuesday, November 17, 2009

R functions for Dienes (2008) Understanding Psychology as a Science

I recently wrote a review of Understanding psychology as a science: an introduction to scientific and statistical inference by Zoltan Dienes (2008). Dienes' book covers Neyman-Pearson null hypothesis significance testing, Bayesian inference and the likelihood method of inference (inspired by Fisher and associated with A. W. F. Edwards and more recently R. Royall).

One of the most useful features of the book is that Dienes provides Matlab code for examples of calculations in the book (e.g., for Bayes factors, likelihood intervals and so forth). This is not so useful for me because I don't use Matlab. Matlab licenses are also quite expensive, and students in many Psychology departments may not have access to it. For those without access to Matlab, Dienes also provides calculators for a number of functions on his own web page for the book. (The calculators are found by following the links to the appropriate chapter, so the Bayes factor calculator is found by following the Chapter 4 link).

Danny Kaye and I thought it would be useful to write R code to complement the Matlab code for Dienes' functions as a 'bonus feature' for the review. As these functions and the notes for them take up quite a bit of space we decided to include only one, for a Bayes factor, in the review itself (with some notes on how to use it). Danny did most of the work writing the functions, which are more-or-less direct translations of the original Matlab code (and have been checked against the web versions). The full set of functions is hosted on his web site along with the notes on how to use them. Also included are page references for the examples in the book.

Why did we write the R functions? First, they offer convenient access to the functions for teachers and students (because R is free and runs on Windows, Mac OS or Linux operating systems). Second, they reduce the burden on Dienes' web calculator (at a marginal decrease in ease of use). Third, R is open source so it is simple to see how the code works and to edit, extend or adapt it (though it is polite to acknowledge the authors of the original code). Fourth, we want to encourage more people to start using R!

As an example, I've already written some alternative functions for likelihood intervals (though as it happened I re-wrote these almost from scratch to get them to plot the likelihood function and interval and to take advantage of some built-in R functions). Those functions are intended for the book I'm working on and so should appear in due course.

For those who are interested, Danny and I are presently working on implementing Bayesian t tests in R (Bayes factors with objective priors) in a user-friendly way for researchers.


References:

Baguley, T., & Kaye, W.S. (in press, 2009). Review of Understanding psychology as a science: An introduction to scientific and statistical inference. British Journal of Mathematical & Statistical Psychology.