Sunday, December 13, 2009

A statistical puzzle about averages II



1. Who is correct?  Professor B is correct. If the average family has 1.8 children then the average child would be expected to have more than 0.8 siblings.

2. Why?  The average child is not from the average family. The concepts of average child and average family are different, so there should be no expectation that the average child comes from a family with an average number of children. Although there are restricted circumstances under which this can happen (e.g., if all families have exactly the same number of children), they are sufficiently implausible to be discounted in any real world application.


In one case the unit of analysis is the family and in the other it is the child. A concrete example may help. (I'll stick to defining average as arithmetic mean throughout, but the same logic extends to other averages such as the median; see the quotation from Kish below.)


Assuming that the number of children varies between families (which it must do if the mean number of children per family is 1.8) then the average child will be from a family with a larger number of children than average. For example, imagine there are only four families:





                                Number of siblings per child
Family   Number of children   1st child   2nd child   3rd child   4th child
A        1                    0           -           -           -
B        1                    0           -           -           -
C        2                    1           1           -           -
D        4                    3           3           3           3


There are (1+1+2+4) = 8 children in the four families, thus the mean number of children per family is: 8/4 = 2.
 
The mean number of siblings per child is therefore:


(0+0+1+1+3+3+3+3)/8 = 14/8 = 1.75
 
So although each family has only two children (on average), each child has 1.75 siblings on average (not 1 sibling).
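The arithmetic above can be reproduced in a few lines. This is only a minimal sketch of the four-family example; the variable names are my own.

```python
# Number of children in each of the four example families (A, B, C, D).
family_sizes = [1, 1, 2, 4]

# Family as the unit of analysis: one value per family.
mean_children_per_family = sum(family_sizes) / len(family_sizes)

# Child as the unit of analysis: one value per child.
# Every child in a family with k children has k - 1 siblings.
siblings_per_child = [k - 1 for k in family_sizes for _ in range(k)]
mean_siblings_per_child = sum(siblings_per_child) / len(siblings_per_child)

print(mean_children_per_family)  # 2.0
print(mean_siblings_per_child)   # 1.75
```

The list of siblings has eight entries (one per child) while the list of family sizes has four (one per family), which is where the two averages part company.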
 
Note that there are N = 4 families and n = 8 children, so switching the unit of analysis changes the denominator. Also note that while the families can plausibly be considered independent of each other, the children can't (all children in the same family have the same number of siblings in this example, and more generally the number of siblings will be correlated).
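The change of denominator can be written out in general. What follows is a sketch using a standard identity; the symbols x_i, \mu, \sigma^2 and \bar{s} are notation I am introducing here for illustration, and the variance is the population variance (dividing by N).

```latex
% x_i = number of children in family i, for i = 1, ..., N (families with at least one child)
\[
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{n}{N},
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
\]
% Each of the x_i children in family i has x_i - 1 siblings.
\[
\bar{s}
  = \frac{\sum_{i=1}^{N} x_i (x_i - 1)}{\sum_{i=1}^{N} x_i}
  = \frac{\sum_{i=1}^{N} x_i^2}{\sum_{i=1}^{N} x_i} - 1
  = \mu + \frac{\sigma^2}{\mu} - 1
\]
```

For the four families, \mu = 2 and \sigma^2 = 1.5, giving 2 + 0.75 - 1 = 1.75 siblings per child. The mean number of siblings equals \mu - 1 only when \sigma^2 = 0, that is, when every family has the same number of children.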
 
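What about zeroes?  In the calculations above I did not count childless households as families. If you also count households without children as families, the discrepancy would be larger: the mean number of children per family falls, while the mean number of siblings per child is unchanged.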
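To see the effect of zeroes, here is a small sketch that adds two childless households to the four example families. The two extra households are hypothetical, chosen only for illustration, and the function names are my own.

```python
def mean_children_per_family(sizes):
    """Mean with the family (household) as the unit of analysis."""
    return sum(sizes) / len(sizes)


def mean_siblings_per_child(sizes):
    """Mean with the child as the unit of analysis.

    A family with k children contributes k children, each with k - 1 siblings.
    Childless households contribute no children, so they drop out entirely.
    """
    return sum(k * (k - 1) for k in sizes) / sum(sizes)


with_children_only = [1, 1, 2, 4]
# Hypothetical: the same four families plus two childless households.
with_childless = [0, 0] + with_children_only

for label, sizes in [("excluding zeroes", with_children_only),
                     ("including zeroes", with_childless)]:
    print(label,
          round(mean_children_per_family(sizes), 3),
          round(mean_siblings_per_child(sizes), 3))
```

Counting the childless households pulls the family-level mean down from 2 to about 1.33, while the child-level mean stays at 1.75, so the gap between the naive expectation and the actual figure widens.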
 
Does it matter?  Yes. Much real world data is clustered in this way. It is important to realize that the average individual is not going to be from the average cluster (e.g., the average worker isn't from the average firm).
 
In practical terms this means that careful attention needs to be paid to sampling from families, workplaces, schools and so forth. A random sample of children will disproportionately sample children from large families. Conversely, a random sample of schools will over-represent small schools relative to the schools that most pupils attend. There are also social policy implications (e.g., if you are interested in reducing child poverty).
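The sampling point can be illustrated with a small simulation on the four example families. This is a sketch under simplifying assumptions (sampling with replacement, a fixed seed, and an arbitrary number of draws), not a model of any real survey design.

```python
import random

# Number of children in each of the four example families.
family_sizes = {"A": 1, "B": 1, "C": 2, "D": 4}
sizes = list(family_sizes.values())

# One record per child, holding the size of that child's family.
children = [k for k in sizes for _ in range(k)]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = 20_000          # arbitrary; large enough for stable averages

# Sample families at random and record their size.
via_families = [rng.choice(sizes) for _ in range(draws)]
# Sample children at random and record the size of their family.
via_children = [rng.choice(children) for _ in range(draws)]

family_level_mean = sum(via_families) / draws
child_level_mean = sum(via_children) / draws

print(round(family_level_mean, 2))  # close to 2
print(round(child_level_mean, 2))   # close to 2.75
```

Half of the children belong to family D, so a sample of children lands in that family half of the time even though it is only one family in four. The child-level figure of about 2.75 is the mean family size experienced by a child, which is one more than the 1.75 siblings calculated earlier.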
 
There are also other problems with the analysis of clustered data. For this reason anyone working with clustered data needs to seriously consider using multilevel modeling or other methods that i) take the clustering into account and ii) allow hypotheses about different levels of the model (e.g., children and families) to be explored.


Postscript
 
This is quite an old puzzle. I first came across this puzzle in the article by Jenkins and Tuten (1992). They include formulae for deriving one average from the other and cite Huntington (1924) and other later observations of the same phenomenon. I made the connection to multilevel models a little later. For a good (if slightly out of date) introduction to multilevel models see Snijders and Bosker (1999). Recently I noticed that Kish (1965) discusses the same phenomenon in passing.
 
This quote from Kish sets out the problem quite clearly:
Although the mean number of adults per household is only 2.02, the mean number of household members is 2.24 for the average adult. The greater size ranges of large organizations produce more striking effects. In 1960, 50 million people lived in 130 U. S. cities that had 100,000 or more population; in this population, the average city size was 0.39 million, but the size of the city in which the average person lived was 2.0 millions. Using medians does not help: the median city size was 0.19 million, but the median person lived in a city of 0.62 million.
Kish (1965, p. 571).

References


Jenkins, J. J., & Tuten, J. T. (1992). Why isn’t the average child from the average family? – and similar puzzles. American Journal of Psychology, 105, 517-526.


Kish, L. (1965). Sampling organizations and groups of unequal sizes. American Sociological Review, 30, 564-572.


Snijders, T., & Bosker, R. (1999). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. London: Sage.
