Friday, January 09, 2009

Simulating data for inquiry based learning

I've just presented a talk at the Statistics for Psychology Students workshop for the HEA Psychology Network in York. Richard Rowe (Psychology, Sheffield) gave an interesting talk on teaching statistics via inquiry based learning. Part of the work involved getting students to generate their own research questions in a tutorial and then analyzing data addressing these questions in a follow-up tutorial. A rather clever idea he reported was to generate suitable data via simulation in STATA. The students were first or second year undergraduates and so the experimental designs they came up with were constrained to simple cases (e.g., t tests, correlations and chi-square for the first years; one-way ANOVA with 3 levels for the second years). This meant that whoever led the tutorial (e.g., a Ph.D. student) could generate tailored data to match the students' research designs.

The novel and clever aspect of this isn't simulating real data sets or getting students to come up with their own research ideas, but combining the two. This means that you can run these sessions without getting students to collect real data. (There is a place for that too, but collecting real data has massive overheads in terms of student time, staff time, ethics approval and so forth.) Furthermore, students can legitimately come up with ideas for studies that can't be run by undergraduates for ethical, resource or other reasons (for example, studies using clinical populations). I can also see other uses for this. For example, a supervisor could simulate data for a final year project and a student could use the simulated data as a dry run for the real analysis. (The real data will usually be much messier, but I think that getting familiar with the statistical software and analyses to be used could be useful for some students.)

A few members of the audience asked questions about STATA (which I'm not familiar with) and I pointed out that you could do the same things in R for free. As I understand it, STATA allows you to simulate data with a specified covariance matrix fairly easily. I'm sure this can be done in R too, but I'm still learning how to use R and tracking down the right package and commands would have taken a bit of time. In any case, for this to be useful it needs to be very simple and easy to run by people with relatively basic statistical computing skills (and who have never used R before). So I set myself a challenge of writing code that should run with functions from the R base package, be trivial to edit and generate usable data for two common analyses: the independent t test and a bivariate correlation.
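For readers curious about the covariance matrix route mentioned above, here is a minimal sketch of one way it can be done with base R functions only. It multiplies independent standard normal scores by the Cholesky factor of the target covariance matrix. The sample size, means, covariance values and variable names are all made up for illustration, so treat it as a sketch under those assumptions rather than a recommended recipe.

```r
# Hypothetical example: simulate n cases from a bivariate normal
# population with a chosen covariance matrix, using base R only.
# All numbers and names below are illustrative assumptions.
set.seed(2009)                        # arbitrary seed, for reproducibility

n     <- 50                           # assumed sample size
mu    <- c(100, 50)                   # assumed population means
sigma <- matrix(c(225,  60,
                   60, 100),
                nrow = 2, byrow = TRUE)   # assumed covariance matrix

z <- matrix(rnorm(n * 2), nrow = n)   # independent standard normal scores
x <- z %*% chol(sigma)                # impose the covariance structure
x <- sweep(x, 2, mu, "+")             # shift each column to its target mean
colnames(x) <- c("score1", "score2")  # hypothetical variable names

cov(x)                                # sample covariance: near sigma, not equal to it
```

The sample covariance will differ from `sigma` because of sampling error, which is what you want for teaching data. The `mvrnorm` function in the MASS package, which ships with standard R installations, does a similar job in a single call.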

It took me about 2 or 3 minutes to write the independent t test code. (This may sound impressive but a competent R programmer could probably do it in under a minute). It took a little longer to comment it and work out how to write it to a tab delimited text file. It took me about 5 or 10 minutes to do the bivariate correlation and a little longer to fix a rather stupid error I'd made. All the code is rather clumsily written (and in some cases I've deliberately separated out steps to make it easier for someone unfamiliar with R to follow). The correlation solution (in particular) isn't very good and specifying a covariance matrix to constrain a simulation would be much more satisfactory. (I should also note that it uses a trick I picked up some time ago from Paul Barrett's web site in his article on Correlation attenuation due to Likert categorization.) The key point is that it took longer to describe here than it took to write. I'm sure better solutions exist in R, but these ones work and could easily be extended to paired t tests or one-way ANOVA. (I'm not sure about chi-square. I could write something, but it might be a lot easier just to specify a 2x2 or 3x2 contingency table by hand).
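To give a flavour of how little code is involved, here is a minimal base R sketch of the two simulations described above. It is not the original code, and every parameter value, variable name and file name in it is an assumption chosen for illustration. The correlation part mixes a variable with independent noise, a common construction that may or may not match the trick from Paul Barrett's article.

```r
# Not the original code: a hypothetical sketch using base R functions only.
# Edit the values marked "assumed" to tailor the data to a research design.
set.seed(1)                 # arbitrary seed, for reproducibility

# --- Independent t test: two groups from normal populations ---
n_per_group <- 20           # assumed number of participants per group
mean_a      <- 50           # assumed population mean, group A
mean_b      <- 55           # assumed population mean, group B
sd_pop      <- 10           # assumed common population SD

ttest_data <- data.frame(
  group = rep(c("A", "B"), each = n_per_group),
  score = c(rnorm(n_per_group, mean_a, sd_pop),
            rnorm(n_per_group, mean_b, sd_pop))
)
t.test(score ~ group, data = ttest_data, var.equal = TRUE)

# Write the data to a tab-delimited text file in the working directory
out_file <- "ttest_data.txt"          # hypothetical file name
write.table(ttest_data, file = out_file, sep = "\t",
            row.names = FALSE, quote = FALSE)

# --- Bivariate correlation: mix x with independent noise ---
n   <- 40                   # assumed sample size
rho <- 0.5                  # assumed population correlation
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)   # y has unit variance, cor(x, y) = rho in the population
cor_data <- data.frame(x = x, y = y)
cor.test(cor_data$x, cor_data$y)
```

Because the samples are random, the observed mean difference and correlation will wander around the population values. Dropping the `set.seed` line gives each tutorial group a different data set from the same design.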

Running the R code is also easy. R runs on PC, Linux and Mac OS X, but if you don't fancy installing it (for some bizarre reason) it can also be run from a web server (though it won't be able to write to a file). To run the code you just paste it into the R console and hit return. To tailor the code just read the comments and tweak the parameter values before pasting.

If you want a copy to play with just email me.

For anyone who wants to do some simulation in R there are lots of resources around (try googling the obvious keywords), but Andrew Gelman and Jennifer Hill's book Data Analysis Using Regression and Multilevel/Hierarchical Models (Analytical Methods for Social Research) is impressive if you want a serious take on using simulation (although the stance is Bayesian this doesn't get in the way - at least for the sections I've read; I'm about 40% through the book).

Postscript. All the talks were video recorded and Anne Trapp threatened to put video podcasts - vodcasts? - on the Psychology Network site at some point in the future - including my talk (Effect size: why what we teach psychology students is wrong). These will be worth looking out for just to see Andy Field's talk. (Andy claims he looks particularly ridiculous when video recorded. My advice: just don't ever look at the recording. Ever.) I'll write something about effect size in the near future - though much of the talk was based on my forthcoming BJP paper on effect size.
