Cheap Data
A repository for information generated by the new education technologies stands as a model for the field
It’s called DataShop, and it is probably the world’s largest repository of data generated by intelligent tutoring systems. Based at the Pittsburgh Science of Learning Center, which is funded by the National Science Foundation, DataShop boasted, as of this past February, some 192,000 hours of student data from studies around the world. (Student actions are recorded roughly every 20 seconds, and the data in DataShop are longitudinal, spanning semester- or year-long courses.) At the most detailed level, those data reflected 71 million individual “transactions,” or computer keystrokes, made by students. No names are linked to any of the data, and researchers who hold accounts can decide whether to make their data accessible to others or keep it private.
Yet DataShop’s most impressive accomplishment may be that researchers who use it are compelled to put their data into a standardized format—which means that, should they choose to share that information, other investigators will be able to use it for comparisons, meta-studies or even secondary analysis. DataShop thus makes it possible for researchers to test their hypotheses without the bother and expense of ever entering a classroom.
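To make the idea concrete, here is a minimal sketch in Python of the kind of secondary analysis that a standardized format allows. It assumes a DataShop-style tab-delimited transaction export; the file name and column names below are illustrative, not the repository’s exact schema.

```python
import pandas as pd

# Hypothetical DataShop-style export: one row per student transaction.
# The file name and column names are illustrative assumptions.
df = pd.read_csv("algebra_transactions.txt", sep="\t")

# Roll keystroke-level transactions up to per-student summaries,
# the sort of analysis possible without ever entering a classroom.
summary = (
    df.groupby("student_id")
      .agg(transactions=("outcome", "size"),
           error_rate=("outcome", lambda s: (s == "INCORRECT").mean()))
)
print(summary.head())
```

Because every contributed dataset shares one layout, the same few lines work unchanged across studies, which is what makes comparisons and meta-studies cheap.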
This does not mean that the proprietors of DataShop (or anyone else in the field) have yet cornered the market on how to actually make use of all that information. In fact, in 2010 DataShop hosted the annual competition of an organization called SIGKDD (the Association for Computing Machinery’s Special Interest Group on Knowledge Discovery and Data Mining), challenging teams from around the world to analyze a year’s worth of data from 9,000-plus algebra students to predict the students’ future academic performance. The sample was so large (9 gigabytes) that only 130 of the 600 teams that entered the contest successfully submitted a possible solution. (Although the competition has since ended, its website is still open and people are still using it.)
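For a sense of what such an entry involves, the sketch below fits a simple logistic-regression baseline that predicts whether a student answers correctly on the first attempt. The file and column names are hypothetical, and the real competition data were far richer; this is a toy stand-in, not any team’s actual method.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical file and column names; the actual contest data included
# problem steps, knowledge components, timing, and much more.
df = pd.read_csv("algebra_train.txt", sep="\t")

X = df[["student_id", "skill_id"]]
y = df["correct_first_attempt"]  # 1 if the first attempt was correct

# One-hot encode the categorical identifiers, then fit a plain
# logistic regression as a baseline predictor of performance.
model = make_pipeline(
    ColumnTransformer([("ids", OneHotEncoder(handle_unknown="ignore"),
                        ["student_id", "skill_id"])]),
    LogisticRegression(max_iter=1000),
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

One-hot encoding students and skills into a logistic regression amounts to a simple item-response-style baseline; the hard part of the contest, as the entry numbers suggest, was scaling far richer models to gigabytes of transactions.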
“The issue is how to create algorithms that can turn millions of data points into information that educators can use,” says John Stamper, a Carnegie Mellon faculty member who is DataShop’s technical director.
When Stamper presented at the first of two conferences on educational data mining convened by TC President Susan Fuhrman in Washington, D.C., researchers were uniformly enthusiastic about being able to access data from other studies. Still, they saw plenty of room for improvement.
“One problem with secondary data analysis is that it’s secondary,” said Kurt VanLehn, Professor of Computer Science and Engineering at Arizona State University. “The people doing the secondary analysis weren’t there. So if DataShop could be a video recording shop, too, that would be great. Otherwise you risk getting theories that are not grounded in the reality of what was going on when the data were recorded.”
Published Wednesday, May 2, 2012