Can a software program effectively replace human essay readers?
Answers to that question vary.
A competition sponsored by the William and Flora Hewlett Foundation recently put up $100,000 in three prizes for a program that performs as well as human scorers in evaluating written essays. The competition, which was posted on Kaggle, a global crowdsourcing and collaboration website for predictive modeling experts, drew 258 players on 159 teams.
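To make the task concrete, here is a minimal, hypothetical sketch of the kind of predictive model a competitor might build: one that learns to map essay text to the scores human readers assigned. The sample essays, feature choices, and scikit-learn pipeline below are illustrative assumptions on my part, not the winning team's actual method.

    # Illustrative only: a toy essay-scoring model, not any team's actual entry.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    # Hypothetical training data: essays paired with scores from human readers.
    essays = [
        "The author supports her thesis with several well-chosen examples.",
        "I think it is good because it is good and also very good.",
        "Historical evidence, such as the Industrial Revolution, supports this view.",
        "This essay repeats the prompt and offers no argument of its own.",
    ]
    human_scores = [5, 2, 6, 1]

    # Bag-of-words (TF-IDF) features feeding a regularized linear regression.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
    model.fit(essays, human_scores)

    # Score a new essay; a real system is judged by how closely predictions
    # like this one agree with the scores trained human readers assign.
    print(model.predict(["Randomness and error affect every measurement, including grades."]))

Real entries were, of course, far more sophisticated, but the underlying idea is the same: train on essays that humans have already scored, then predict the scores humans would give to new ones.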
The three-person team of “SirGuessalot & PlanetThanet & Stefan” (Momchil Georgiev, Jason Tigge, and Stefan Henß) arrived at a system the judges found closest to human reader results. They won the foundation’s first-place award of $60,000. The second-place team will be awarded $30,000 and the third-place team $10,000.
Separately, data scientist Ben Hamner partnered with Mark Shermis, dean of the University of Akron's College of Education, to research and write “Contrasting State-of-the-Art Automated Scoring of Essays: Analysis.” Their research, also funded by the Hewlett Foundation, leads them to conclude that “the automated essay scoring engines performed quite well.”
Both examples showcase the push to have so-called automated assessment systems “grade,” or judge, written work. Some regard this effort as a sign of real progress. In the view of Steve Graham, a professor at Vanderbilt University, humans are not very good at objective assessment.
That is also what Leonard Mlodinow suggests in his book The Drunkard’s Walk: How Randomness Rules Our Lives (Pantheon Books, 2008). Mlodinow recounts his dismay at a 93, the score his son’s high school teacher put on the paper that he -- a published writer -- had rewritten. He attributes the missing points to teacher fallibility, contending that “a teacher’s assessment, like any measurement, is susceptible to random variance and error” (p. 126).
Other people are appalled at the prospect of a machine assessing human writing. Les Perelman, the director of writing at MIT, falls into that camp. He finds the e-Rater automated scoring system from the nonprofit ETS seriously flawed because it “can be easily gamed, is vulnerable to test prep, sets a very limited and rigid standard for what good writing is, and will pressure teachers to dumb down writing instruction.” Perelman has shown that essays containing patently false statements can still earn perfect scores from e-Rater.
It is only fair to point out that it is possible to “game” a human scorer, as well. For instance, students have observed that the key to getting an A from a certain teacher is to include a PowerPoint presentation.
The problem of incorrect statements is not unique to e-Rater, either. As a regular scorer for the SAT essay, I internalized long ago that I am not supposed to hold statements like “Albert Einstein invented the lightbulb” against the student. The rationale is that we are assessing students’ ability to develop and support a point of view -- not how well they know the history, literature, or science they refer to. If that’s a flaw, it exists in tests scored by humans as well as by machines.
Imperfect though they may be, automated assessment systems are not only on the way; they are already here. As Hamner and Shermis’s analysis quoted above points out, automated systems currently take the place of a human as second reader “for high stakes assessment in several general tests (e.g., TOEFL, GMAT) and... for some licensing exams (e.g., AICPA).”
As a result of the Hewlett competition, it is possible that even more exams will be scored by automated assessment systems. Some may regard that as a blessing; others, as a curse. It certainly has kicked up quite a bit of debate. What do you think?
— Ariella Brown is a freelance writer, editor, and social media consultant.