Can a software program effectively replace human essay readers?
Answers to that question vary.
A competition sponsored by the William and Flora Hewlett Foundation recently put up $100,000 in three prizes to discover a program that performs as well as human scorers in evaluating written essays. The competition, which was posted on Kaggle, a global crowdsourcing and collaboration Website for predictive modeling experts, drew 258 players in 159 teams.
The three-person team of “SirGuessalot & PlanetThanet & Stefan” (aka Momchil Georgiev, Jason Tigge, and Stefan Henß) arrived at a system the judges found closest to human reader results. They won the foundation’s first-place award of $60,000. The second-place winners will be awarded $30,000 and the third-place winners $10,000.
Separately, data scientist Ben Hamner partnered with Mark Shermis, the dean of the University of Akron's College of Education, to research and write “Contrasting State-of-the-Art Automated Scoring of Essays: Analysis.” Their research, also funded by the Hewlett Foundation, leads them to conclude that “the automated essay scoring engines performed quite well.”
Both examples showcase the quest for so-called automated assessment systems to “grade” or judge written work. Some regard this effort as a sign of real progress. In the view of Steve Graham, a professor at Vanderbilt University, humans are not very good at objective assessment.
That is also what Leonard Mlodinow suggests in his book The Drunkard’s Walk: How Randomness Rules Our Lives (Pantheon Books, 2008). Mlodinow recounts his dismay at a 93, the score his son’s high school teacher put on the paper that he -- a published writer -- had rewritten. He attributes the missing points to teacher fallibility, contending that “a teacher’s assessment, like any measurement, is susceptible to random variance and error” (p. 126).
Other people are appalled at the prospect of a machine assessing human writing. Les Perelman, the director of writing at MIT, falls into that camp. He finds the e-Rater automated scoring system from nonprofit group ETS seriously flawed because it “can be easily gamed, is vulnerable to test prep, sets a very limited and rigid standard for what good writing is, and will pressure teachers to dumb down writing instruction.” Perelman shows that essays that include patently false statements can still earn perfect scores on e-Rater.
It is only fair to point out that it is possible to “game” a human monitor, as well. For instance, students have observed that the key to getting an A from a certain teacher is to include a PowerPoint presentation.
The problem of incorrect statements is not unique to e-Rater either. As a regular scorer for the SAT essay, I long ago internalized that I am not supposed to hold statements like “Albert Einstein invented the lightbulb” against the student. The rationale is that we are assessing the students’ ability to develop and support a point of view -- not how well they know the history, literature, or science they refer to. If that’s a flaw, it exists in tests scored by humans as well as by machines.
Imperfect though they may be, automated assessment systems are not only on the way, they are already here. As Hamner and Shermis’s analysis quoted above points out, automated systems currently take the place of a human as second reader “for high stakes assessment in several general tests (e.g., TOEFL, GMAT) and... for some licensing exams (e.g., AICPA).”
As a result of the Hewlett competition, it is possible that even more exams will be scored by automated assessment systems. Some may regard that as a blessing, but others as a curse. It certainly has kicked up quite a bit of debate. What do you think?
This reminds me that I have a very smart and tech-savvy friend who thinks having machines write successful pop tunes is just around the corner. He thinks the algorithms are readily discoverable. Does that sound plausible?
If so, I can believe that the machines could at least be successful in screening written texts for quality. Nicole is right, of course, that they are going to miss things which are deeply interesting, but not in a predictable or mainstream way. For better or worse, a Gertrude Stein would surely get an F.
I remember having to completely reverse my thesis in an essay for an AP English class in order to get a good grade. I didn't really agree with what I had written, but my teacher did, so I got the 'A' I wanted.
Some of my kids are now being assigned to log in to a particular website to do their writing, and it grades them as they write. My daughter was trying to get a better grade, and enlisted my help as an editor. The more correct her writing became, the lower the score got. We finally just gave up.
@Joe When I worked as a writing center tutor in colleges, students would boast of how they identified what they considered key to getting a better grade. I recall one student showing off his story (this was a writing exercise, a fable with a moral, based on Beat Not the Poor Desk) in which a lioness argues for the right to join the male lions on the hunt. He thought he was being very clever in appealing to the teacher's feminist streak. He had no clue, of course, that, in fact, the lionesses are the ones who hunt, and the male lions usually stay back. Now, if he had been graded on knowledge, he would have lost points on that.
Joe, I think we're in complete agreement. All programs reflect the views and biases of their builders (hello, Tron). Always have, always will. But the amazing benefits of letting students build their writing skill with automated assistance are to be highly encouraged (there are never enough hours in the day for even the most talented teachers to give their students all the attention each requires). I just hope the tools are used in the right mixture, as an aid rather than a crutch. Oh, and just so you understand: I do believe my "exhibits" were just part of my education. For every example like these, I can cite others, like the math TA who didn't laugh when I "re-discovered" Pascal's Triangle, or the Journalism prof who had me read my news copy from the college radio station in class so his students could see the difference between print and broadcast journalism. I'll try to do a better job in the future of making my sense of irony more obvious. And one final note: can't recall who had pointed out that students will learn to 'game' the device to boost their scores. What, that hasn't already been done with professors? Ariella's example of a prof giving a better grade because he supported the student's position is proof positive that this kind of 'gaming' has always been part of the process. The more things change....
Compelling point about human frailties and biases, Duke, but that would seem to support the opposite point if you think about it -- because the computers are only as constant and unbiased as the humans who build and program them.
Indeed, linguistic sentiment analysis, while a VERY promising field, still isn't quite there yet in terms of picking up subtleties of human language (esp. things like irony and sarcasm). For this reason, a creative, unique, poignant, extremely well-written, and otherwise spot-on student paper could be given a sub-par grade because it does not neatly fit the rubric that the machine is looking for.
Similarly, it seems to me that students, once they catch wise, would learn how to game the machines, writing merely what the machine is scanning for without actually putting much of substance or literary merit down on the paper.
The incidences of corruption you suffered academically are inexcusable, and I have my own tales I could tell, but I'm not sure our collective anecdotal evidence merits discarding the baby with the bathwater.
@mhhfive As an instructor, I've sometimes shared a student essay with the class, though due to the defensive reaction one encounters, that is usually done for very good essays -- catching what they do right rather than pointing out what they do wrong. For the latter, you usually have to work with something not written by a student in the class because some people are so sensitive to criticism that they will not view it as constructive and may even feel publicly shamed. For a while it was popular to have students write in groups. Aside from the problem that Susan Cain points out with such a setup in Quiet -- that groups stifle introverts altogether -- there could be the problem of the blind leading the blind, because many may not be at the point where they can recognize what is correct or what makes a better constructed sentence or paragraph. Consequently, the teacher really has to direct things and point students in the right direction. That is difficult to do for each and every piece of writing in a class that can easily exceed 20 students; in fact, it may be impossible given the time constraints of the class. So I can see the appeal of letting a computer take over the individualized response for the writing stages.
@Nicole, but what of the 9 publishers who passed on Harry Potter? Certainly, to err is human, and that includes errors of assessment. That is not to say that a robo-reader would pick up on the fact that this story would spawn the biggest hit ever for children's books and films. In truth, these kinds of things are not altogether predictable because success does not depend on quality alone but on the convergence of favorable circumstances to bring the work to the public's attention at just the time when it has a taste for it. It is possible that if Harry Potter had come out 20 years earlier, it would have had only modest sales rather than becoming the mega-hit it was. Steve Hely's How I Became a Famous Novelist, which did not become that kind of hit, highlights the vagaries of the publishing industry and posits a writer gaming the system, as it were.
The ThinkerNet does not reflect the views of TechWeb. The ThinkerNet is an informal means of communication to members and visitors of the Internet Evolution site. Individual authors are chosen by Internet Evolution to blog. Neither Internet Evolution nor TechWeb assume responsibility for comments, claims, or opinions made by authors and ThinkerNet bloggers. They are no substitute for your own research and should not be relied upon for trading or any other purpose.