October 25, 2006 -- If you liked “Inside Man,” “Walk the Line” and “Crash,” but didn’t like “The Da Vinci Code” or “Big Fish,” how will you feel about “King Kong”?
Coming up with a new way to predict what movies people will enjoy, based on their past experiences, could win you the Netflix Prize -- and leave you one million dollars richer. To help set up this competition, Netflix called on Charles Elkan, a UC San Diego computer science and engineering professor who runs data mining competitions for students.
To win the Netflix Prize, a contestant must improve the video rental company’s existing in-house movie recommendation system by more than 10 percent. A winning system must use existing information on which movies customers have liked and disliked in the past to make new movie recommendations.
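The core idea can be sketched in a few lines. The names and ratings below are invented, and the predictor is a deliberately simple item-average rule; the actual in-house system is far more sophisticated.

```python
# Sketch: recommend movies a customer hasn't rated, predicting their
# rating as the average other customers gave. All data is hypothetical.
ratings = [  # (customer, movie, stars) -- invented example data
    ("ann", "Inside Man", 5), ("bob", "Inside Man", 4),
    ("ann", "Crash", 4), ("bob", "Big Fish", 2),
    ("cid", "Inside Man", 5), ("cid", "Big Fish", 1),
]

def item_average(movie):
    """Average rating all customers gave a movie."""
    stars = [s for _, m, s in ratings if m == movie]
    return sum(stars) / len(stars) if stars else None

def recommend(customer, threshold=3.5):
    """Unseen movies whose average rating clears the threshold."""
    seen = {m for c, m, _ in ratings if c == customer}
    candidates = {m for _, m, _ in ratings} - seen
    return sorted(m for m in candidates if item_average(m) >= threshold)

print(recommend("bob"))  # -> ['Crash']
```

A real contender would model each individual customer's taste rather than a global average, which is where the hard data mining work comes in.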
Building such a system requires a data mining approach, and Elkan has clear ideas on how data mining competitions ought to be run.
As a paid consultant, Elkan helped to shape the Netflix Prize competition in three main ways.
First, to make it more fun, Elkan suggested an online leaderboard to allow both spectators and contestants to monitor the competition from the web.
Second, Elkan pushed to have the competition switch into a final month-long phase as soon as one contestant reaches the 10 percent improvement mark. “If you want a competition to promote the growth of knowledge, you let people know it’s feasible and see if anyone else can reach the goal,” explained Elkan. The final 30-day period also helps to guard against something similar to “eBay sniping,” the online-auction tactic of waiting until just before an auction closes to submit a final bid and catch the other bidders off guard. In the competition’s terms: a contestant could hold back a recommendation system until the last moment and win the prize with the lowest possible level of improvement above the 10 percent minimum.
Finally, Elkan called for data segregation. The data sets that contestants use to test their movie preference prediction systems will not be used to evaluate system performance at the final judging. Keeping familiar data sets out of the judging phase should help ensure that the best system wins by penalizing algorithms geared specifically to the training data or leaderboard data, a phenomenon known as overfitting the data.
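The reasoning behind that segregation can be shown concretely: score a system only on ratings it never saw, and memorizing the training data stops paying off. The split, the synthetic data, and the error metric below are illustrative assumptions, not the competition's actual judging code.

```python
# Sketch of held-out evaluation: a model is judged only on ratings it
# never saw during training, so overfitting the training set is penalized.
import math
import random

random.seed(0)
# Hypothetical (customer, movie) -> stars ratings
data = {(c, m): random.randint(1, 5) for c in range(200) for m in range(30)}

pairs = list(data)
random.shuffle(pairs)
cut = int(0.8 * len(pairs))
train, held_out = pairs[:cut], pairs[cut:]   # judging set stays hidden

def rmse(predict, subset):
    """Root-mean-square error of a predictor over a set of (c, m) pairs."""
    return math.sqrt(sum((predict(p) - data[p]) ** 2 for p in subset)
                     / len(subset))

# An overfit "model": memorize training answers, guess the mean elsewhere.
train_mean = sum(data[p] for p in train) / len(train)
memorizer = lambda p: data[p] if p in train else train_mean

print(rmse(memorizer, train))     # looks perfect on familiar data: 0.0
print(rmse(memorizer, held_out))  # its true quality: much worse
```

The memorizer scores a flawless 0.0 on the data it trained on but does no better than guessing the mean on the hidden set, which is exactly the gap the final-judging data segregation is designed to expose.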
These suggestions come from Elkan’s experiences with UCSD data mining competitions, events that promote teamwork, design, and project cooperation with industry – all important trends in computer science and engineering education.
Fraud detection, business-to-consumer relationship management and manufacturing optimization all involve data mining, explained Elkan, who noted that the UCSD competition is sponsored by Fair Isaac.
“All the big car companies now use data mining on warranty claim data to identify, as soon as possible, if some part is likely to fail,” said Elkan, who is looking forward to both the 2007 UCSD data mining competition and further developments in the Netflix Prize competition.