Finding the Right Data to Train Your Recommender System

There is such a thing as “too much data” when training an algorithm.


Developing a good recommender system means finding enough reliable data to train your algorithm and deliver precise recommendations to every type of user. But believe it or not, there can be such a thing as “too much data.” Or, at the very least, you run a risk whenever your training set includes data that you aren’t 100% certain will contribute to more accurate recommendations.

To give you a sense of what we mean, we’ll go back to the basics of recommender systems. As machine learning engineer George Seif helpfully explains, recommender AIs can be broken down into two broad categories: collaborative filtering systems and content-based systems. Collaborative filtering systems recommend items to users based on their relationship with other items; for example, a system might use your purchase history to recommend other retail products. Content-based systems integrate more data, like demographic information, into their recommendations. For example, such a system might consider your purchase history, age, gender, and location when deciding which products you’re most likely to be interested in.
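To make the first category concrete, here is a minimal sketch of purchase-history recommendations based on how often items are bought together. The user names, items, and the `recommend` helper are all hypothetical, and co-purchase counting is only one simple way to approximate collaborative filtering, not a description of any production system.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase histories: user -> set of purchased items.
purchases = {
    "ana": {"tent", "sleeping bag", "lantern"},
    "ben": {"tent", "sleeping bag"},
    "cal": {"tent", "lantern", "camp stove"},
    "dee": {"novel", "bookmark"},
}

def co_purchase_counts(histories):
    """Count how many users bought each pair of items together."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(user, histories, top_n=2):
    """Rank items the user does not own by co-purchase with items they do own."""
    counts = co_purchase_counts(histories)
    owned = histories[user]
    scores = defaultdict(int)
    for item in owned:
        for other, n in counts[item].items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

print(recommend("ben", purchases))
```

Note that this sketch uses nothing but purchase history. Adding age, gender, or location would require extra inputs, and each of those inputs would need to earn its place.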

You might be tempted to think that content-based systems would be superior here because they use more data. After all, the more information your system has available to it, the better decisions it makes, right? But while a sufficiently large volume of data is obviously important, the truth is that you need to be very picky about the information you train your system on. Could information about what 18-to-34-year-old males tend to shop for help you recommend products to a 24-year-old male user? It’s certainly possible. But since age and gender aren’t directly relevant to retail the way that purchase history is, you should consider whether that data will actually improve your recommender’s precision or potentially harm it.
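One way to answer that question is to measure it: hold out some known user behavior, generate recommendations with and without the extra data, and compare precision. The sketch below uses precision at k as the metric. The recommendation lists and held-out sets are toy values invented purely to show the mechanics, so the numbers say nothing about whether demographic data helps or hurts in practice.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations the user actually engaged with."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)

def mean_precision_at_k(recs_by_user, held_out, k):
    """Average precision at k across all users that received recommendations."""
    scores = [
        precision_at_k(recs, held_out.get(user, set()), k)
        for user, recs in recs_by_user.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Toy held-out data: items each user went on to buy (hypothetical).
held_out = {"u1": {"a", "b"}, "u2": {"c"}}

# Hypothetical outputs of two candidate models for the same users.
purchase_only = {"u1": ["a", "b", "x"], "u2": ["c", "y", "z"]}
with_demographics = {"u1": ["a", "x", "y"], "u2": ["z", "c", "y"]}

print(mean_precision_at_k(purchase_only, held_out, k=2))
print(mean_precision_at_k(with_demographics, held_out, k=2))
```

If the model trained on the larger data set does not score higher on held-out behavior, the extra data is a candidate for removal.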



This is a question we had to face when we set out to develop Watchworthy. Ranker has a vast trove of voting data about TV, but not all of it was relevant to the purpose of training a recommendation algorithm. For example, we know the voting behavior of people who voted for The Unbreakable Kimmy Schmidt on our list of Best TV Theme Songs of All Time, but that doesn’t necessarily indicate that those people like the show — just that they like the theme song. When compiling the data set that would train Watchworthy’s algorithm, we were careful to only include votes from lists that measured the quality of the shows on them, like Funniest Shows, Best Horror TV Shows of 2018, or Shows with the Best Writing.
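The kind of filtering described above can be sketched as an allowlist applied to vote records before training. The record layout, field names, and `training_votes` helper are assumptions made for illustration and do not reflect Ranker's actual data model or pipeline. The second and third show titles are placeholders.

```python
# Lists assumed to measure show quality (names taken from the examples above).
QUALITY_LISTS = {
    "Funniest Shows",
    "Best Horror TV Shows of 2018",
    "Shows with the Best Writing",
}

# Hypothetical vote records; the schema is an assumption for this sketch.
votes = [
    {"user": "u1", "list": "Best TV Theme Songs of All Time",
     "show": "The Unbreakable Kimmy Schmidt", "vote": 1},
    {"user": "u1", "list": "Funniest Shows", "show": "Show A", "vote": 1},
    {"user": "u2", "list": "Shows with the Best Writing", "show": "Show B", "vote": -1},
]

def training_votes(all_votes, allowed_lists):
    """Keep only votes cast on lists that measure the quality of the show itself."""
    return [v for v in all_votes if v["list"] in allowed_lists]

filtered = training_votes(votes, QUALITY_LISTS)
print(len(filtered), "of", len(votes), "votes kept for training")
```

An allowlist is a deliberately conservative choice here: a list contributes to the training set only after someone has decided it is relevant, rather than by default.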

All this is to say that, whether you’re using your own first-party data to train an algorithm or purchasing it from someone else, the cardinal rule with your training set isn’t “the more the merrier!” It’s best to scrutinize your training set and cut out anything that you’re not certain will get you the results you’re looking for.


These stories are crafted using Ranker Insights, which takes over one billion votes cast on Ranker.com and converts them into actionable psychographics about pop culture fans across the world. To learn more about how our Ranker Insights can be customized to serve your business needs, visit insights.ranker.com, or email us at insights@ranker.com.

