How much should I trust the leaderboard?

Uncategorized No comments

The leaderboard on the Fragile Families Challenge submission site is often the first thing participants focus on. It is therefore important to understand!

Why do we like the leaderboard?

The leaderboard:

shows rankings in real-time, motivating better submissions
demonstrates that models that predict well in the training data do not necessarily perform well in an out-of-sample test
makes the Challenge more fun!

Understanding the data split

However, the leaderboard is only a small portion of the overall data. In fact, the observations (rows) in the data are split into:

4/8 training data
1/8 leaderboard data
3/8 test data

As discussed in our blog post on evaluating submissions, final evaluation will be done on a separate set of held-out test data – the 3/8 portion referenced above. This means all awards (including the progress prizes) will be conducted on the test data, not the leaderboard. Likewise, our follow-up interviews will focus on the test set observations that were not used for training. Separation between the leaderboard and test sets is important; the leaderboard set isn’t truly held out since everyone receives repeated feedback from this set throughout the challenge!

Implications for strategy

What does this mean for your ideal strategy? How can you best make use of the leaderboard?

The leaderboard gives an instant snapshot of your out-of-sample performance. This can be useful in evaluating your model, much as splitting your own training set can be helpful.
However, over-fitting to the leaderboard will only hurt your score in the final test set evaluation
Leaderboard scores are noisy measures of generalization error because they are based on a small sample. So, even as a measure of generalization error, the leaderboard should be interpreted cautiously!

In summary, we expect some models to perform better in the final evaluation than the leaderboard suggests, due to random noise. Likewise, some models will look good on the leaderboard but perform poorly in the final evaluation because they got lucky in the leaderboard. Some submissions may even under-perform in the final evaluation because they made too many modeling adjustments to fit closely to idiosyncrasies of the leaderboard!

Your final evaluation will not be based on the leaderboard, so you are best advised to use it cautiously as one (noisy) bit of information about your generalization error.

Our Blog

Ian Lundberg

May 5, 2017