highly biased training/test split? (Chapter 2 - page 51)

Open • aaarbk opened this issue 5 years ago • 1 comment

On page 53, the author mentions

[...] you can try to use the most stable features to build a unique identifier.

and then proceeds to build an ID from the latitude and longitude. However, several instances in the dataset share the same latitude and longitude, so they get the same identifier and therefore the same hash. Maybe I'm missing something, but doesn't this introduce a very strong algorithmic bias (if that's the right term) into the train/test split, in that instances with the same (latitude, longitude) will always be placed in the same set (either training or test)? Shouldn't we use more features to compute a unique identifier?
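To make the concern concrete, here is a minimal sketch rather than the book's actual code. The rows are made up for illustration, and `in_test_set`, `assign_test_flags`, and the `median_income` column are hypothetical names chosen for this example. It hashes a key built from the chosen columns with `zlib.crc32` and shows that rows sharing a (longitude, latitude) pair necessarily receive the same assignment, whereas a wider key can tell them apart.

```python
from zlib import crc32

import pandas as pd


def in_test_set(key, test_ratio=0.2):
    """Return True if the CRC32 of the key falls in the lowest test_ratio of the 32-bit range."""
    digest = crc32(repr(key).encode("utf-8"))
    return digest < test_ratio * 2**32


def assign_test_flags(df, key_columns, test_ratio=0.2):
    """Hash each row's values in key_columns and flag the row as test or not."""
    keys = df[key_columns].itertuples(index=False, name=None)
    return pd.Series([in_test_set(k, test_ratio) for k in keys], index=df.index)


# Made-up rows for illustration only: the first two share the same coordinates.
districts = pd.DataFrame({
    "longitude": [-122.23, -122.23, -122.25, -118.30],
    "latitude": [37.88, 37.88, 37.85, 34.05],
    "median_income": [8.3252, 3.1200, 7.2574, 2.5000],
})

coord_key = ["longitude", "latitude"]
wider_key = ["longitude", "latitude", "median_income"]  # assumption: one extra feature

# Rows with identical coordinates produce identical keys, hence identical hashes.
print(districts.duplicated(subset=coord_key, keep=False))

coord_flags = assign_test_flags(districts, coord_key)
wider_flags = assign_test_flags(districts, wider_key)

# With the coordinate key, rows 0 and 1 always land in the same set.
print(coord_flags.iloc[0] == coord_flags.iloc[1])

# With the wider key, every row in this toy frame has a distinct identifier.
print(districts.duplicated(subset=wider_key).any())
```

Adding features makes the identifier more likely to be unique, but those features also have to stay stable over time, or a row could move between sets when the data is refreshed.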

Dec 26 '20 10:12 aaarbk

Hi @aaarbk, thanks for your feedback. You're absolutely right: I assumed that every district had a unique position, but in practice there are duplicates. It's unfortunately too late to fix the book, but I'll add a note in the notebook to make this clear. 👍

Mar 03 '21 21:03 ageron