[QUESTION]

KonFey opened this issue 3 years ago • 1 comment

In chapter 2 on page 54, you define a test_set_check function using a hash. If I understand the code correctly, a data point goes into the test set depending on whether its hash value is below or above 20% of the maximum hash value. But that assumes the hash function produces evenly distributed hash values.
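To make the mechanism concrete, here is a minimal standard-library sketch of a hash-threshold split. It is not the book's code: the names `is_in_test_set` and `split_ids_by_hash` are hypothetical, and encoding each ID as 8 little-endian bytes before hashing with CRC32 is an assumption made for illustration.

```python
from zlib import crc32


def is_in_test_set(identifier, test_ratio):
    """Return True if the ID's CRC32 hash falls in the lowest test_ratio
    share of the 32-bit hash range."""
    # Assumption: the ID is an integer, encoded as 8 little-endian bytes.
    id_bytes = int(identifier).to_bytes(8, "little", signed=True)
    # In Python 3, crc32 returns an unsigned value in [0, 2**32 - 1].
    return crc32(id_bytes) < test_ratio * 2**32


def split_ids_by_hash(ids, test_ratio):
    """Split IDs into (train_ids, test_ids) using only each ID's own hash."""
    train_ids, test_ids = [], []
    for id_ in ids:
        (test_ids if is_in_test_set(id_, test_ratio) else train_ids).append(id_)
    return train_ids, test_ids


train_ids, test_ids = split_ids_by_hash(range(40), 0.2)
# The count depends on where these particular hashes happen to land,
# so it is not guaranteed to be exactly 20% of 40.
print(len(test_ids), len(test_ids) / 40)
```

Because each decision depends only on the ID itself, an instance never switches sets when more data is added later. The price is that the test set size is not controlled exactly.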

Consider the following example for a small set:

train_set, test_set = split_train_test_by_id(housing_with_id.iloc[:40], 0.2, "index")
len(test_set)

returns 9 - so the test set is 22.5% of the overall set (9 of 40 rows), not 20%. For bigger sets, the deviation will occur less often and will be smaller, but it may still be there, I guess.

Or am I missing something?

Jan 08 '22 22:01 KonFey

Hi @KonFey,

Yes, you are 100% correct. Hash functions typically produce hashes that look pretty close to random, roughly following a uniform distribution over the range of possible values. This is not exactly true, but close enough for many applications. So the strategy is very close to randomly assigning each instance to the training set or the test set, with a probability equal to the test ratio of landing in the test set. This does not guarantee that you'll get exactly the desired test set ratio, but as you mention, it will be pretty close if the dataset is large enough (that's the law of large numbers in action). So unless you are dealing with very small sets, the strategy will work fine.
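As a rough illustration of how the deviation shrinks with dataset size, the sketch below treats each instance as an independent draw that lands in the test set with probability equal to the test ratio. That is an idealization of the hash, not a property CRC32 guarantees. Under this assumption, the test set fraction has a standard deviation of sqrt(p(1 - p) / n). The function names and the dataset sizes are hypothetical.

```python
import math
import random


def fraction_std(n, test_ratio):
    """Standard deviation of the test set fraction for n independent draws."""
    return math.sqrt(test_ratio * (1 - test_ratio) / n)


def simulate_test_fraction(n, test_ratio, rng):
    """Assign n instances at random and return the observed test fraction."""
    n_test = sum(rng.random() < test_ratio for _ in range(n))
    return n_test / n


rng = random.Random(42)  # fixed seed so the run is repeatable
for n in (40, 1_000, 20_000):
    expected_std = fraction_std(n, 0.2)
    observed = simulate_test_fraction(n, 0.2, rng)
    print(f"n={n}: std of fraction ~ {expected_std:.4f}, one simulated fraction = {observed:.4f}")
```

For n = 40 the standard deviation is about 0.063, so a test set of 9 out of 40 (2.5 points above 20%) is well within one standard deviation. For n = 20,000 it drops to about 0.003.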

In contrast, Scikit-Learn's train_test_split() function always returns splits with the desired size (rounded up or down).
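A small check of that behavior, assuming scikit-learn is installed. The lists of integers are placeholder data, and `random_state=42` is an arbitrary seed chosen only for reproducibility.

```python
from sklearn.model_selection import train_test_split

# 40 placeholder samples: 20% of 40 is exactly 8.
train_40, test_40 = train_test_split(list(range(40)), test_size=0.2, random_state=42)
print(len(train_40), len(test_40))

# 41 placeholder samples: 20% of 41 is 8.2, so the test size has to be rounded.
train_41, test_41 = train_test_split(list(range(41)), test_size=0.2, random_state=42)
print(len(train_41), len(test_41))
```

The trade-off is that the assignment depends on the shuffle rather than on each instance's identifier, so an instance can end up in a different set if the dataset or the seed changes.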

Hope this helps.

Jan 10 '22 07:01 ageron