[QUESTION] Chapter 2: Definition of `similarities` is subject to information leakage?
This question refers to the Jupyter notebook for Chapter 2.
===
The code below creates 10 new similarity features based on the location of the districts. But it also uses "median_house_value" as the sample weight:
```python
housing_labels = strat_train_set["median_house_value"].copy()
...
similarities = cluster_simil.fit_transform(housing[["latitude", "longitude"]],
                                           sample_weight=housing_labels)
```
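For context, here is a minimal sketch of what a `ClusterSimilarity` transformer along these lines looks like (paraphrased, not a verbatim copy of the notebook). It shows that `sample_weight` is only forwarded to `KMeans.fit`, which is the point where the labels would influence the cluster centers:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        # sample_weight only affects where KMeans places the cluster centers
        self.kmeans_ = KMeans(self.n_clusters, random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self

    def transform(self, X):
        # Gaussian RBF similarity of each district to each cluster center
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]
```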
But isn't this a kind of information leakage into the model? The model is going to be trained to predict the median house value and should NOT have any direct information about it.
Using "median_house_value" as the sample weight makes no sense, because it won't be available at prediction time.
The "median_income" feature, on the other hand, would be an adequate sample weight.
In fact, the sample_weight option is used only to demonstrate how to use the ClusterSimilarity class and is ignored after that, so there is no information leakage during training.
However, it is still misleading to use "median_house_value" as the value for sample_weight. Using "median_income" instead results in almost the same clustering.
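For illustration, the alternative suggested here would look roughly like this (a sketch; it assumes "median_income" is still a column of the housing DataFrame used above):

```python
similarities = cluster_simil.fit_transform(
    housing[["latitude", "longitude"]],
    sample_weight=housing["median_income"])
```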
Hi @liganega ,
My apologies for the very late reply. You make an excellent point: indeed, we should not use the targets for clustering, as that would be leakage. But this code was actually just an example to explain sample_weight, and the training code after that doesn't set the sample weights, so the labels are not used to generate features and there is no leakage.
That said, I understand that it's really confusing, since it looks like the clusters shown in the diagram are the ones that will be used for training. So I'll remove sample_weight=housing_labels and just show the actual clusters used during training, which are based only on the latitude and longitude. They end up being similar clusters anyway, since the densest areas are also the most expensive.
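For reference, the weight-free call would look roughly like this (a sketch; the gamma and random_state values are illustrative assumptions, only n_clusters=10 comes from the example above):

```python
cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1.0, random_state=42)
similarities = cluster_simil.fit_transform(housing[["latitude", "longitude"]])
```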
I'll also add a note about this in the notebook.
Thanks again!