handson-ml2

[QUESTION] Chapter 9 Label propagation

Open FatihMercan61 opened this issue 3 years ago • 3 comments

I don't understand these lines of code:

k = 50
y_representative_digits = np.array([4, 8, 0, 6, 8, 3, ..., 7, 6, 2, 3, 1, 1])
y_train_propagated = np.empty(len(X_train), dtype=np.int32)

for i in range(k):
    y_train_propagated[kmeans.labels_ == i] = y_representative_digits[i]

If, for example, i is 2 in the for loop, then the elements of y_train_propagated at the positions where kmeans.labels_ == 2 is true will be set to 0, because y_representative_digits[2] is 0, right?

So the elements of y_train_propagated where kmeans.labels_ equals 2 are set to 0. The label is 2, but the value assigned is 0. Wouldn't that be wrong?

FatihMercan61, May 17 '22 20:05

Hi @FatihMercan61, thanks for your question! There are two types of labels here: class labels and cluster labels. An image's class label corresponds to the digit that this image represents: it's a number from 0 to 9. An image's cluster label is the ID of the cluster that the image belongs to (in this case, a number from 0 to 49, since there are 50 clusters). Since there are multiple ways of writing any digit, there will be multiple clusters for each digit.

After grouping the images into 50 clusters, and finding the most representative image of each cluster, we manually look at each of these 50 most representative images and we write down their class labels. This gives us the array y_representative_digits, which contains 50 class labels. For example, the first cluster (at index 0) corresponds to a digit 4, the second (at index 1) corresponds to an 8, etc.

Then we want to propagate these class labels to every image in their corresponding clusters. So y_train_propagated will be an array of class labels, with one class label per image in the training set. So when we iterate over k, we are iterating over clusters, not classes. Therefore kmeans.labels_==i finds all images in the ith cluster. For all the images in this cluster, we want to use the same class label as the representative image of that cluster: y_representative_digits[i].
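To make the cluster-versus-class distinction concrete, here is a minimal sketch on toy data. It assumes only 3 clusters and 7 images; the `cluster_labels` array is a made-up stand-in for `kmeans.labels_`, and the three values in `y_representative_digits` are hypothetical hand-written labels, not the ones from the notebook.

```python
import numpy as np

# Made-up stand-in for kmeans.labels_: the cluster ID (0..k-1) of each image.
k = 3
cluster_labels = np.array([0, 2, 1, 2, 0, 1, 2])

# Hypothetical class labels written down by hand for the k representative images:
# cluster 0 looks like a 4, cluster 1 like an 8, cluster 2 like a 0.
y_representative_digits = np.array([4, 8, 0])

# Same loop as in the question: give every image in cluster i the class
# label of that cluster's representative image.
y_train_propagated = np.empty(len(cluster_labels), dtype=np.int32)
for i in range(k):
    y_train_propagated[cluster_labels == i] = y_representative_digits[i]

print(y_train_propagated)  # [4 0 8 0 4 8 0]

# The same result with integer-array indexing: look up each image's
# cluster ID in the array of per-cluster class labels.
y_vectorized = y_representative_digits[cluster_labels]
assert np.array_equal(y_train_propagated, y_vectorized)
```

Images in cluster 2 receive the class label 0 here, which is exactly the case asked about: the cluster ID is only an index into y_representative_digits, not a digit.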

Hope this helps!

ageron, May 17 '22 21:05

Hi ageron,

Thank you very much! Now I get it.

FatihMercan61, May 17 '22 22:05

Awesome explanation, thanks a lot!

ghost, May 06 '23 14:05