handson-ml
Chapter 3: MNIST Classification
Hello, I am very new to sklearn. I have a question while learning Chapter 3; the book writes this:
I know the goal is getting the decision scores, but why not use `sgd_clf.decision_function()`?
Good question @anyuese. Because `cv=3`, the `cross_val_predict()` function will split the dataset into 3 distinct parts (called "folds"), then it will create 3 clones of the `sgd_clf`, and it will train all of them like this: the kth clone will be trained on all folds except for the kth fold, and it will be used to make predictions for the kth fold. This means almost 3 times more computing is required when calling `cross_val_predict()` compared to just calling `sgd_clf.decision_function()`. Not quite 3 times, since each clone is trained on just 2/3rds of the training set. But the benefit is that the predictions will be "realistic", in the sense that the model will not have been trained on the data it is making predictions for. So you can get a more precise idea of how well your model is going to perform once it is in production and is fed new data.
I hope this is clear! Note that it is all explained in the book, so don't hesitate to go back and read through the part about K-fold cross-validation, if needed.
Cheers!
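To make the mechanics above concrete, here is a minimal sketch using a small synthetic dataset instead of MNIST (the dataset and sizes here are illustrative, not the book's exact code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict

# Small stand-in dataset (the book uses MNIST and the y_train_5 target).
X, y = make_classification(n_samples=300, random_state=42)
sgd_clf = SGDClassifier(random_state=42)

# With cv=3, three clones are each trained on 2 folds and then score the
# held-out fold, so every score comes from a model that never saw that sample.
y_scores = cross_val_predict(sgd_clf, X, y, cv=3,
                             method="decision_function")
print(y_scores.shape)  # one decision score per training sample
```

Compare this with `sgd_clf.fit(X, y)` followed by `sgd_clf.decision_function(X)`: you would get scores from a model that has already seen every sample, which makes them look optimistically good.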
Thank you, master. I have another problem: using `knn.fit` and `knn.predict` only takes a little time, but using cross-validation prediction and then `f1_score()` costs me a lot of time. I thought that with `cv=3` the computation would be around 3 times that of just using `knn.fit` and `knn.predict`, and that scoring F1 would add little on top. But actually it's not, and I don't know why.
Yes, KNN can be very slow. Try running the code on 1/10th of the dataset to see if it runs smoothly. Normally the cross val functions should be close to 3 times slower when `cv=3`.
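One detail worth noting: KNN is a lazy learner, so `fit()` just stores the training data, while `predict()` does the expensive neighbor search. With `cross_val_predict`, every training sample must be predicted once, so the cost is dominated by prediction, not by the 3 cheap fits. A hedged sketch of the 1/10th-subset experiment, on a synthetic dataset rather than MNIST:

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset; the real experiment would use the MNIST training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Keep 1/10th of the samples, chosen at random.
subset = np.random.RandomState(42).permutation(len(X))[: len(X) // 10]
X_small, y_small = X[subset], y[subset]

knn = KNeighborsClassifier()
t0 = time.time()
y_pred = cross_val_predict(knn, X_small, y_small, cv=3)  # prediction dominates
f1 = f1_score(y_small, y_pred)
print(f"cross_val_predict on 1/10th: {time.time() - t0:.2f}s, F1 = {f1:.3f}")
```

If this runs quickly, the slowdown on the full dataset is just KNN's per-prediction cost scaling with both the number of queries and the number of stored training samples.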
Hi @ageron,
Thanks for your explanation above. I have a question regarding the line `precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)`. Why is the length of all the possible thresholds 59698 instead of 60000? A naive way to think about this is that you could use every y_score as a possible threshold, so that you would have 60000 sets of different prediction results.
Thank you in advance. Regards, QY
Hi @qy-yang, great question! I haven't checked, but I suppose these are all the distinct scores.
Hi @qy-yang, @ageron,
I had the same doubt expressed by @qy-yang. For me, given that the scores for this specific example are all distinct (when I run `len(y_scores)` I get 60000), the point is the one specified here. Basically, the output is omitted for all thresholds that result in full recall, thus causing `thresholds` to be shorter than `y_scores`.
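A small demo of both effects on a toy example (the numbers here are made up purely for illustration): `precision_recall_curve` only uses each *distinct* score as a candidate threshold, and it stops once full recall is reached, since lowering the threshold further can only add false positives.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.4, 0.9])  # note the duplicate 0.4

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# 6 scores, but only 5 distinct values, and the lowest threshold (0.1) is
# dropped because recall is already 1.0 at threshold 0.35.
print(len(y_scores), len(thresholds))
```

So with 60000 scores, getting 59698 thresholds is consistent with duplicated score values and/or a tail of thresholds that only occur after full recall is reached.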