handson-ml
Chapter 3: MNIST Classification
Hello, I am very new to sklearn. I have a question while learning Chapter 3; the book writes this:
I know the goal is getting the decision scores, but why not use `sgd_clf.decision_function()`?
Good question @anyuese. Because `cv=3`, the `cross_val_predict()` function will split the dataset into 3 distinct parts (called "folds"), then it will create 3 clones of the `sgd_clf`, and it will train all of them like this: the kth clone will be trained on all folds except for the kth fold, and it will be used to make predictions for the kth fold. This means almost 3 times more computing is required when calling `cross_val_predict()` compared to just calling `sgd_clf.decision_function()`. Not quite 3 times, since each clone is trained on just 2/3rds of the training set. But the benefit is that the predictions will be "realistic", in the sense that the model will not have been trained on the data it is making predictions for. So you can get a more precise idea of how well your model is going to perform once it is in production and is fed new data.
I hope this is clear! Note that it is all explained in the book, so don't hesitate to go back and read through the part about K-fold cross-validation, if needed.
Cheers!
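To make the mechanics above concrete, here is a minimal sketch using a small synthetic dataset instead of MNIST (the dataset and sizes here are illustrative, not the book's exact code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict

# Small stand-in dataset (the book uses MNIST and the y_train_5 target).
X, y = make_classification(n_samples=300, random_state=42)
sgd_clf = SGDClassifier(random_state=42)

# With cv=3, three clones are each trained on 2 folds and then score the
# held-out fold, so every score comes from a model that never saw that sample.
y_scores = cross_val_predict(sgd_clf, X, y, cv=3,
                             method="decision_function")
print(y_scores.shape)  # one decision score per training sample
```

Compare this with `sgd_clf.fit(X, y)` followed by `sgd_clf.decision_function(X)`: you would get scores from a model that has already seen every sample, which makes them look optimistically good.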
Thank you, master. I have another problem: using `knn.fit` and `knn.predict` only takes a little time, but using cross-validation prediction and then `f1_score()` costs me a lot of time. I thought that with `cv=3` the computation would be around 3 times that of just using `knn.fit` and `knn.predict`, and that scoring F1 would add little on top. But actually it's not, and I don't know why.
Yes, KNN can be very slow. Try running the code on 1/10th of the dataset to see if it runs smoothly. Normally the cross val functions should be close to 3 times slower when `cv=3`.
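One detail worth noting: KNN is a lazy learner, so `fit()` just stores the training data, while `predict()` does the expensive neighbor search. With `cross_val_predict`, every training sample must be predicted once, so the cost is dominated by prediction, not by the 3 cheap fits. A hedged sketch of the 1/10th-subset experiment, on a synthetic dataset rather than MNIST:

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset; the real experiment would use the MNIST training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Keep 1/10th of the samples, chosen at random.
subset = np.random.RandomState(42).permutation(len(X))[: len(X) // 10]
X_small, y_small = X[subset], y[subset]

knn = KNeighborsClassifier()
t0 = time.time()
y_pred = cross_val_predict(knn, X_small, y_small, cv=3)  # prediction dominates
f1 = f1_score(y_small, y_pred)
print(f"cross_val_predict on 1/10th: {time.time() - t0:.2f}s, F1 = {f1:.3f}")
```

If this runs quickly, the slowdown on the full dataset is just KNN's per-prediction cost scaling with both the number of queries and the number of stored training samples.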
Hi @ageron,
Thanks for your explanation above. I have a question regarding the line `precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)`. Why is the length of all the possible thresholds 59698 instead of 60000? A naive way to think about this is that you could use every y_score as a possible threshold, so that you would have 60000 sets of different prediction results.
Thank you in advance. Regards, QY
Hi @qy-yang, great question! I haven't checked, but I suppose these are all the distinct scores.
Hi @qy-yang, @ageron,
I had the same doubt expressed by @qy-yang. For me, given that the scores for this specific example are all distinct (when I run `len(y_scores)` I get 60000), the point is the one specified here. Basically, the output is omitted for all thresholds that result in full recall, thus causing `thresholds` to be shorter than `y_scores`.
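A small demo of both effects on a toy example (the numbers here are made up purely for illustration): `precision_recall_curve` only uses each *distinct* score as a candidate threshold, and it stops once full recall is reached, since lowering the threshold further can only add false positives.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.4, 0.9])  # note the duplicate 0.4

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# 6 scores, but only 5 distinct values, and the lowest threshold (0.1) is
# dropped because recall is already 1.0 at threshold 0.35.
print(len(y_scores), len(thresholds))
```

So with 60000 scores, getting 59698 thresholds is consistent with duplicated score values and/or a tail of thresholds that only occur after full recall is reached.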