sherlock-project
Wrong predictions while testing new data
I have trained a Sherlock model and it performs well on the test data. But when I test the model by passing data to it as in the 'Sherlock out-of-the-box' notebook, it gives wrong predictions (even passing the training data in the same way results in wrong predictions). Does a separate approach need to be taken for testing the data? Note: I have created my own paragraph vectors with respect to the data I have, and I am using those for training the Sherlock model as well.
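For reference, this is roughly the flow I am following (a sketch paraphrasing the out-of-the-box notebook; exact imports and signatures may differ between versions):

```python
import numpy as np
import pandas as pd

from sherlock.features.preprocessing import extract_features
from sherlock.deploy.model import SherlockModel

# Each entry is one column of values to classify.
data = pd.Series(
    [
        ["Jane Smith", "Lute Ahorn", "Anna James"],
        ["Amsterdam", "Haarlem", "Zwolle"],
    ],
    name="values",
)

# Write the extracted features to a temporary CSV, then read them back.
extract_features("../temporary.csv", data)
feature_vectors = pd.read_csv("../temporary.csv", dtype=np.float32)

# Load the trained weights and predict semantic types.
model = SherlockModel()
model.initialize_model_from_json(with_weights=True, model_id="sherlock")
predicted_labels = model.predict(feature_vectors, "sherlock")
```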
Hi @SivaNagendra-sn,
Thanks for reporting your problem here. Did you change the identifiers when initializing the model and making inferences with it? The "sherlock" identifiers in the respective parts of the notebook should be replaced with the identifier that you gave to the newly trained model.
Madelon
Hi @madelonhulsebos, thanks for the reply. I have replaced the paragraph vector file (.pkl) used for extracting features and training the Sherlock model. By identifiers, do you mean the feature column identifiers (the .tsv files)? If so, we have not changed anything in those .tsv files. Can you elaborate on what needs to be changed there? If not, can you explain what those identifiers actually are?
Hi @SivaNagendra-sn,
To use the model retrained with the new paragraph vector files, the model_id occurrences in the notebook ("sherlock" in the attached screenshot) should be replaced with the identifier of the new model. No changes should be made to the feature identifiers in the .tsv files. I hope that helps.
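Concretely, the replacement looks something like this (a sketch; "retrained_sherlock" stands in for whatever identifier you gave your model):

```python
from sherlock.deploy.model import SherlockModel

model = SherlockModel()

# The out-of-the-box notebook loads the pretrained weights:
#   model.initialize_model_from_json(with_weights=True, model_id="sherlock")
#   predicted_labels = model.predict(feature_vectors, "sherlock")

# For a retrained model, *both* occurrences must use the new identifier
# (feature_vectors is the DataFrame produced by extract_features):
model.initialize_model_from_json(with_weights=True, model_id="retrained_sherlock")
predicted_labels = model.predict(feature_vectors, "retrained_sherlock")
```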
Yeah, I have actually done that. While training the Sherlock model I set the model_id to "retrained_sherlock", and when calling the predict function I also pass model_id as "retrained_sherlock". On the test split it gives results with good accuracy. But when testing with new data (extracting features with the extract_features function and then calling predict with model_id set to "retrained_sherlock"), the predictions are totally wrong ☹️.
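On the training side, what I do looks roughly like this (a sketch from memory of the train notebook, so treat the exact fit/store signatures as assumptions):

```python
from sherlock.deploy.model import SherlockModel

model = SherlockModel()
# X_train/X_validation are extracted feature vectors, y_train/y_validation
# the corresponding semantic type labels.
model.fit(X_train, y_train, X_validation, y_validation, model_id="retrained_sherlock")
model.store_weights(model_id="retrained_sherlock")
```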
I have retraining and prediction working on new data, but only if the fields are mostly text. For numeric fields with a length of 12 or more it does not work well: the prediction vector returned is null, even though the classification score and the output for the test data look good. Do you have any suggestions? @madelonhulsebos
OK, that should be alright then, @SivaNagendra-sn. Is your training data formatted exactly like the original training data (as downloaded through the data download)? The feature extraction pipeline expects "stringified" lists. The input data may be wrong in your case as well, @iganand.
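For illustration, an input series in the expected format would look something like this (a sketch; the column name and example values are illustrative, and note that numeric values are stringified too):

```python
import pandas as pd

# Each entry is one column's values, serialized as a *stringified* Python
# list -- the same format as the downloaded training data.
data = pd.Series(
    [
        "['Jane Smith', 'Lute Ahorn', 'Anna James']",
        "['Amsterdam', 'Haarlem', 'Zwolle']",
        # Numeric columns should also be lists of strings; raw numerics
        # (especially long ones) may trip up the feature extractors.
        "['601784565721', '439827465012', '920034176584']",
    ],
    name="values",
)
```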
I am getting null in the prediction vector, even though the classification report for that specific field shows an F1 score of 0.87. What might be the reason?