spec2vec
Getting nan score for spec2vec_similarity function
Hi,

I am using two similarity functions to compute similarity scores between the spectra of two files:
- Spec2vec: scores = calculate_scores(ref_spectrums, query_spectrums, spec2vec_similarity)
- CosineGreedy: scores = calculate_scores(references=spectrums1, queries=spectrums2, similarity_function=CosineGreedy())

However, I get nan scores for spec2vec, which I think is because of the low similarity between the pairs of spectra. For example, here are the results of the two similarity functions for two spectra:
- spec2vec: Reference scan id: F1:2478, Query scan id: 3350, Score: [nan]
- CosineGreedy: Reference scan id: F1:2478, Query scan id: 3850, Score: [0.004275957907034389, 4]

I tried changing the allowed missing percentage from 5 to a higher value, but it didn't work. Could you please tell me how I can get a score rather than nan from the spec2vec similarity function? Thanks!
This could be several things. Usually I would expect the score to be 0 if something went wrong.
How did you get Score?
Here is the code I used to calculate the similarity score for two files containing 5 spectra (just for test):
from matchms import calculate_scores
from spec2vec import Spec2Vec

# load_data, create_spectrum_documents and create_model are helpers defined elsewhere (not shown).
def calculate_similarity_spec2vec(ref_file, query_file, model_file):
    # Load reference spectrums
    ref_spectrums = load_data(ref_file)
    # Create spectrum documents
    ref_documents = create_spectrum_documents(ref_spectrums)
    # Load query spectrums
    query_spectrums = load_data(query_file)
    query_documents = create_spectrum_documents(query_spectrums)
    # Build model (here from the query documents only)
    # model = create_model(ref_documents, model_file)
    model = create_model(query_documents, model_file)
    # Define similarity function
    spec2vec_similarity = Spec2Vec(model=model, intensity_weighting_power=0.5,
                                   allowed_missing_percentage=5.0)
    # Calculate scores on all combinations of reference and query spectrums
    # scores = calculate_scores(ref_documents, query_spectrums, spec2vec_similarity)
    scores = calculate_scores(ref_spectrums, query_spectrums, spec2vec_similarity)
    return scores
scores is an ndarray with shape (5, 5) containing nan values. When I set allowed_missing_percentage to .98, the scores are high, which is not correct. When the ref and query files are the same, it returns high scores. The similarity scores returned by the ModifiedCosine similarity function are not high for any spectrum pair; the values are between 0 and 0.03.
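To narrow down which spectra produce the nan entries, one option is to check which rows (references) and columns (queries) of the score matrix are entirely nan. The snippet below is only a sketch using plain NumPy; the matrix values are made up for illustration and are not taken from the files above.

```python
import numpy as np

# Hypothetical 3x3 score matrix (rows = references, columns = queries).
scores = np.array([
    [np.nan, np.nan, np.nan],
    [0.90, 0.89, 0.93],
    [0.91, 0.90, 0.94],
])

# Boolean mask of nan entries.
nan_mask = np.isnan(scores)

# Indices of references / queries for which every score is nan.
nan_rows = np.where(nan_mask.all(axis=1))[0]
nan_cols = np.where(nan_mask.all(axis=0))[0]

print("references with only nan scores:", nan_rows.tolist())
print("queries with only nan scores:", nan_cols.tolist())
```

If whole rows are nan while the other rows look normal, that would suggest the problem lies with specific reference spectra rather than with every pair.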
I learned that this happens because there is low similarity between the spectra, so the missing percentage is bigger than allowed_missing_percentage in the _check_model_coverage function. For example, for the files I tested the calculated missing percentage was around 86, so I set allowed_missing_percentage to 88, but then it calculated high similarity scores, which is not correct:
[[       nan        nan        nan        nan        nan]
 [0.90276866 0.89489005 0.93134088 0.91878785 0.92963196]
 [0.91237498 0.90594655 0.94265932 0.92595671 0.94527687]
 [0.92896287 0.92476147 0.95530618 0.93518096 0.95431371]
 [0.93886538 0.92957287 0.95451101 0.93058971 0.95301684]]
I think there is no way to fix it, right?
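For reference, the idea behind a model coverage check can be illustrated with a small standalone sketch. This is not the spec2vec implementation of _check_model_coverage; it only assumes that the missing percentage is the intensity-weighted share of a document's words that are absent from the model vocabulary. The function name missing_percentage, the "peak@..." word strings and the toy data are all hypothetical.

```python
import numpy as np


def missing_percentage(words, weights, vocabulary, intensity_weighting_power=0.5):
    """Weighted percentage of a document's words that the vocabulary lacks."""
    weights = np.asarray(weights, dtype=float)
    # Raise the peak weights to the chosen power, as a stand-in for intensity weighting.
    raised = np.power(weights, intensity_weighting_power)
    # True for every word the (hypothetical) model has never seen.
    missing = np.array([word not in vocabulary for word in words], dtype=bool)
    total = raised.sum()
    if total == 0:
        return 0.0
    return 100.0 * raised[missing].sum() / total


# Toy example: a model trained on only a few spectra knows few of the
# peaks that occur in another spectrum.
vocabulary = {"peak@100.00", "peak@150.10"}
reference_words = ["peak@100.00", "peak@200.20", "peak@300.30", "peak@400.40"]
reference_weights = [1.0, 1.0, 1.0, 1.0]

pct = missing_percentage(reference_words, reference_weights, vocabulary)
print(f"missing: {pct:.1f}%")
```

Under this assumption, a model built from only 5 spectra would cover little of what appears in a different set of spectra, which would be consistent with a missing percentage as high as 86. Raising the threshold only lets the comparison proceed on the small covered part, so the resulting scores may not be meaningful.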