Invalid words in vocabulary?

Open kaushikacharya opened this issue 3 years ago • 0 comments

While exploring nearest neighbors, I have seen many tokens that do not look like valid words.

import fasttext
model = fasttext.load_model('./BioSentVec/models/BioWordVec_PubMed_MIMICIII_d200.bin')
model.get_nearest_neighbors('kidney', 30)

This gives the following output:

[(0.9160109162330627, u'kidney*'),
(0.9024562239646912, u'kidney=='),
(0.8989526033401489, u'kidneyks'),
(0.8850656747817993, u'kidney=5'),
(0.8817461133003235, u'kidney-kidney'),
(0.878646731376648, u'1kidney'),
(0.8774275183677673, u'2kidney'),
(0.87574702501297, u'kidney.6'),
(0.8753364682197571, u'qkidney'),
(0.8732652068138123, u'vkidney'),
(0.8732365369796753, u'kidney=48'),
(0.8726592659950256, u'kidneyx2'),
(0.8723018765449524, u'kidney.7'),
(0.8717607259750366, u'kidney2'),
(0.8697896003723145, u'kidneyys'),
(0.8693934679031372, u'kidneyl'),
(0.8692406415939331, u'kidneyds'),
(0.8683575987815857, u'lkidney'),
(0.8680075407028198, u'kidney*liver'),
(0.8666210174560547, u'kidney.5'),
(0.866155207157135, u'e1kidney'),
(0.8647593855857849, u'ckidney'),
(0.8646546006202698, u'ekidney'),
(0.8636531233787537, u'kidney~the'),
(0.861596941947937, u'kidney.2'),
(0.8599884510040283, u'dkidney'),
(0.8594629764556885, u'kidney.3'),
(0.8585801124572754, u'=kidney'),
(0.8581786155700684, u'vtkidney'),
(0.858029305934906, u'kidneywith')]

I am wondering what these words are. Do they come from acupuncture points? For example, do kidney2 and kidney.2 represent http://www.acupuncture.com/education/points/kidney/kid2.htm ?

  • Even if that's the case, is it correct to generate a single word from the phrase "kidney 2"?
  • Or was the pre-processing not done properly?
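
One way to look into this might be to check how often such tokens occurred in the training corpus: tokens with very low counts would point towards tokenization noise rather than real terms. Below is a minimal sketch, not anything from the BioSentVec authors. `token_counts` is a hypothetical helper written for this issue; it expects the two sequences that fasttext's `get_words(include_freq=True)` returns (words and their counts).

```python
def token_counts(words, freqs, tokens):
    """Map each token to its corpus count, or None if it is not in the vocabulary.

    words, freqs: parallel sequences, e.g. from model.get_words(include_freq=True).
    tokens: the tokens to look up (hypothetical examples below).
    """
    count_by_word = dict(zip(words, freqs))
    return {token: count_by_word.get(token) for token in tokens}


# With the model loaded as above (not run here, needs the .bin file):
#   words, freqs = model.get_words(include_freq=True)
#   print(token_counts(words, freqs, ['kidney', 'kidney2', 'kidney.2', 'qkidney']))
```

A `None` result would mean the token is not a vocabulary entry at all. Building the dictionary over the full vocabulary may take noticeable memory for a model of this size.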

But when I use fastText's own model, it returns the expected nearest neighbors:

model = fasttext.load_model('./models/cc.en.300.bin')
model.get_nearest_neighbors('kidney', 30)
[(0.7705090045928955, u'renal'),
(0.7571945786476135, u'kidneys'),
(0.7136564254760742, u'Kidney'),
(0.6960737109184265, u'kindey'),
(0.6932832598686218, u'liver'),
(0.6215611100196838, u'gallbladder'),
(0.6096128225326538, u'kidney-'),
(0.592450737953186, u'kidney-related'),
(0.5883890390396118, u'lung'),
(0.5875317454338074, u'Renal'),
(0.5851610898971558, u'kidney.'),
(0.580848753452301, u'dialysis'),
(0.5669795870780945, u'Kidneys'),
(0.565768301486969, u'pre-renal'),
(0.5617753267288208, u'hydronephrotic'),
(0.5602078437805176, u'non-renal'),
(0.5586943030357361, u'extra-renal'),
(0.557516872882843, u'ureter'),
(0.5568706393241882, u'hydronephrosis'),
(0.5556935667991638, u'nephrosis'),
(0.5507169961929321, u'extrarenal'),
(0.5478389859199524, u'bladder'),
(0.5455406904220581, u'nephritis'),
(0.540409505367279, u'pancreas'),
(0.538938045501709, u'gall-bladder'),
(0.5365235805511475, u'TEENney'),
(0.5338416695594788, u'pancreatic'),
(0.5323835611343384, u'ureteric'),
(0.5321975946426392, u'glomerular'),
(0.5308919548988342, u'prerenal')]
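
Until the cause is clear, a possible workaround is to post-filter the neighbor list returned by the BioWordVec model. The sketch below is only an assumption about what counts as a plausible word (letters, optionally joined by single hyphens); `PLAUSIBLE_WORD` and `split_neighbors` are hypothetical names, and the sample entries are taken from the BioWordVec output above.

```python
import re

# Assumption: a plausible word consists of ASCII letters only,
# optionally joined by single hyphens (e.g. "pre-renal").
PLAUSIBLE_WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def split_neighbors(neighbors):
    """Split (score, token) pairs into plausible words and noisy tokens."""
    clean, noisy = [], []
    for score, token in neighbors:
        if PLAUSIBLE_WORD.fullmatch(token):
            clean.append((score, token))
        else:
            noisy.append((score, token))
    return clean, noisy


# A few entries from the BioWordVec output above.
sample = [
    (0.9160109162330627, 'kidney*'),
    (0.8817461133003235, 'kidney-kidney'),
    (0.878646731376648, '1kidney'),
    (0.858029305934906, 'kidneywith'),
]
clean, noisy = split_neighbors(sample)
```

With the real model, the input would be `model.get_nearest_neighbors('kidney', 30)`. A character-level filter like this cannot catch fused tokens such as kidneywith or qkidney, so it only reduces the noise and does not answer whether the pre-processing was correct.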

kaushikacharya · Jul 03 '20 05:07