pymetamap
position returned by pymetamap
Hi Anthony,
Firstly, thank you for the wonderful implementation of metamap.
However, I was running into some issues while extracting the keywords using pymetamap.
For example, given the sentence "John had a huge heart-attack", could you please show me how to extract the exact position of the keyword identified by pymetamap? It shows position = 17:12, but in several cases the character position I get back is off by 1-2 characters.
Could you provide some insight into this?
Hi ShoRit,
Can you provide some examples? This is going to be an issue with MetaMap, not with the wrapper. But, if you share an example, I can look into it.
Metamap positions are not 0-indexed, that must be why it appears off
@ShoRit @yuliaoh My understanding is that it's 0-indexed. The MMI output documentation states:
Positional Information – Bar separated list of positional information doubles showing StartPos, colon (:), and Length of each trigger identified in the Trigger Information field. StartPos begins at position zero (0) of the input text.
Here's the output using MetaMap 2020 release version:
echo "heart attack" | ./public_mm/bin/metamap -N -Q 4 -y --sldi
outputs
USER|MMI|5.18|Myocardial Infarction|C0027051|[dsyn]|["HEART ATTACK"-tx-1-"heart attack"-noun-0]|TX|0/12|
Another example:
echo "John had a huge heart attack" | ./public_mm/bin/metamap -N -Q 4 -y --sldi
outputs
USER|MMI|3.75|Myocardial Infarction|C0027051|[dsyn]|["HEART ATTACK"-tx-1-"heart attack"-noun-0]|TX|16/12|
As you can see, it's 0-indexed. I passed the same input arguments that pymetamap uses.
Then why is pymetamap's output 1-indexed?
It appears to be 1-indexed because of the way pymetamap passes the input text. Taking the example from the pymetamap README:
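To make the StartPos/Length convention concrete, here is a minimal Python sketch that turns a positional entry such as 16/12 back into the matched text. The helper name span_from_pos_info and its offset parameter are hypothetical (they are not part of pymetamap or MetaMap), and the sketch assumes a single start/length pair rather than a list of several positions.

```python
def span_from_pos_info(text, pos_info, offset=0):
    """Return the substring described by a single 'start/length' entry.

    Hypothetical helper, not part of pymetamap. `offset` is subtracted from
    the reported start position (0 for raw MetaMap output).
    """
    start_str, length_str = pos_info.split("/")
    start = int(start_str) - offset
    length = int(length_str)
    return text[start:start + length]


sentence = "John had a huge heart attack"
# MetaMap reported 16/12 for this sentence in the run above.
print(span_from_pos_info(sentence, "16/12"))  # heart attack
```

With the raw MetaMap output shown above, no offset is needed: slicing from position 16 for 12 characters recovers the trigger exactly.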
In [3]: sents = ['Heart Attack', 'John had a huge heart attack']
In [4]: concepts,error = mm.extract_concepts(sents,[1,2])
In [5]: for concept in concepts:
   ...:     print(concept)
   ...:
ConceptMMI(index='1', mm='MMI', score='5.18', preferred_name='Myocardial Infarction', cui='C0027051', semtypes='[dsyn]', trigger='["HEART ATTACK"-tx-1-"Heart Attack"-noun-0]', location='TX', pos_info='1/12', tree_codes='')
ConceptMMI(index='2', mm='MMI', score='3.75', preferred_name='Myocardial Infarction', cui='C0027051', semtypes='[dsyn]', trigger='["HEART ATTACK"-tx-1-"heart attack"-noun-0]', location='TX', pos_info='17/12', tree_codes='')
Looking into the code shows why the positions become 1-indexed in pymetamap's output:
https://github.com/AnthonyMRios/pymetamap/blob/master/pymetamap/SubprocessBackend.py#L174
if input_text is None:
    input_text = '{0!r}|{1!r}\n'.format(identifier, sentence).encode('utf8')
else:
    input_text += '{0!r}|{1!r}\n'.format(identifier, sentence).encode('utf8')
Have a look at the difference between the two strings:
In [12]: '{0!r}'.format('Heart Attack')
Out[12]: "'Heart Attack'"
In [13]: '{0}'.format('Heart Attack')
Out[13]: 'Heart Attack'
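The !r conversion applies repr(), which wraps the string in quotes, so MetaMap counts the opening quote as position 0 and every position shifts by one. This has been nicely explained by mgilson in https://stackoverflow.com/a/38418132/282155 using an example as well as the Python documentation.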
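A short sketch of the effect, using only plain Python string formatting: it compares where the trigger starts in the repr-formatted text versus the plain text. The function name trigger_start is hypothetical and exists only for this illustration. The shift of exactly one character is an assumption that holds when repr() does not escape anything before the trigger; sentences containing backslashes, newlines, or a mix of both quote styles may shift by more.

```python
def trigger_start(sentence, trigger, use_repr):
    """Return where `trigger` starts in the text as it would be formatted.

    Hypothetical helper for illustration; use_repr=True mimics the
    '{0!r}' formatting quoted above.
    """
    template = "{0!r}" if use_repr else "{0}"
    return template.format(sentence).find(trigger)


sentence = "John had a huge heart attack"
quoted_start = trigger_start(sentence, "heart attack", use_repr=True)
plain_start = trigger_start(sentence, "heart attack", use_repr=False)

print(quoted_start)  # 17, matches pymetamap's pos_info of 17/12
print(plain_start)   # 16, matches MetaMap's own 16/12

# Until the wrapper is fixed, subtracting the shift recovers the real span.
shift = quoted_start - plain_start
start, length = 17 - shift, 12
print(sentence[start:start + length])  # heart attack
```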
Nice catch, @kaushikacharya. I will look into creating a fix for this; it seems reasonably easy.