What types of entities can each scispaCy model recognize?
Hello,
First, thank you for developing and maintaining the scispaCy package — it’s an impressive tool and a valuable contribution to the field of biomedical NLP.
I’m currently experimenting with the en_core_sci_md model, and I would like to better understand what types of entities it is designed to recognize. For example, when testing the following text:
"""
The patient is a 58-year-old male with a history of type 2 diabetes and hypertension.
He presents with chest pain and shortness of breath for the past two hours.
In the emergency room, a troponin test was ordered, which came back elevated.
An urgent coronary angiography was performed, and the patient was started on aspirin and atorvastatin.
He has a known penicillin allergy.
His smoking history is considered a major risk factor.
"""
All the words in bold were the ones that I wanted to extract as entities, but the model only extracted the following:
- patient (ENTITY)
- male (ENTITY)
- history of type 2 diabetes (ENTITY)
- hypertension (ENTITY)
- chest pain (ENTITY)
- shortness of breath (ENTITY)
- hours (ENTITY)
Could you please point me to documentation or resources that describe the entity types covered by this model, so that I can better anticipate what it can and cannot extract?
Thank you very much for your time and for your excellent work on scispaCy!
Hi, the NER model is trained on the MedMentions dataset (https://aclanthology.org/P18-1010/). This is documented in the scispaCy paper (https://arxiv.org/abs/1902.07669). Per the paper, MedMentions was annotated for entities that can be linked to UMLS. There are a few more specific NER models listed in the README (https://github.com/allenai/scispacy?tab=readme-ov-file#available-models), each of which identifies entities in line with the corpus it was trained on. Of course, these models are also pretty old at this point, and some of the things you identify above simply seem like model mistakes :)
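If it helps, one way to check what a given model can emit is to ask its NER component for its label set and then group the predicted entities by label. The following is only a minimal sketch: it assumes spaCy, scispaCy, and the named model packages are already installed, and the helper names `group_by_label` and `inspect_model` are made up for illustration. `en_ner_bc5cdr_md` is used as an example of one of the more specific models from the README.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs by label, keeping first-seen order."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


def inspect_model(model_name, text):
    """Return the NER label set of a model and its entities for `text`.

    Hypothetical helper: assumes `model_name` is an installed spaCy or
    scispaCy model whose pipeline contains a component named "ner".
    """
    import spacy  # imported lazily so group_by_label works without spaCy

    nlp = spacy.load(model_name)
    labels = nlp.get_pipe("ner").labels  # tuple of label strings
    doc = nlp(text)
    grouped = group_by_label((ent.text, ent.label_) for ent in doc.ents)
    return labels, grouped


# Example usage (requires the models to be installed):
# text = "The patient was started on aspirin and atorvastatin."
# for name in ("en_core_sci_md", "en_ner_bc5cdr_md"):
#     labels, grouped = inspect_model(name, text)
#     print(name, labels, grouped)
```

Comparing the label sets of the two models should make it clear which one is closer to the entity types you are after. The output above suggests `en_core_sci_md` tags every span with the generic `ENTITY` label, while the corpus-specific models use their own typed labels.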