Can't find T790M mutation in civicmine

Open hongiiv opened this issue 2 years ago • 2 comments

Hi jakelever,

Thanks for this wonderful project.

When I used civicmine (http://bionlp.bcgsc.ca/civicmine), I couldn't find "T790M" in any sentence. That seemed odd to me, because EGFR T790M is a very well-known biomarker in cancer treatment.

This is a tokenizer problem: the spaCy language model (en_core_web_sm) splits "T790M" into "T790" and "M": (('T790', 'NOUN'), ('M', 'PROPN')).

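I patched the kindred package (kindred/Parser.py) to register "T790M" as a tokenizer special case:

from spacy.symbols import ORTH  # at the top of the file

if not model in Parser._models:
    Parser._models[model] = spacy.load(model, disable=['ner'])

self.nlp = Parser._models[model]
# Keep "T790M" as one token instead of splitting it into "T790" and "M"
special_case = [{ORTH: "T790M"}]
self.nlp.tokenizer.add_special_case("T790M", special_case)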

With this change, "T790M" is kept as a single token: ('T790M', 'VERB').
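Hard-coding one mutation only covers T790M, so other substitutions written in the same style may need the same treatment. As a rough sketch (not part of kindred or civicmine), special cases could be collected from the text with a regular expression. The pattern, the `build_special_cases` helper and the example sentences are all assumptions made for illustration, and the pattern only covers simple one-letter amino-acid substitutions:

```python
import re

# Assumed pattern: amino-acid letter, position, amino-acid letter (e.g. T790M).
# The lookarounds stop matches inside longer alphanumeric strings.
AMINO = "ACDEFGHIKLMNPQRSTVWY"
SUBSTITUTION = re.compile(
    rf"(?<![A-Za-z0-9])[{AMINO}][1-9][0-9]*[{AMINO}](?![A-Za-z0-9])"
)


def build_special_cases(texts):
    """Map each substitution-like mention to a one-token special case.

    Hypothetical helper: the values use the string key "ORTH" and have the
    same shape as the special_case list in the patch above.
    """
    cases = {}
    for text in texts:
        for match in SUBSTITUTION.finditer(text):
            mention = match.group(0)
            cases[mention] = [{"ORTH": mention}]
    return cases


# Illustrative sentences, not taken from the civicmine corpus.
sentences = [
    "EGFR T790M confers resistance to gefitinib.",
    "BRAF V600E and EGFR L858R were also reported.",
]
special_cases = build_special_cases(sentences)
print(sorted(special_cases))  # ['L858R', 'T790M', 'V600E']
```

Each entry could then be registered the same way as in the patch, by looping over `special_cases.items()` and calling `self.nlp.tokenizer.add_special_case(mention, case)`. If string keys are not accepted by the installed spaCy version, the `ORTH` symbol from `spacy.symbols` can be used as the key instead.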

best, jakelever

hongiiv, Feb 17 '23 11:02

Hi @hongiiv, thanks for looking into this. I'll do a little digging myself and see what other issues there may be.

jakelever, Feb 23 '23 17:02

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot], Jun 08 '23 05:06