Bio-Epidemiology-NER

Breaking word while labelling

Open parthplc opened this issue 2 years ago • 3 comments

Hey, kudos for the amazing work on biomedical NER. It's really impressive how well it performs. But sometimes the model breaks a single word into multiple subword tokens and labels each piece separately, which is kind of weird. Can we stop the model from doing that?

E.g.:

{
    "entity_group": "Administration",
    "score": 0.46949705481529236,
    "word": "thor",
    "start": 424,
    "end": 428
},
{
    "entity_group": "Medication",
    "score": 0.7422544360160828,
    "word": "##ugh",
    "start": 428,
    "end": 431
}

parthplc avatar Oct 31 '22 05:10 parthplc

Hi Parth, thanks for your review. I am currently working on it and will post an update once it is done.

dreji18 avatar Oct 31 '22 08:10 dreji18

Hi Deepak, this still seems to be an issue, at least with the Hugging Face version of the model. Is there an update on it?

tizianococcio avatar Aug 22 '23 13:08 tizianococcio

Hi. Thanks for the model. Great work! I'm seeing the same here (^ = break): An^esthesia, arthros^copic. I'm looking at the training files now and will let you know if I find the reason.

svanschalkwyk avatar Nov 11 '23 03:11 svanschalkwyk
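
One possible workaround for the split-word output reported in this thread is to post-process the results and merge WordPiece continuation pieces (those starting with `##`) back into the preceding entity. The sketch below is a minimal illustration, not the maintainer's fix. The function name `merge_subword_entities` is hypothetical, and it assumes the output has the shape shown in the first comment (`entity_group`, `score`, `word`, `start`, `end`). It also assumes that the label of the highest-scoring piece should win, which is only one of several reasonable choices.

```python
from typing import Dict, List


def merge_subword_entities(entities: List[Dict]) -> List[Dict]:
    """Merge "##" continuation pieces into the directly preceding entity.

    Hypothetical helper: the merged entity keeps the label and score of
    its highest-scoring piece (an assumption, not the model's own logic).
    """
    merged: List[Dict] = []
    for ent in entities:
        is_continuation = bool(
            merged
            and ent["word"].startswith("##")
            and merged[-1]["end"] == ent["start"]  # pieces must be adjacent
        )
        if is_continuation:
            prev = merged[-1]
            prev["word"] += ent["word"][2:]  # drop the "##" prefix
            prev["end"] = ent["end"]
            if ent["score"] > prev["score"]:
                prev["score"] = ent["score"]
                prev["entity_group"] = ent["entity_group"]
        else:
            merged.append(dict(ent))  # copy so the input is not mutated
    return merged


# The two pieces from the example in the first comment.
example = [
    {"entity_group": "Administration", "score": 0.46949705481529236,
     "word": "thor", "start": 424, "end": 428},
    {"entity_group": "Medication", "score": 0.7422544360160828,
     "word": "##ugh", "start": 428, "end": 431},
]

result = merge_subword_entities(example)
print(result)
```

With the example above, the two pieces collapse into a single entity spanning characters 424 to 431. Separately, the Hugging Face `transformers` token-classification pipeline accepts an `aggregation_strategy` argument (for example `"first"`, `"average"`, or `"max"`) that is meant to resolve labels at the word level. Whether it resolves the behaviour for this particular model has not been verified here.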