
Can HunFlair detect unseen disease names?

Open mennatallah644 opened this issue 3 years ago • 11 comments


mennatallah644 avatar Sep 13 '21 13:09 mennatallah644

Hi @mennatallah644 ,

sure, HunFlair can detect disease names even if the model hasn't seen the exact disease during training. Have you noticed a missing annotation in your texts?

mariosaenger avatar Sep 14 '21 07:09 mariosaenger

Hi @mariosaenger, I tested the model with "corona virus" and "COVID-19" and it didn't recognize them. How can I improve the performance of the HunFlair model? I have a list of diseases I want the model to recognize, but I don't have a dataset, only the labels. Also, how can I optimize the speed of HunFlair?

mennatallah644 avatar Sep 16 '21 10:09 mennatallah644

Hi @mennatallah644 ,

what kind of text are you applying the model to? Note that HunFlair is focused on scientific biomedical literature. Furthermore, the prediction quality depends on the context and the surface form of the disease mention. For example, running the following snippet

from flair.data import Sentence
from flair.models import MultiTagger
from flair.tokenization import SciSpacyTokenizer

# Load the pre-trained HunFlair models
hunflair_tagger = MultiTagger.load("hunflair")

# Re-use a single tokenizer instance for all sentences
tokenizer = SciSpacyTokenizer()
sentences = [
    Sentence("In this study, we are investigating 19 patients with SARS-CoV-2.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering COVID-19.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering Covid.", use_tokenizer=tokenizer),
]

# Predict all sentences in one call
hunflair_tagger.predict(sentences)

for sentence in sentences:
    print(sentence.to_original_text())
    print(sentence.get_spans("hunflair-disease"))

... leads to:

In this study, we are investigating 19 patients with SARS-CoV-2.
[<Disease-span (11,12,13): "SARS - CoV-2">]

In this study, we are investigating 19 patients suffering COVID-19.
[<Disease-span (11): "COVID-19">]

In this study, we are investigating 19 patients suffering Covid.
[]

Unfortunately, there is no simple way to add a list of new diseases to the model without a data set of annotated texts. As an ad-hoc solution, I would recommend an additional post-processing step via (approximate) string matching.
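Such a post-processing step could look like the following minimal sketch, using Python's standard-library `difflib` for approximate matching. The disease list, the `0.7` cutoff, and the `match_diseases` helper are illustrative assumptions, not part of HunFlair:

```python
# Sketch of an approximate string-matching post-processing step.
# The disease list and similarity cutoff are illustrative assumptions.
from difflib import get_close_matches

disease_list = ["COVID-19", "SARS-CoV-2", "influenza"]  # your own label list

def match_diseases(tokens, diseases, cutoff=0.7):
    """Return (token, matched_disease) pairs for tokens that
    approximately match an entry of the disease list (case-insensitive)."""
    lowered = {d.lower(): d for d in diseases}
    hits = []
    for token in tokens:
        candidates = get_close_matches(token.lower(), list(lowered), n=1, cutoff=cutoff)
        if candidates:
            hits.append((token, lowered[candidates[0]]))
    return hits

tokens = "19 patients suffering Covid .".split()
print(match_diseases(tokens, disease_list))  # → [('Covid', 'COVID-19')]
```

You could run this over the tokens of sentences where the tagger found no disease span, and treat any hit as an additional annotation. The cutoff controls the precision/recall trade-off of the matching.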

also how can i optimize the speed of Hunflair?

Can you elaborate a bit more on your computing environment (CPU/GPU? memory?) and how you run the model? For the latter: do you predict multiple sentences at once and re-use the tokenizer (both as shown in the snippet above)?

mariosaenger avatar Sep 16 '21 13:09 mariosaenger

Hello @mariosaenger, the machine I am running HunFlair on is Google Colab, which offers a K80 GPU with 12 GB RAM and 78 GB disk. If I need to decrease the runtime of HunFlair, should I go for more RAM, more disk, or a more powerful GPU? What is your suggestion? When I run multiple sentences, I use a for loop and pass one sentence at a time to HunFlair. Is there a way to pass all test data at once to save time? Sorry for the late reply.

mennatallah644 avatar Oct 13 '21 08:10 mennatallah644

Hi @mennatallah644,

I would suggest opting for a (better) GPU - that will likely give you the biggest performance improvement. You can simply provide all sentences at once to the tagger, as in the example above:

tokenizer = SciSpacyTokenizer()
sentences = [
    Sentence("In this study, we are investigating 19 patients with SARS-CoV-2.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering COVID-19.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering Covid.", use_tokenizer=tokenizer),
   # more sentences ...
]
hunflair_tagger.predict(sentences)

This should also speed up predictions.
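If the corpus is very large, one option is to process it in fixed-size chunks rather than building one huge list, so that not all `Sentence` objects have to be held in memory at once. This is a plain-Python sketch (the `chunks` helper and the chunk size of 32 are my own assumptions, not flair API):

```python
# Plain-Python sketch: process a large corpus in fixed-size chunks.
# The chunk size of 32 is an arbitrary choice; tune it to your memory.
def chunks(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

texts = ["In this study, we are investigating 19 patients with SARS-CoV-2."] * 100

for batch in chunks(texts, 32):
    # In the HunFlair setting this would be (using the names from above):
    #   sentences = [Sentence(t, use_tokenizer=tokenizer) for t in batch]
    #   hunflair_tagger.predict(sentences)
    pass
```

Each call to `predict` still gets a whole batch of sentences, so you keep the batching speedup without materializing the entire corpus.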

mariosaenger avatar Oct 15 '21 12:10 mariosaenger

Hello @mariosaenger, I tried providing all sentences at once and it did speed up the prediction, but

sentences = [
    Sentence("In this study, we are investigating 19 patients with SARS-CoV-2.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering COVID-19.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering Covid.", use_tokenizer=tokenizer),
   # more sentences ...
]

I actually build this list in a for loop, since I am reading the text from a file:

for text in test_list:
    sent = Sentence(text, use_tokenizer=tokenizer)
    sentences.append(sent)

and it takes too much time when looping over many sentences. Is there a way to optimize this so that I can provide all the sentences without looping?

mennatallah644 avatar Oct 15 '21 12:10 mennatallah644

You may opt for a simpler tokenizer, e.g. SegTok or WhitespaceTokenizer; however, this will generally reduce prediction quality. How many sentences do you want to process?
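To illustrate why a simpler tokenizer costs quality: naive whitespace splitting keeps punctuation glued to the adjacent token, so the model sees a different surface form than it saw during training. This is a plain-Python illustration, not the actual flair tokenizer classes:

```python
# Naive whitespace splitting keeps sentence-final punctuation attached
# to the last token, so the model would see "SARS-CoV-2." instead of
# "SARS-CoV-2" followed by a separate "." token.
text = "In this study, we are investigating 19 patients with SARS-CoV-2."

whitespace_tokens = text.split()
print(whitespace_tokens[-1])  # → 'SARS-CoV-2.'

# A scispaCy-style tokenizer would separate the final period, yielding
# "SARS-CoV-2" and "." as distinct tokens, which matches the training data.
```

That mismatch in surface forms is the main reason the cheaper tokenizers tend to lose recall on mentions like "SARS-CoV-2" or "COVID-19".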

mariosaenger avatar Oct 15 '21 13:10 mariosaenger

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Feb 13 '22 12:02 stale[bot]

Maybe some parallelization could leverage the Colab HW better?

dumblob avatar Feb 13 '22 14:02 dumblob

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 18 '22 19:06 stale[bot]

I am still interested in this - maybe remove the stale bot here?

dumblob avatar Jun 18 '22 20:06 dumblob

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 22 '22 19:10 stale[bot]

Still interested...

dumblob avatar Oct 22 '22 21:10 dumblob

@dumblob what are you interested in? The original question has been answered I would say.

alanakbik avatar Oct 25 '22 15:10 alanakbik

I am just interested in the speed optimization - if the bottleneck is reading from a file, then I would welcome an end-to-end example of how to approach it :wink:.

Otherwise yes, the question has been thoroughly answered and I am thankful for that! Thanks @mariosaenger !

dumblob avatar Oct 25 '22 19:10 dumblob

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 18 '23 19:03 stale[bot]

Does anyone have an end-to-end example of how to avoid the bottleneck of reading from a file?

dumblob avatar Mar 19 '23 20:03 dumblob

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 12 '23 19:08 stale[bot]