
Can HunFlair detect unseen disease names?

Open mennatallah644 opened this issue 3 years ago • 11 comments


mennatallah644 avatar Sep 13 '21 13:09 mennatallah644

Hi @mennatallah644 ,

sure, HunFlair can detect disease names even if the model hasn't seen the exact disease during training. Have you noticed a missing annotation in your texts?

mariosaenger avatar Sep 14 '21 07:09 mariosaenger

Hi @mariosaenger, I tested the model with "corona virus" and "COVID-19" and it didn't recognize them. How can I improve the performance of the HunFlair model? I have a list of diseases I want the model to recognize, but I don't have a dataset, only the labels. Also, how can I optimize the speed of HunFlair?

mennatallah644 avatar Sep 16 '21 10:09 mennatallah644

Hi @mennatallah644 ,

what kind of text are you applying the model to? Note that HunFlair is focused on scientific biomedical literature. Furthermore, the prediction quality depends on the context and the surface form of the disease mention. For example, running the following snippet

from flair.data import Sentence
from flair.models import MultiTagger
from flair.tokenization import SciSpacyTokenizer

# Load the pre-trained HunFlair models
hunflair_tagger = MultiTagger.load("hunflair")

# Re-use a single tokenizer instance for all sentences
tokenizer = SciSpacyTokenizer()
sentences = [
    Sentence("In this study, we are investigating 19 patients with SARS-CoV-2.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering COVID-19.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering Covid.", use_tokenizer=tokenizer),
]

# Predict all sentences in one call
hunflair_tagger.predict(sentences)

for sentence in sentences:
    print(sentence.to_original_text())
    print(sentence.get_spans("hunflair-disease"))

... leads to:

In this study, we are investigating 19 patients with SARS-CoV-2.
[<Disease-span (11,12,13): "SARS - CoV-2">]

In this study, we are investigating 19 patients suffering COVID-19.
[<Disease-span (11): "COVID-19">]

In this study, we are investigating 19 patients suffering Covid.
[]

Unfortunately, there is no simple way to add a list of new diseases to the model without a data set of annotated texts. As an ad-hoc solution, I would recommend an additional post-processing step via (approximate) string matching.
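Such a post-processing step could look like the following minimal sketch, using Python's standard-library `difflib` for approximate matching. The disease list, the `0.7` cutoff, and the `match_diseases` helper are illustrative assumptions, not part of HunFlair:

```python
# Sketch of an approximate string-matching post-processing step.
# The disease list and similarity cutoff are illustrative assumptions.
from difflib import get_close_matches

disease_list = ["COVID-19", "SARS-CoV-2", "influenza"]  # your own label list

def match_diseases(tokens, diseases, cutoff=0.7):
    """Return (token, matched_disease) pairs for tokens that
    approximately match an entry of the disease list (case-insensitive)."""
    lowered = {d.lower(): d for d in diseases}
    hits = []
    for token in tokens:
        candidates = get_close_matches(token.lower(), list(lowered), n=1, cutoff=cutoff)
        if candidates:
            hits.append((token, lowered[candidates[0]]))
    return hits

tokens = "19 patients suffering Covid .".split()
print(match_diseases(tokens, disease_list))  # → [('Covid', 'COVID-19')]
```

You could run this over the tokens of sentences where the tagger found no disease span, and treat any hit as an additional annotation. The cutoff controls the precision/recall trade-off of the matching.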

also how can i optimize the speed of Hunflair?

Can you elaborate a bit more on your computing environment (CPU/GPU? memory?) and how you run the model? For the latter: do you predict multiple sentences at once and re-use the tokenizer (both as shown in the snippet above)?

mariosaenger avatar Sep 16 '21 13:09 mariosaenger

Hello @mariosaenger, the machine I am running HunFlair on is Google Colab, which offers a K80 GPU with 12 GB RAM and 78 GB disk. If I need to decrease the runtime of HunFlair, should I go for more RAM, more disk, or a more powerful GPU? What is your suggestion? When I run multiple sentences, I use a for loop and pass one sentence at a time to HunFlair. Is there a way to pass all test data at once to save time? Sorry for the late reply.

mennatallah644 avatar Oct 13 '21 08:10 mennatallah644

Hi @mennatallah644,

I would suggest opting for a (better) GPU - that will likely give you the biggest performance improvement. You can simply provide all sentences at once to the tagger, as in the example above:

tokenizer = SciSpacyTokenizer()
sentences = [
    Sentence("In this study, we are investigating 19 patients with SARS-CoV-2.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering COVID-19.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering Covid.", use_tokenizer=tokenizer),
   # more sentences ...
]
hunflair_tagger.predict(sentences)

This should also speed up predictions.
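If the corpus is very large, one option is to process it in fixed-size chunks rather than building one huge list, so that not all `Sentence` objects have to be held in memory at once. This is a plain-Python sketch (the `chunks` helper and the chunk size of 32 are my own assumptions, not flair API):

```python
# Plain-Python sketch: process a large corpus in fixed-size chunks.
# The chunk size of 32 is an arbitrary choice; tune it to your memory.
def chunks(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

texts = ["In this study, we are investigating 19 patients with SARS-CoV-2."] * 100

for batch in chunks(texts, 32):
    # In the HunFlair setting this would be (using the names from above):
    #   sentences = [Sentence(t, use_tokenizer=tokenizer) for t in batch]
    #   hunflair_tagger.predict(sentences)
    pass
```

Each call to `predict` still gets a whole batch of sentences, so you keep the batching speedup without materializing the entire corpus.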

mariosaenger avatar Oct 15 '21 12:10 mariosaenger

Hello @mariosaenger, I tried providing all sentences at once and it did speed up the prediction, but

sentences = [
    Sentence("In this study, we are investigating 19 patients with SARS-CoV-2.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering COVID-19.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering Covid.", use_tokenizer=tokenizer),
   # more sentences ...
]

I actually build this list in a for loop, since I am reading the text from a file:

for text in test_list:
    sent = Sentence(text, use_tokenizer=tokenizer)
    sentences.append(sent)

and it takes too much time when looping over many sentences. Is there a way to optimize this so that I can provide all the sentences without looping?

mennatallah644 avatar Oct 15 '21 12:10 mennatallah644

You may opt for a simpler tokenizer, e.g. SegTok or WhitespaceTokenizer; however, this will generally reduce prediction quality. How many sentences do you want to process?
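To illustrate why a simpler tokenizer costs quality: naive whitespace splitting keeps punctuation glued to the adjacent token, so the model sees a different surface form than it saw during training. This is a plain-Python illustration, not the actual flair tokenizer classes:

```python
# Naive whitespace splitting keeps sentence-final punctuation attached
# to the last token, so the model would see "SARS-CoV-2." instead of
# "SARS-CoV-2" followed by a separate "." token.
text = "In this study, we are investigating 19 patients with SARS-CoV-2."

whitespace_tokens = text.split()
print(whitespace_tokens[-1])  # → 'SARS-CoV-2.'

# A scispaCy-style tokenizer would separate the final period, yielding
# "SARS-CoV-2" and "." as distinct tokens, which matches the training data.
```

That mismatch in surface forms is the main reason the cheaper tokenizers tend to lose recall on mentions like "SARS-CoV-2" or "COVID-19".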

mariosaenger avatar Oct 15 '21 13:10 mariosaenger

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Feb 13 '22 12:02 stale[bot]

Maybe some parallelization could leverage the Colab HW better?

dumblob avatar Feb 13 '22 14:02 dumblob

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 18 '22 19:06 stale[bot]

I am still interested in this - maybe remove the stale bot here?

dumblob avatar Jun 18 '22 20:06 dumblob

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 22 '22 19:10 stale[bot]

Still interested...

dumblob avatar Oct 22 '22 21:10 dumblob

@dumblob what are you interested in? The original question has been answered I would say.

alanakbik avatar Oct 25 '22 15:10 alanakbik

I am just interested in the speed optimization - if the bottleneck is reading from a file, then I would welcome an end-to-end example of how to approach it :wink:.

Otherwise yes, the question has been thoroughly answered and I am thankful for that! Thanks @mariosaenger !

dumblob avatar Oct 25 '22 19:10 dumblob

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 18 '23 19:03 stale[bot]

Does anyone have an end-to-end example of how to avoid the bottleneck of reading from a file?

dumblob avatar Mar 19 '23 20:03 dumblob

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 12 '23 19:08 stale[bot]