Can HunFlair detect unseen disease names?
Hi @mennatallah644 ,
Sure, HunFlair can detect disease names even if the model hasn't seen the exact disease during training. Have you noticed a missing annotation in your texts?
Hi @mariosaenger,
I tested the model with "corona virus" and "Covid-19" and it didn't recognize them. How can I improve the performance of the HunFlair model? I have a list of diseases I want the model to recognize, but I don't have a dataset, just the labels. Also, how can I optimize the speed of HunFlair?
Hi @mennatallah644 ,
On which kind of text do you apply the model? Note that HunFlair is focused on scientific biomedical literature. Furthermore, the prediction quality depends on the context and the surface form of the disease mention. For example, running the following snippet
from flair.data import Sentence
from flair.models import MultiTagger
from flair.tokenization import SciSpacyTokenizer

# load the joint HunFlair model and the scientific tokenizer
hunflair_tagger = MultiTagger.load("hunflair")
tokenizer = SciSpacyTokenizer()

sentences = [
    Sentence("In this study, we are investigating 19 patients with SARS-CoV-2.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering COVID-19.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering Covid.", use_tokenizer=tokenizer),
]

hunflair_tagger.predict(sentences)
for sentence in sentences:
    print(sentence.to_original_text())
    print(sentence.get_spans("hunflair-disease"))
... leads to:
In this study, we are investigating 19 patients with SARS-CoV-2.
[<Disease-span (11,12,13): "SARS - CoV-2">]
In this study, we are investigating 19 patients suffering COVID-19.
[<Disease-span (11): "COVID-19">]
In this study, we are investigating 19 patients suffering Covid.
[]
Unfortunately, there is no simple way to add a list of new diseases to the model (without having a data set of annotated texts). As an ad-hoc solution, I would recommend an additional post-processing step via (approximate) string matching.
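For illustration, here is a minimal sketch of such a post-processing step using Python's standard-library difflib for approximate matching. The disease_list and the cutoff value are assumptions for this example, not part of HunFlair:

import difflib

# hypothetical list of disease labels you want to catch (assumption for this sketch)
disease_list = ["covid-19", "sars-cov-2", "coronavirus"]

def match_diseases(text, cutoff=0.75):
    """Return tokens that approximately match one of the known disease labels."""
    hits = []
    for token in text.split():
        candidate = token.lower().strip(".,;:")
        if difflib.get_close_matches(candidate, disease_list, n=1, cutoff=cutoff):
            hits.append(token)
    return hits

print(match_diseases("19 patients suffering Covid."))  # ['Covid.']

Note that this toy version only matches single tokens; multi-word names such as "coronavirus disease" would require n-gram or dictionary-based matching.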
Also, how can I optimize the speed of HunFlair?
Can you elaborate a bit more on your computing environment (CPU/GPU? memory?) and how you run the model? For the latter, do you predict multiple sentences at once and re-use the tokenizer (both as shown in the snippet above)?
Hello @mariosaenger,
the machine I am running HunFlair on is Google Colab, which offers a K80 GPU with 12 GB RAM and 78 GB disk. If I need to decrease HunFlair's runtime, should I go for more RAM, more disk, or a more powerful GPU? What is your suggestion? When I run multiple sentences, I use a for loop and pass one sentence at a time to HunFlair. Is there a way to pass all test data at once so that I can shrink the time? Sorry for the late reply.
Hi @mennatallah644,
I would suggest opting for a (better) GPU - this will likely give you the biggest performance improvement. You can simply provide all sentences at once to the tagger, as in the example above:
tokenizer = SciSpacyTokenizer()

sentences = [
    Sentence("In this study, we are investigating 19 patients with SARS-CoV-2.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering COVID-19.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering Covid.", use_tokenizer=tokenizer),
    # more sentences ...
]
hunflair_tagger.predict(sentences)
This should also speed up predictions.
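As a side note, if memory allows, increasing the batch size used internally by predict may help further. A minimal sketch, assuming your flair version's MultiTagger forwards the mini_batch_size keyword to the underlying taggers' predict method (worth checking in your version):

# larger batches keep the GPU busier, at the cost of more memory
hunflair_tagger.predict(sentences, mini_batch_size=64)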
Hello @mariosaenger,
I tried to provide all sentences at once and it actually sped up the prediction, but
sentences = [
    Sentence("In this study, we are investigating 19 patients with SARS-CoV-2.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering COVID-19.", use_tokenizer=tokenizer),
    Sentence("In this study, we are investigating 19 patients suffering Covid.", use_tokenizer=tokenizer),
    # more sentences ...
]
I actually build these lines in a for loop, as I am reading the text from a file, so I do it as:
sentences = []
for i in test_list:
    sent = Sentence(i, use_tokenizer=tokenizer)
    sentences.append(sent)
and it takes too much time when it loops over many sentences. Is there a way to optimize it so that I can provide all the sentences without looping?
You may opt for a simpler tokenizer, e.g. SegTok or a whitespace tokenizer; however, this will generally reduce prediction quality. How many sentences do you want to process?
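A minimal sketch of that trade-off, assuming the SegtokTokenizer class from flair.tokenization (class names may vary slightly between flair versions):

from flair.data import Sentence
from flair.tokenization import SegtokTokenizer

# rule-based and much faster than SciSpacy, but less accurate on biomedical text
tokenizer = SegtokTokenizer()
sentences = [Sentence(text, use_tokenizer=tokenizer) for text in test_list]
hunflair_tagger.predict(sentences)

The list comprehension also avoids the explicit append loop, though the tokenization itself, not the loop, is usually the dominant cost.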
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Maybe some parallelization could leverage the Colab hardware better?
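For what it's worth, a sketch of that idea: parallelizing sentence construction across CPU processes with the standard library. The per-worker tokenizer setup is an assumption (SciSpacy models are heavyweight, so each worker loads its own copy), and texts is a placeholder for your lines read from a file:

from multiprocessing import Pool

from flair.data import Sentence
from flair.tokenization import SciSpacyTokenizer

_tokenizer = None

def _init_worker():
    # each worker process loads its own tokenizer once
    global _tokenizer
    _tokenizer = SciSpacyTokenizer()

def _build_sentence(text):
    return Sentence(text, use_tokenizer=_tokenizer)

if __name__ == "__main__":
    texts = ["..."]  # placeholder: your lines read from a file
    with Pool(processes=4, initializer=_init_worker) as pool:
        sentences = pool.map(_build_sentence, texts)

This only helps if tokenization on the CPU, rather than GPU inference, is the bottleneck.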
I am still interested in this - maybe remove the stale bot here?
Still interested...
@dumblob what are you interested in? The original question has been answered, I would say.
I am just interested in the optimization of speed - if the bottleneck is reading from a file, then I would welcome an end-to-end example of how to approach it :wink:.
Otherwise yes, the question has been thoroughly answered and I am thankful for that! Thanks @mariosaenger!
Anyone with an end-to-end example of how to avoid/suppress the bottleneck of reading a file?