Covid-19 papers

Open tonyreina opened this issue 5 years ago • 3 comments

I was thinking of using BioBERT-BioASQ as a web service that lets people scan Covid-19 articles (the "context") and ask questions about them. One thing I wasn't sure of was the sequence length. I think inputs have to be 384 tokens or fewer. If I fine-tune the model, can I expand the sequence length to something more like 2048 tokens? Would that affect the accuracy? Or are there better ways to handle full-length articles as the context? Thanks. -Tony

tonyreina, Mar 20 '20 03:03

Hi @tonyreina, we are actually preparing a web service for COVID-19 papers, and it will be available soon. To answer your question: a context longer than 384 tokens can be sliced with a 384-token window, which is how BERT processes long sequences. Doing so definitely affects accuracy (mostly lowering it), and you would need to properly normalize the tokens across the multiple sequences. Clark's paper on this matter might help. Thanks.
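For readers following along, here is a minimal sketch of the windowing idea described above. It is not the code from this repository. The function names `sliding_windows` and `shared_softmax` are hypothetical, the stride of 128 is an assumed value, and normalizing answer scores jointly over all windows is only one possible reading of "normalize across the multiple sequences". In a real BERT QA input, the question and special tokens also count toward the 384-token budget, which this sketch ignores.

```python
import math


def sliding_windows(tokens, max_len=384, stride=128):
    """Split a token list into overlapping windows of at most max_len tokens.

    Returns a list of (start_offset, window_tokens) pairs.
    max_len=384 follows the thread; stride=128 is an assumption.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        # Stop once the current window reaches the end of the document.
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows


def shared_softmax(scores_per_window):
    """Normalize scores jointly over all windows (assumed interpretation).

    Instead of a separate softmax per window, every score is divided by
    one shared partition sum, so probabilities are comparable across windows.
    """
    flat = [s for window in scores_per_window for s in window]
    top = max(flat)  # subtract the maximum for numerical stability
    z = sum(math.exp(s - top) for s in flat)
    return [[math.exp(s - top) / z for s in window]
            for window in scores_per_window]


# Usage: a stand-in "document" of 1000 tokens.
doc = [f"tok{i}" for i in range(1000)]
windows = sliding_windows(doc)
probs = shared_softmax([[0.5, 2.0], [1.0, 3.0, -1.0]])
```

With a stride smaller than the window length, every token appears in at least one window, and tokens near a window boundary also appear in a neighboring window with more surrounding context.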

jhyuklee, Mar 20 '20 04:03

Very cool. I work at Intel and we're interested in helping out wherever we can. Is there anything we could do to help? I'm wondering if you need compute resources or programming help in deploying. Please let me know.

tonyreina, Mar 20 '20 04:03

Thank you, Tony. As soon as we are ready for deployment, we will ask for help. I'll let you know when we are ready. Thanks.

jhyuklee, Mar 20 '20 06:03