
using NLU for biobert embeddings -- takes a really long time on list of 10,000 words, and on 1 word

Open · krico1 opened this issue 3 years ago · 3 comments

Hi, so we are working on generating biobert embeddings for our project. When we run it on a single word it takes about a second or so. When we run on a list of 10,000 words, it either times out or takes upwards of hours to run. Is this normal? Below is how we are using it:

def load_biobert(self):
    # Load BioBERT model (for sentence-type embeddings)
    self.logger.info("Loading BioBERT model...")
    start = time.time()
    biobert = nlu.load('en.embed_sentence.biobert.pmc_base_cased')
    end = time.time()
    self.logger.info('done (BioBERT loading time: %.2fs)', end - start)
    return biobert

def get_biobert_embeddings(self, strings):
    embedding_list = []
    for string in strings:
        self.logger.debug("...Generating embedding for: %s", string)
        embedding_list.append(self.get_biobert_embedding(string))
    return embedding_list

def get_biobert_embedding(self, string):
    embedding = self.biobert.predict(string, output_level='sentence', get_embeddings=True)
    return embedding.sentence_embedding_biobert.values[0]

krico1 commented Feb 24 '22 18:02

Hi @krico1, large embeddings like BioBERT can be quite slow because of the large deep learning models behind them. But you can achieve roughly a 10x speedup by running NLU in GPU mode.

All you need to do is set gpu=True and make sure the GPU is available to TensorFlow beforehand. Then the following call gives you the GPU pipeline:

nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True)


See this notebook as a reference.

Also note: If you have a large dataset at hand, it will be faster to feed NLU all the data at once instead of one by one
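The "all at once" advice can be sketched as follows. This is a minimal sketch, assuming `pipe` is the object returned by `nlu.load(...)` and that `predict()` accepts a list of strings in one call; the helper name `get_biobert_embeddings_batched` is hypothetical, not part of NLU.

```python
def get_biobert_embeddings_batched(pipe, strings):
    # One predict() call over the whole list: the pipeline then processes
    # all rows in a single run instead of paying the per-call Spark
    # overhead once per string (10,000 times in the example above).
    df = pipe.predict(strings, output_level='sentence', get_embeddings=True)
    return list(df.sentence_embedding_biobert.values)
```

Compared to the per-word loop in get_biobert_embeddings above, this enters the pipeline once for the entire list, which is where most of the speedup comes from.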

C-K-Loan commented Feb 27 '22 14:02

@C-K-Loan Hi! Unfortunately, I am not able to obtain the embeddings (even when adding get_embeddings=True). I tried multiple models and included other parameters, but with no success. In particular, nlu.load(biobert).predict("random sentence", output_level='token', get_embeddings=True) does not give the expected output. I thought the column was being dropped, so I added drop_irrelevant_cols=False, but still no success.
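One way to debug a seemingly missing embedding column is to list which columns predict() actually returned. A small sketch, assuming predict() returns a pandas-style DataFrame as in recent NLU versions; `find_embedding_columns` is a hypothetical helper, not an NLU function.

```python
def find_embedding_columns(df):
    # NLU names the embedding column after the output level and model
    # (e.g. 'sentence_embedding_biobert'), so listing the candidates shows
    # what actually came back before concluding the column was dropped.
    return [c for c in df.columns if "embedding" in c.lower()]
```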

thank you!

MargheCap commented May 02 '22 07:05

@C-K-Loan I have the same problem as @MargheCap . I assume it has something to do with how we install the nlu package. Could you share how you install it?

With my installation (below), I still get this rather slow calculation.

And I checked the GPU visibility to Tensorflow:

import tensorflow as tf
tf.config.list_physical_devices('GPU')

gives --> [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

For installation, I used:

!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu
pipe = nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True) 

I used this installation because it was proposed in this Colab sheet: https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/component_examples/sentence_embeddings/NLU_BERT_sentence_embeddings_and_t-SNE_visualization_Example.ipynb#scrollTo=rBXrqlGEYA8G

Furthermore, the quick_start_google_colab.ipynb referenced here (https://nlp.johnsnowlabs.com/docs/en/install#google-colab-notebook) uses from sparknlp.pretrained import PretrainedPipeline, but I don't know how to load the model with it. Using pipe = PretrainedPipeline('en.embed_sentence.biobert.pmc_base_cased', gpu=True) gives an error: ...unexpected keyword argument 'gpu'
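Regarding that last error: in Spark NLP itself (as opposed to NLU), GPU use is selected when the Spark session is started, not on the pipeline constructor, which is consistent with PretrainedPipeline rejecting a gpu argument. A hedged sketch; the lazy import only keeps it loadable where sparknlp is absent, and 'explain_document_ml' is just an illustrative pretrained-pipeline name.

```python
def load_gpu_pipeline(name="explain_document_ml", lang="en"):
    # In Spark NLP (unlike NLU), the GPU flag is passed to sparknlp.start();
    # PretrainedPipeline has no gpu parameter, hence the
    # "unexpected keyword argument 'gpu'" error above.
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    sparknlp.start(gpu=True)                    # GPU is chosen here
    return PretrainedPipeline(name, lang=lang)  # takes a pipeline name + lang
```

Note that PretrainedPipeline expects the name of a pretrained *pipeline*, not an NLU model reference like 'en.embed_sentence.biobert.pmc_base_cased'; for NLU references, nlu.load(..., gpu=True) is the matching call.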

runfish5 commented Dec 03 '22 08:12