Using NLU for BioBERT embeddings -- takes a really long time on a list of 10,000 words, and on 1 word
Hi, so we are working on generating BioBERT embeddings for our project. When we run it on a single word, it takes about a second or so. When we run it on a list of 10,000 words, it either times out or takes upwards of hours to run. Is this normal? Below is how we are using it:
def load_biobert(self):
    # Load BioBERT model (for sentence-type embeddings)
    self.logger.info("Loading BioBERT model...")
    start = time.time()
    biobert = nlu.load('en.embed_sentence.biobert.pmc_base_cased')
    end = time.time()
    self.logger.info('done (BioBERT loading time: %.2f seconds)', end - start)
    return biobert

def get_biobert_embeddings(self, strings):
    # Loop over the input strings and collect one embedding per string
    embedding_list = []
    for string in strings:
        self.logger.debug("...Generating embedding for: %s", string)
        embedding_list.append(self.get_biobert_embedding(string))
    return embedding_list

def get_biobert_embedding(self, string):
    # Run the pipeline on a single string and return its sentence embedding
    embedding = self.biobert.predict(string, output_level='sentence', get_embeddings=True)
    return embedding.sentence_embedding_biobert.values[0]
Hi @krico1, large embeddings like BioBERT can be quite slow because of the large deep learning models behind them, but you can achieve roughly a 10x speedup by using NLU in GPU mode.
All you need to do is set gpu=True and make sure the GPU is visible to TensorFlow beforehand.
Then you can just call the following to get the GPU pipeline:
nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True)
See this notebook as a reference.
Also note: if you have a large dataset at hand, it will be faster to feed NLU all the data at once instead of one string at a time.
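As an illustration of both points, a minimal sketch (assuming predict accepts a list of strings, and using strings as a hypothetical variable holding the 10,000 words) could look like this:

import nlu

# Load the model once with GPU enabled, then predict on all strings in a
# single call instead of looping word by word.
pipe = nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True)
df = pipe.predict(strings, output_level='sentence', get_embeddings=True)
embeddings = df.sentence_embedding_biobert.values  # one vector per input row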
@C-K-Loan Hi! Unfortunately, I am not able to obtain the embeddings (even when adding get_embeddings=True). I tried multiple models and additional parameters, but with no success. In particular, nlu.load(biobert).predict("random sentence", output_level='token', get_embeddings=True) does not give the expected output. I thought the embedding column was being dropped, so I added drop_irrelevant_cols=False, but still no success.
Thank you!
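One way to narrow this down might be to inspect which columns predict() actually returns; a rough sketch (assuming the same model reference as above, not a confirmed fix):

import nlu

# Sketch: print the columns returned by predict(), with drop_irrelevant_cols=False,
# to check whether an embedding column (e.g. sentence_embedding_biobert) is present.
pipe = nlu.load('en.embed_sentence.biobert.pmc_base_cased')
df = pipe.predict("random sentence", output_level='sentence',
                  get_embeddings=True, drop_irrelevant_cols=False)
print(df.columns)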
@C-K-Loan I have the same problem as @MargheCap. I assume it has something to do with how we install the nlu package. Could you share how you install it?
With my installation (below), I still get this rather slow calculation.
I also checked GPU visibility to TensorFlow:
import tensorflow as tf
tf.config.list_physical_devices('GPU')
which gives: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
For installation, I used:
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu
pipe = nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True)
I used this installation because it was proposed in this Colab notebook: https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/component_examples/sentence_embeddings/NLU_BERT_sentence_embeddings_and_t-SNE_visualization_Example.ipynb#scrollTo=rBXrqlGEYA8G
Furthermore, the quick_start_google_colab.ipynb referenced here (https://nlp.johnsnowlabs.com/docs/en/install#google-colab-notebook) uses from sparknlp.pretrained import PretrainedPipeline, but I don't know how to load the model that way. Using pipe = PretrainedPipeline('en.embed_sentence.biobert.pmc_base_cased', gpu=True)
gives an error: ...unexpected keyword argument 'gpu'
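If the goal is to use PretrainedPipeline from Spark NLP directly, my understanding (an assumption, not confirmed in this thread) is that the gpu flag belongs to the session start rather than to PretrainedPipeline, while for NLU it is passed to nlu.load as shown above. A rough sketch; 'explain_document_dl' is only a placeholder pipeline name for illustration, not the BioBERT reference:

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Request GPU when starting the Spark NLP session instead of passing gpu= to
# PretrainedPipeline (which rejects that keyword, per the error above).
spark = sparknlp.start(gpu=True)

# Placeholder pretrained pipeline name, used only to illustrate the call; the
# BioBERT sentence-embedding reference above is an NLU reference and is loaded
# via nlu.load('en.embed_sentence.biobert.pmc_base_cased', gpu=True).
pipe = PretrainedPipeline('explain_document_dl', lang='en')
result = pipe.annotate("random sentence")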