
GPU saved model gives model.load() error

Open ghost opened this issue 2 years ago • 7 comments

Hi,

I first created instances of GPU-accelerated UMAP and HDBSCAN (on Google Colab Pro) using:

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

And then used them in BERTopic:

model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, calculate_probabilities=False, nr_topics=num_topics)
topics, probs = model.fit_transform(docs)

I then saved the model using model.save("name")

While loading the model (on my local machine, macOS) I am getting the following error:

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Any help would be greatly appreciated, thanks!

ghost · Jun 19 '22 17:06

I think the issue is that there are internal differences in the outputs of HDBSCAN and/or UMAP depending on the processor type. You will get the same problem whenever you save a BERTopic model created on one processor type and deserialize it on a different one. I'm pretty sure there is no workaround. Why not just move everything over to Colab+? You won't have this issue there.

drob-xx · Jun 19 '22 18:06

I was trying to load it on my local machine because I want the BERTopic visualisations. If I load the model on Google Colab and try to get the visualisations using:

fig = model.visualize_topics()
fig.show()

I get a tcmalloc error but my runtime does not restart. I was not able to find a fix for this either.

ghost · Jun 19 '22 18:06

How large is your corpus, and what are your BERTopic settings when Colab crashes? I'm seeing the same problem on my Colab+ instance. I'm guessing it is simply a memory problem, as in 'not enough'. If you are interested, I can provide you with another visualization method that has a smaller memory footprint, which might be interesting (but perhaps off-topic).

drob-xx · Jun 19 '22 19:06

My corpus is about 1.8 million documents. I'm not sure what you mean by BERTopic settings. I was trying it out with about 250 topics, but I'm running coherence experiments to find the optimal number of topics to separate my data into. What is this other visualisation method? Sounds interesting!

ghost · Jun 19 '22 21:06

I have a corpus of 30K documents, and when I tried to visualize after seeing your post, it crashed (multiple times) on Colab+ with a memory problem. I didn't look into the exact cause, but it was obviously memory related. I don't use the BERTopic visualization tools that much right now, so I'm not that up to date on them.

BERTopic.get_params() is your friend.
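For example (assuming model is your fitted BERTopic instance), this prints the arguments the model was constructed with:

# Quick way to see the settings a fitted BERTopic model was built with
print(model.get_params())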

You might try selecting a much smaller sample set. I'm just spitballing here, but if you think you have around 250 topics, then a sample as small as 25K documents (an average of 100 docs per topic) might work just as well; see the sketch below.
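Just a sketch of what I mean, assuming docs is your full list of documents:

import random

# Illustrative only: fit/visualize on a ~25K random sample instead of the full corpus
random.seed(42)
sample_docs = random.sample(docs, 25000)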

I've been doing a lot of work with visualizing the embeddings themselves. For me this has been revelatory. Here are some code snippets that should work out of the box, more or less:

import pandas as pd

# Hard part - you need the embeddings. There are other ways of doing this, but the hacky
# approach below works. This code is clipped out of BERTopic: it is where BERTopic gets
# the embeddings, so you need to capture them.

# My fitted model is BERTNewsModel and the captured embeddings go in BERTNewsEmbeddings -
# just substitute yours. NewsDF['Content'] is just a raw string for each doc.

documents = pd.DataFrame({"Document": NewsDF['Content'],
                          "ID": range(NewsDF.shape[0]),
                          "Topic": None})
BERTNewsEmbeddings = BERTNewsModel._extract_embeddings(documents.Document,
                                                       method='document',
                                                       verbose=True)

# Now you have to reduce to 2D. You can do this with UMAP, but TSNE is 'prettier' in some
# ways, though slower. There are arguments for either. You can also use any other algorithm
# you want - try PaCMAP for something different!

# Here's with UMAP
# (no need to install umap-learn separately if you already installed BERTopic)

import umap

BERT2DReducer = umap.UMAP(n_components=2)
BERT2DReducer.fit(BERTNewsEmbeddings)

# Here's with TSNE

# from sklearn.manifold import TSNE
# X_embedded = TSNE(n_components=2, learning_rate='auto',init='random').fit_transform(BERTNewsEmbeddings)

# Now that you have a 2D representation, put it in a data format that you can easily graph

BERT2DDF = pd.DataFrame()
BERT2DDF['bert_x'] = BERT2DReducer.embedding_[:, 0]
BERT2DDF['bert_y'] = BERT2DReducer.embedding_[:, 1]

# For added viz you can add the original text as well
BERT2DDF['text'] = NewsDF['Content'].values

# And of course you should show the topics themselves.

# You can run HDBSCAN by itself, but if you have a BERTopic model you can reuse its
# hdbscan_model to stay synchronized with what BERTopic is up to. Remember that the call
# below runs against the 5D UMAP reduction that is the BERTopic default, not the 2D
# reduction above, which is only for the viz.

# You can substitute other HDBSCAN params to see what the impact would be. Remember that
# if you access the BERTopic model this way and change parameters etc., you will knock
# the underlying BERTopic model out of sync.

BERTNewsModel.hdbscan_model.fit(BERTNewsModel.umap_model.embedding_)

# A couple of things about the display params below:
# I convert the topic labels from ints to strs - this means the plotly graph will let you
# click on topics to turn them on/off.

# The 'hover_data' option lets you see each document that is projected onto the model, so
# you can try to figure out why something was not categorized, or was categorized one way
# rather than another. You may want to cut down on the amount of text and add line
# breaks (<br>) for readability.

import plotly.express as px

fig = px.scatter(BERT2DDF, x='bert_x', y='bert_y',
                 color=[str(topic) for topic in BERTNewsModel.hdbscan_model.labels_], 
                 width=1000, 
                 height=850,
                 hover_data=['text'],)

fig.update_traces(marker=dict(size=3),
                  selector=dict(mode='markers')
)
fig.show()

I've found this to be a powerful way to see what is going on and understand my corpora. It is trivial (and very fast) to change HDBSCAN params and see the output. Ultimately, if you get serious about tuning the parameters, you'll have to do dozens of runs, evaluating the number of topics, the size of topics, and the number of outliers (or anything else you are interested in). To do that you need a tool; I use wandb.ai, roughly as in the sketch below. Let me know if you have any questions. Would love to know about your experience.
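A rough sketch of what such a sweep could look like (the parameter values and logged metrics are just illustrative, and I'm using the CPU hdbscan library here rather than whatever clusterer your BERTopic model was built with):

import wandb
from hdbscan import HDBSCAN

# Illustrative sweep over one HDBSCAN parameter, clustering the model's 5D UMAP embedding
for min_cluster_size in [10, 25, 50, 100]:
    run = wandb.init(project="bertopic-tuning",
                     config={"min_cluster_size": min_cluster_size})
    clusterer = HDBSCAN(min_cluster_size=min_cluster_size)
    clusterer.fit(BERTNewsModel.umap_model.embedding_)
    labels = clusterer.labels_
    run.log({"n_topics": len(set(labels)) - (1 if -1 in labels else 0),
             "n_outliers": int((labels == -1).sum())})
    run.finish()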

drob-xx · Jun 19 '22 23:06

Thanks for your detailed reply! I'll definitely check this out soon.

ghost · Jun 23 '22 17:06

RE: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

This particular error that you are getting in the original post is actually unrelated to HDBSCAN and UMAP. The embedding model was trained on a CUDA device and is being loaded on a CPU. See a solution here https://github.com/MaartenGr/BERTopic/issues/384
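One workaround to try (just a sketch; see the linked issue for details): save the model without the embedding model, then pass the embedding model back in by name when loading on the CPU machine. This assumes you used BERTopic's default sentence-transformers model; substitute whatever embedding model you actually used.

from bertopic import BERTopic

# On the GPU (Colab) side: skip serializing the CUDA-backed embedding model
model.save("name", save_embedding_model=False)

# On the CPU (local) side: supply the embedding model again by name
loaded_model = BERTopic.load("name", embedding_model="all-MiniLM-L6-v2")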

You'll also want to be aware that BERTopic has a separate issue where loading an existing model can produce different results: https://github.com/MaartenGr/BERTopic/issues/482

recurrence · Jul 05 '22 18:07

Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!

MaartenGr · Sep 27 '22 08:09