
Transform on new data is quite slow

ginward opened this issue 2 years ago • 5 comments

I am currently calling .fit() on the training dataset and .transform() on the out-of-sample dataset. There are about 4 million observations in the training data and 5 million observations in the out-of-sample data. However, it seems BERTopic can finish .fit() in a few hours while .transform() takes a very long time (400 hours).

I have tried calling .fit_transform() on the entire data (training + out-of-sample), and it can also complete the job within a few hours. However, the task I am running now requires me not to use the out-of-sample data during model training, so this is not optimal.

Any idea why transforming on new data is slow, and will parallelisation help?

ginward avatar Dec 15 '21 11:12 ginward

In the v0.9.4 release of BERTopic, each important step in .transform() is now logged if you set verbose=True, so you can see which specific step slows down.

I believe there are two ways it might slow down. First, and this happens most frequently, you may have set calculate_probabilities=True. This in turn runs hdbscan.membership_vector, which can be quite time-consuming. Second, there have been mentions of hdbscan.approximate_predict being slow, but I have not experienced this myself. As you mentioned, parallelization might help, but that highly depends on where exactly the model slows down. Similarly, batching could help if the issue is the former, since the model then does not need to generate such a large probability matrix.
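If the probability matrix is indeed the bottleneck, the batching idea above can be sketched with a small helper. Note that `transform_in_batches` and `DummyModel` are illustrative names, not part of BERTopic's API; the only assumption is BERTopic's documented `.transform()` signature, which returns a `(topics, probabilities)` pair:

```python
def transform_in_batches(model, docs, batch_size=10_000):
    """Call model.transform on successive slices of docs and concatenate the topics.

    Keeps each per-batch probability matrix small instead of building one
    matrix over all documents at once.
    """
    topics = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        batch_topics, _ = model.transform(batch)  # BERTopic returns (topics, probs)
        topics.extend(batch_topics)
    return topics


class DummyModel:
    """Stand-in with BERTopic's transform signature, for demonstration only."""

    def transform(self, docs):
        return [len(d) % 3 for d in docs], None  # fake topic ids, no probabilities


docs = [f"document {i}" for i in range(25)]
print(transform_in_batches(DummyModel(), docs, batch_size=10))
```

Batching this way trades a little Python-loop overhead for bounded memory per call, which matters most when `calculate_probabilities=True` produces a documents-by-topics matrix.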

MaartenGr avatar Dec 15 '21 11:12 MaartenGr

I have had some success using cuML for GPU-accelerated UMAP, which has led to a 20x speed-up. However, cuML does not yet support predicting on new data with HDBSCAN.

Rysias avatar Jan 03 '22 12:01 Rysias

@ginward @Rysias What might be interesting is to have a look at cuBERTopic, created by @mayankanand007 and @VibhuJawa. It is a GPU-accelerated version of BERTopic that uses RAPIDS primarily to speed up HDBSCAN and UMAP, as those are the main bottlenecks of the application.

I have not tested it out yet but I followed the PR and it seems that quite a bit of work has been put into it to make sure it gets that speedup.

MaartenGr avatar Jan 03 '22 12:01 MaartenGr

Thanks for the ping, as well as the work you have done on the library, @MaartenGr.

which is a GPU-accelerated version of BERTopic using RAPIDS to primarily speed up HDBSCAN and UMAP as those are the main bottlenecks of the application.

Based on our preliminary benchmarking, we are seeing a 41x speed-up on the UMAP step (141 s on GPU vs 5898 s on CPU). HDBSCAN is also faster, but by a smaller margin of about 3x (115 s vs 385 s). This is because HDBSCAN on GPUs is currently not that fast on lower-dimensional vectors (5 dimensions here, which is the default for the input to HDBSCAN).

That said, the code is still a WIP, and there is a lot of low-hanging fruit in other places (like topic and embedding creation) which I plan to address this coming week.

I also wanted to point out that the goal of the project was a POC to see whether we get a good speed-up with GPUs for topic modeling workflows. Eventually, we would love to see this work upstreamed into the BERTopic library directly to enable easy GPU acceleration for all users.

VibhuJawa avatar Jan 03 '22 13:01 VibhuJawa

It seems that cuBERTopic doesn't (yet?) have a transform method implemented, unfortunately. However, I think that is just a matter of when RAPIDS implements prediction on new data for their HDBSCAN implementation.

Rysias avatar Jan 04 '22 09:01 Rysias

With the possibility of using cuML in BERTopic and their latest release, this should now be resolved.
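For readers landing here later: BERTopic accepts custom dimensionality-reduction and clustering models via its `umap_model` and `hdbscan_model` parameters, so cuML's GPU implementations can be swapped in. A minimal sketch, assuming a working RAPIDS install and a CUDA GPU (the hyperparameter values below are illustrative, not tested benchmarks):

```python
# Sketch: plugging cuML's GPU-accelerated UMAP and HDBSCAN into BERTopic.
# Requires RAPIDS (cuml) and a CUDA-capable GPU; parameter values are examples.
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
# topic_model.fit(train_docs) and topic_model.transform(new_docs) then run
# the UMAP and HDBSCAN steps on the GPU.
```

`prediction_data=True` is what allows cuML's HDBSCAN to assign clusters to unseen documents, which is exactly the `.transform()` path this issue was about.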

MaartenGr avatar Jan 09 '23 12:01 MaartenGr