
Adding support for cuML's nn-descent with UMAP

jemather opened this issue 8 months ago · 5 comments

Feature request

Adding support for nearest neighbor descent when using cuML UMAP

Motivation

RAPIDS cuML added support for nearest-neighbor descent in UMAP, which approximates the full ("brute force") nearest-neighbor graph (NVIDIA has a good explainer here). This makes UMAP much faster to run, and in my use case it made running UMAP feasible at all, as I ran out of memory with the brute-force approach.

However, this isn't an option that can be set when initializing the model and passing it to BERTopic. Instead, it requires running fit_transform as a single step and including the data_on_host=True parameter, while the current version of BERTopic separates fit and transform in _reduce_dimensionality. I monkey-patched it myself and got it working, and I think fixing it would be relatively easy for the full model fit. I'm unsure whether it would work with a partial fit, though, so I did not implement it for partial_fit.
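
For context, here's roughly what this looks like when calling cuML directly (a minimal sketch; build_algo and data_on_host are cuML options, embeddings stands for your precomputed document embeddings, so check the cuML docs for your version):

from cuml.manifold import UMAP

# Ask cuML to build the KNN graph with nn-descent instead of brute force
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, build_algo="nn_descent")

# nn-descent is only applied during fit/fit_transform; data_on_host=True
# keeps the raw data in CPU memory so very large datasets still fit
reduced_embeddings = umap_model.fit_transform(embeddings, data_on_host=True)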

Your contribution

    def _reduce_dimensionality(
        self,
        embeddings: Union[np.ndarray, csr_matrix],
        y: Union[List[int], np.ndarray] = None,
        partial_fit: bool = False,
        data_on_host: bool = True,  # only compatible with nn_descent, an option set on the UMAP model, so setting this may need to be more complex than what I am doing here
    ) -> np.ndarray:
        """Reduce dimensionality of embeddings using UMAP and train a UMAP model.

        Arguments:
            embeddings: The extracted embeddings using the sentence transformer module.
            y: The target class for (semi)-supervised dimensionality reduction
            partial_fit: Whether to run `partial_fit` for online learning
            data_on_host: Whether to keep the raw data in host (CPU) memory when
                          fitting cuML UMAP with nn-descent

        Returns:
            umap_embeddings: The reduced embeddings
        """
        logger.info("Dimensionality - Fitting the dimensionality reduction algorithm")
        # Partial fit
        if partial_fit:
            if hasattr(self.umap_model, "partial_fit"):
                self.umap_model = self.umap_model.partial_fit(embeddings)
            elif self.topic_representations_ is None:
                self.umap_model.fit(embeddings)
            umap_embeddings = self.umap_model.transform(embeddings)

        # Regular fit
        else:
            try:
                # cuML UMAP needs y to be a numpy array
                y = np.array(y) if y is not None else None
                umap_embeddings = self.umap_model.fit_transform(embeddings, y=y, data_on_host=data_on_host)
            except TypeError:
                # Fall back for models whose fit_transform doesn't accept y
                umap_embeddings = self.umap_model.fit_transform(embeddings, data_on_host=data_on_host)

        logger.info("Dimensionality - Completed \u2713")
        return np.nan_to_num(umap_embeddings)
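
For reference, the monkey patch itself is just reassigning the method (a sketch; docs and embeddings are assumed to be your own documents and precomputed embeddings):

from bertopic import BERTopic

# Replace the built-in method with the patched version defined above
BERTopic._reduce_dimensionality = _reduce_dimensionality

topic_model = BERTopic(umap_model=umap_model)
topics, probs = topic_model.fit_transform(docs, embeddings)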

jemather avatar Apr 14 '25 03:04 jemather

Beginning with v25.02, cuML's default behavior is to use nn-descent for UMAP fit/fit_transform if the dataset is >50K rows, which should hopefully help with at least part of this pipeline today.

However, if you run fit and transform independently, the transform must currently use standard brute-force KNN. If there's an opportunity to unify these steps in BERTopic into a single fit_transform, that'd be fantastic.

(As a note, the raw data doesn't need to be held on the CPU via data_on_host=True, though this can help with handling datasets that might otherwise be too large or use up significant GPU memory.)
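
To illustrate the difference (a sketch assuming cuML v25.02+ defaults; X is a large embedding array):

from cuml.manifold import UMAP

reducer = UMAP(n_components=5)

# Single step: nn-descent is picked automatically when X has >50K rows (v25.02+)
reduced = reducer.fit_transform(X)

# Separate steps: fit can use nn-descent, but transform currently
# falls back to brute-force KNN
reducer.fit(X)
reduced = reducer.transform(X)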

beckernick avatar Apr 15 '25 17:04 beckernick

It should be straightforward to switch from separate fit/transform calls to fit_transform (as long as it's supported by the dimensionality reduction algorithm).

It's fortunate that data_on_host isn't necessary with the new release of cuML, as adding this parameter would actually be quite tricky in BERTopic (API-wise).

@beckernick So if I'm not mistaken, going to fit_transform should already solve a big part of the problem, right? If so, then this should be a quick fix.
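
A rough sketch of what that change could look like in the regular-fit branch of _reduce_dimensionality (my reading of the proposed fix, not a final implementation):

# Current behavior (simplified): separate fit and transform,
# so cuML's transform falls back to brute-force KNN
self.umap_model.fit(embeddings, y=y)
umap_embeddings = self.umap_model.transform(embeddings)

# Proposed: a single fit_transform, letting cuML use nn-descent
umap_embeddings = self.umap_model.fit_transform(embeddings, y=y)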

MaartenGr avatar Apr 16 '25 07:04 MaartenGr

> @beckernick So if I'm not mistaken, going to fit_transform should already solve a big part of the problem, right? If so, then this should be a quick fix.

Nice! Yes, using fit_transform will fully utilize the nn-descent algorithm like in the blog.

beckernick avatar Apr 16 '25 13:04 beckernick

This may be too much of an edge case to make it worthwhile, but I'm wondering if it might be possible to add component-specific kwargs for each of the components, which could then be passed to their fit/transform/fit_transform calls, something like:

def __init__(
        self,
        language: str = "english",
        top_n_words: int = 10,
        n_gram_range: Tuple[int, int] = (1, 1),
        min_topic_size: int = 10,
        nr_topics: Union[int, str] = None,
        low_memory: bool = False,
        calculate_probabilities: bool = False,
        seed_topic_list: List[List[str]] = None,
        zeroshot_topic_list: List[str] = None,
        zeroshot_min_similarity: float = 0.7,
        embedding_model=None,
        embedding_model_args: dict = None,
        umap_model=None,
        umap_model_args: dict = None,
        hdbscan_model=None,
        hdbscan_model_args: dict = None,
        vectorizer_model: CountVectorizer = None,
        vectorizer_model_args: dict = None,
        ctfidf_model: TfidfTransformer = None,
        ctfidf_model_args: dict = None,
        representation_model: BaseRepresentation = None,
        representation_model_args: dict = None,
        verbose: bool = False,
    ):
...

# Default to empty dicts so they can be safely unpacked with **
self.umap_model_args = umap_model_args or {}
self.hdbscan_model_args = hdbscan_model_args or {}
self.vectorizer_model_args = vectorizer_model_args or {}
self.ctfidf_model_args = ctfidf_model_args or {}
self.representation_model_args = representation_model_args or {}

...
 def _reduce_dimensionality(
        self,
        embeddings: Union[np.ndarray, csr_matrix],
        y: Union[List[int], np.ndarray] = None,
        partial_fit: bool = False,
    ) -> np.ndarray:
...
# Model-specific kwargs (e.g. data_on_host) are forwarded to fit_transform
umap_embeddings = self.umap_model.fit_transform(embeddings, **self.umap_model_args)

This would allow using data_on_host if needed, but it could also support other model-specific fitting options for any model that still supports the fit/transform methods.
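
Hypothetical usage if this were adopted (umap_model_args does not exist in BERTopic today; it's the name proposed above):

from bertopic import BERTopic
from cuml.manifold import UMAP

topic_model = BERTopic(
    umap_model=UMAP(n_components=5, build_algo="nn_descent"),
    umap_model_args={"data_on_host": True},  # forwarded to umap_model.fit_transform
)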

jemather avatar Apr 16 '25 15:04 jemather

@jemather In BERTopic, I typically want to reduce the hyperparameter space as much as possible. To me, that makes BERTopic easier to use for newcomers and prevents users from getting overwhelmed by the number of options (whilst still allowing for modularity and extensive customization).

Thus, I would prefer the .fit_transform solution here. If you agree, I can start working on a PR (although that might take a while, since I work on BERTopic on the side when I have time), or someone else can take this up.

MaartenGr avatar Apr 24 '25 11:04 MaartenGr