IndexError: arrays used as indices must be of integer (or boolean) type

Open nrydanov opened this issue 7 months ago • 3 comments

Have you searched existing issues? 🔎

  • [x] I have searched and found no existing issues

Describe the bug

Actually, I found a similar but closed issue, so I would like to reopen it.

I'm trying to use this library to cluster news in a stream-like way, so when the system starts it contains zero elements.

It's not clear how many elements are needed for the clustering process to finish correctly.

I've spent a lot of time trying to confirm that this error (IndexError: arrays used as indices must be of integer (or boolean) type) PROBABLY occurs because there are too few elements in a topic, and I'm still debugging it.

The error says nothing about the root cause, and it would be nice to fix that...

Reproduction

import logging

import numpy as np
import redis.asyncio as redis
from minio import Minio
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

from config.settings import settings
from umap import UMAP
import asyncio
from bertopic.representation import KeyBERTInspired


logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logging.basicConfig(
    level=logging.INFO,  # or INFO, if DEBUG messages are not needed
    format="%(asctime)s %(levelname)s %(name)s %(message)s"
)

async def main():
    logger.info("Starting clustering service")
    r = redis.Redis(host=settings.redis.host, port=settings.redis.port, db=0)
    if not await r.ping():
        logger.error("Failed to connect to redis")
        exit(1)
    else:
        logger.info("Connected to redis")

    s3 = Minio(
        settings.minio.endpoint,
        access_key=settings.minio.access_key,
        secret_key=settings.minio.secret_key,
        secure=False,
    )

    pubsub = r.pubsub()

    await pubsub.subscribe("inbrief")

    model = SentenceTransformer(
        settings.embedding_model.model_name,
        trust_remote_code=settings.embedding_model.trust_remote_code
    )


    dim = model.encode("Hello world!", task="separation").shape[0]

    logger.debug("Embedding dimension: %s", dim)

    embeddings = np.empty((0, dim))
    texts = []

    umap = UMAP(n_neighbors=15, n_components=2, metric='cosine')
    bertopic = BERTopic(
        language=settings.embedding_model.language,
        embedding_model=model,
        representation_model=KeyBERTInspired()
    )


    while True:
        msg = await pubsub.get_message(timeout=None)
        if msg is None:
            break

        logger.debug("Received message: %s", msg)
        if msg["type"] != "message":
            continue


        filename = f"{msg['data'].decode('utf-8')}.json"

        resp = s3.get_object("inbrief", filename)

        payload = resp.json()

        new_texts = list(map(lambda x: x['text'], payload))
        new_embeddings = model.encode(new_texts, task="separation")
        embeddings = np.append(embeddings, new_embeddings, axis=0)
        texts.extend(new_texts)

        try:
            reduced_embeddings = umap.fit_transform(embeddings)
            bertopic.fit_transform(texts, embeddings=embeddings)

            bertopic.visualize_topics().write_html("static/topics.html")
            bertopic.visualize_documents(texts, reduced_embeddings=reduced_embeddings).write_html("static/documents.html")
        except Exception as e:
            logger.warning(f"Got an error while clustering, will try later: {e}", exc_info=True)
            continue

        logger.debug("Number of texts: %s", len(texts))



if __name__ == "__main__":
    asyncio.run(main())

BERTopic Version

0.17.0

nrydanov, May 23 '25 16:05

Actually, I found a similar but closed issue, so I would like to reopen it.

Could you share that/those issue(s)? It would help seeing similar issues.

I've spent a lot of time trying to confirm that this error (IndexError: arrays used as indices must be of integer (or boolean) type) PROBABLY occurs because there are too few elements in a topic, and I'm still debugging it.

Could you share the full stack trace? Without it, it's a bit tricky to understand what is happening here.

while True:

I'm a bit confused as to what's happening here. It seems you are trying to run the same BERTopic model multiple times using fit_transform, but fit_transform does not add topics; it creates an entirely new topic model. So you are not streaming here, because that function does not support it. You would have to use either .partial_fit or .merge_models (see docs here and here).

MaartenGr, May 30 '25 14:05

Yeah, sure. I was just very busy (with my master's work), so it was hard to give a complete description.

Previous issue on the same topic: https://github.com/MaartenGr/BERTopic/issues/1025

Actually, I've run into several issues when using BERTopic for my task:

  1. TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k. This occurs when there are too few records; as I understand it, the cause is a design limitation of UMAP. Still, it would be nice to get a high-level descriptive error rather than one raised somewhere underneath.
  2. IndexError: arrays used as indices must be of integer (or boolean) type

I'm unable to reproduce this issue right now, so I can't give a complete stack trace, but I remember the conclusions I reached:

  1. It occurs when using the visualize_topics function
  2. The exact lines of code where it happens (in the visualize_topics function):
    # Embed c-TF-IDF into 2D
    all_topics = sorted(list(topic_model.get_topics().keys()))
    indices = np.array([all_topics.index(topic) for topic in topics])

    embeddings, c_tfidf_used = select_topic_representation(
        topic_model.c_tf_idf_,
        topic_model.topic_embeddings_,
        use_ctfidf=use_ctfidf,
        output_ndarray=True,
    )
    embeddings = embeddings[indices]
  3. I inspected the embeddings values and didn't find anything suspicious
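One way this exact message can arise from the lines quoted above is sketched below. It assumes (this is not confirmed for this report) that the list of topics to plot ended up empty, for example because every document landed in the outlier topic -1. NumPy turns an empty list into a float64 array, and a float array is rejected as an index. The array shapes and variable values are made up for illustration.

```python
import numpy as np

# Stand-in for the topic embeddings matrix (3 rows, 2 dimensions); values are arbitrary.
embeddings = np.zeros((3, 2))

# Assumption: no non-outlier topics were selected, so the list to plot is empty.
topics = []
all_topics = [-1]
indices = np.array([all_topics.index(topic) for topic in topics])

# An empty list becomes a float64 array, which NumPy refuses to use as an index.
try:
    embeddings[indices]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

# With an explicit integer dtype, the same empty selection is valid and returns no rows.
safe_indices = np.array([all_topics.index(topic) for topic in topics], dtype=int)
selected = embeddings[safe_indices]

print(indices.dtype, error_message, selected.shape)
```

If this is indeed the cause, the values inside embeddings would look perfectly normal, which matches the observation that nothing suspicious showed up there.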

If I get the full stack trace, I'll post it here.

About streaming and so on... I understand that this code is a little bit shitty, but I don't have many entities to cluster, so fully retraining the model on each iteration is OK for me, as they arrive in infrequent batches. I've read the docs and understood that partial_fit doesn't fit my case (as I need to use HDBSCAN), and as I understand it, merge_models can give worse results compared with a fully retrained model.

Thanks for your attention.

nrydanov, May 31 '25 06:05

TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k. This occurs when there are too few records; as I understand it, the cause is a design limitation of UMAP. Still, it would be nice to get a high-level descriptive error rather than one raised somewhere underneath.

It's going to be tricky for me to debug without the full stack trace. That said, it might indeed relate to too few records for UMAP to train properly (although I'm not sure considering I can't see when or where that specific error popped up).

Was this issue also raised using the visualize_topics function? If so, then it is definitely worthwhile to make sure you have sufficient clusters first, considering that having a single topic is generally a problem regardless of this specific function.
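A minimal sketch of such a check, assuming the fitted model exposes get_topics() returning a dict keyed by topic id with -1 as the outlier topic. The helper names, the min_topics threshold of 2, and the stub class are hypothetical and only serve to make the example runnable without a trained model.

```python
def count_real_topics(topic_model):
    """Count topics other than the outlier topic (-1)."""
    return sum(1 for topic_id in topic_model.get_topics() if topic_id != -1)


def can_visualize(topic_model, min_topics=2):
    """Return True when there are at least min_topics non-outlier topics.

    min_topics=2 is an illustrative threshold, not a documented limit.
    """
    return count_real_topics(topic_model) >= min_topics


class _StubModel:
    """Hypothetical stand-in for a fitted topic model, used only for the demo."""

    def __init__(self, topics):
        self._topics = topics

    def get_topics(self):
        return self._topics


only_outliers = _StubModel({-1: []})
several_topics = _StubModel({-1: [], 0: [], 1: []})
print(can_visualize(only_outliers), can_visualize(several_topics))
```

In the loop from the reproduction, the visualize_topics call could be skipped with a log message whenever can_visualize(bertopic) is false.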

About streaming and so on... I understand that this code is a little bit shitty, but I don't have many entities to cluster, so fully retraining the model on each iteration is OK for me, as they arrive in infrequent batches. I've read the docs and understood that partial_fit doesn't fit my case (as I need to use HDBSCAN), and as I understand it, merge_models can give worse results compared with a fully retrained model.

I would still highly advise using merge_models, especially if you already have a single large dataset that you can train a model on. Then you can use merge_models to iteratively update that model if necessary. Generally, I wouldn't advise training a topic model on only a few documents. Rather, collect a couple hundred first and then train a model.
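The "collect first, then train" advice could be sketched roughly as follows. DocumentBuffer and its min_docs default of 200 are hypothetical names and values chosen for illustration; the right threshold depends on the data.

```python
class DocumentBuffer:
    """Accumulate incoming texts and report when enough have arrived to train."""

    def __init__(self, min_docs=200):
        # min_docs is an illustrative threshold, not a value prescribed by BERTopic.
        self.min_docs = min_docs
        self.texts = []

    def add(self, new_texts):
        """Store a batch and return True once the buffer holds at least min_docs texts."""
        self.texts.extend(new_texts)
        return len(self.texts) >= self.min_docs


# Small threshold so the demo is easy to follow.
buffer = DocumentBuffer(min_docs=3)
first_ready = buffer.add(["text a", "text b"])
second_ready = buffer.add(["text c"])
print(first_ready, second_ready)
```

In the streaming loop, fit_transform would only be called once add(...) returns True; earlier batches are simply stored.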

MaartenGr, Jun 06 '25 09:06