The 'X' parameter of normalize must be an array-like or a sparse matrix.

Open aerithnetzer opened this issue 10 months ago • 2 comments

Have you searched existing issues? 🔎

  • [x] I have searched and found no existing issues

Describe the bug

I have a very large dataset, so I have been saving and loading models, as is documented as good practice elsewhere. The problem is that after I load and merge these models, I cannot compute the class-based or time-dependent topic representations. It seems the c-TF-IDF information is not saved when a model is saved to disk.

Full error log:

Traceback (most recent call last):
  File "/Users/ysc4337/warlock/merging-test/main.py", line 56, in <module>
    topics_over_time = merged_model.topics_over_time(all_docs, all_timestamps)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ysc4337/warlock/merging-test/.venv/lib/python3.11/site-packages/bertopic/_bertopic.py", line 860, in topics_over_time
    global_c_tf_idf = normalize(self.c_tf_idf_, axis=1, norm="l1", copy=False)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ysc4337/warlock/merging-test/.venv/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 206, in wrapper
    validate_parameter_constraints(
  File "/Users/ysc4337/warlock/merging-test/.venv/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'X' parameter of normalize must be an array-like or a sparse matrix. Got None instead.

I have tried saving the c_tf_idf matrix as a workaround, but I could not get the code working.

Reproduction

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
import random
from datetime import datetime, timedelta

# Fetch the dataset
data = fetch_20newsgroups(subset="all")
documents = data.data
categories = data.target

# Generate synthetic timestamps
start_date = datetime(2000, 1, 1)
timestamps = [
    str(start_date + timedelta(days=random.randint(0, 7300)))
    for _ in range(len(documents))
]

# Combine data with timestamps and categories
data_with_metadata = list(zip(documents, timestamps, categories))

# Split data into two parts for training two separate models
data_part1 = data_with_metadata[: len(data_with_metadata) // 2]
data_part2 = data_with_metadata[len(data_with_metadata) // 2 :]

# Extract documents and timestamps from both parts
docs_part1 = [doc for doc, _, _ in data_part1]
docs_part2 = [doc for doc, _, _ in data_part2]
timestamps_part1 = [ts for _, ts, _ in data_part1]
timestamps_part2 = [ts for _, ts, _ in data_part2]

model1 = BERTopic(verbose=True)
model2 = BERTopic(verbose=True)

topics, _ = model1.fit_transform(docs_part1)
topics, _ = model2.fit_transform(docs_part2)

# Save both models to disk
model1.save("bertopic_model1")
model2.save("bertopic_model2")

# Reload the models
model1 = BERTopic.load("bertopic_model1")
model2 = BERTopic.load("bertopic_model2")

# Merge the models
merged_model = BERTopic.merge_models([model1, model2])

# Combine all documents and timestamps
all_docs = docs_part1 + docs_part2
all_timestamps = timestamps_part1 + timestamps_part2

# Note: the merged model is never refit here, so merged_model.c_tf_idf_ is still None

# Generate topics over time visualization
topics_over_time = merged_model.topics_over_time(all_docs, all_timestamps)
merged_model.visualize_topics_over_time(topics_over_time).write_html(
    "topics_over_time.html"
)

BERTopic Version

0.17.0

Apr 16 '25 17:04 aerithnetzer

The c-TF-IDF matrix is saved when saving the model to disk (which is done with pickle in your example). However, it is not combined when merging two models.

Although definitely not impossible, it is quite tricky to merge two c-TF-IDF models when those are likely to contain different vocabularies, input data, size of data, etc. I have created a federated version of TF-IDF in the past but it does require some significant code changes. Since working on BERTopic is essentially a hobby, I unfortunately do not have the time to implement such a feature at the moment.
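To illustrate why the two matrices cannot simply be stacked, here is a toy sketch in plain Python. It is not BERTopic's implementation. It assumes the c-TF-IDF weighting described in the BERTopic documentation: the L1-normalised term frequency within a topic, multiplied by log(1 + A / f_t), where A is the average number of words per topic and f_t is the frequency of term t across all topics. The function name `c_tf_idf`, the topic labels, and the word counts are all made up for illustration.

```python
import math
from collections import Counter


def c_tf_idf(counts_per_topic):
    """Toy c-TF-IDF: L1-normalised term frequency times log(1 + A / f_t)."""
    # Column order: the sorted union of all terms seen in these topics
    vocab = sorted({term for counts in counts_per_topic.values() for term in counts})
    # f_t: frequency of each term across all topics
    total = Counter()
    for counts in counts_per_topic.values():
        total.update(counts)
    # A: average number of words per topic
    avg_words = sum(total.values()) / len(counts_per_topic)
    weights = {}
    for topic, counts in counts_per_topic.items():
        n_words = sum(counts.values())
        weights[topic] = {
            term: (counts.get(term, 0) / n_words) * math.log(1 + avg_words / total[term])
            for term in vocab
        }
    return vocab, weights


# Hypothetical per-topic word counts from two separately trained models
model_a = {"a0": Counter(space=4, nasa=2), "a1": Counter(hockey=3, game=3)}
model_b = {"b0": Counter(space=1, windows=5), "b1": Counter(game=2, team=4)}

vocab_a, weights_a = c_tf_idf(model_a)
vocab_b, weights_b = c_tf_idf(model_b)
vocab_all, weights_all = c_tf_idf({**model_a, **model_b})

# The vocabularies differ, so the two matrices do not share columns
print(vocab_a)
print(vocab_b)
# The weight of a shared term in the same topic changes once the corpora are pooled
print(weights_a["a0"]["space"], weights_all["a0"]["space"])
```

In this toy setup the columns of the two matrices do not line up, and the weight of "space" in topic a0 depends on counts that only the other model has seen. A faithful merge would therefore need the underlying counts (or the documents), not just the two finished matrices.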

If you have the documents you trained both models on, you could try to create a new c-TF-IDF by simply running .update_topics on the merged_model. I'm not entirely sure but that might suffice here.
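A minimal sketch of that suggestion, with the same caveat that it has not been verified on a merged model. `rebuild_ctfidf` is a hypothetical helper name. `transform` and `update_topics` are existing BERTopic methods, but assigning topics with `transform` before calling `update_topics` is an assumption made here, as is passing the predictions through the `topics` argument.

```python
def rebuild_ctfidf(merged_model, docs):
    """Recompute the c-TF-IDF matrix of a merged model from its documents.

    Hypothetical helper: `merged_model` is expected to be the BERTopic
    instance returned by BERTopic.merge_models, and `docs` the documents
    that the original models were trained on.
    """
    # Only rebuild when the matrix is missing, as in the traceback above
    if getattr(merged_model, "c_tf_idf_", None) is not None:
        return merged_model
    # Assumption: assign every document to a merged topic via transform
    topics, _ = merged_model.transform(docs)
    # update_topics recomputes the topic representations from docs grouped by topic
    merged_model.update_topics(docs, topics=[int(t) for t in topics])
    return merged_model


# Intended use with the reproduction script (not run here):
# merged_model = rebuild_ctfidf(merged_model, all_docs)
# topics_over_time = merged_model.topics_over_time(all_docs, all_timestamps)
```

Running `transform` over the full dataset can be slow for very large corpora, and the resulting topic assignments may differ from those of the original models, so the output is worth checking before relying on it.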

Apr 24 '25 11:04 MaartenGr