BERTopic
Running BERTopic on Apple Silicon (M1 & M2 chips, ARM64 architecture)
Situation
Apple Silicon chips (M1 & M2) are based on the ARM64 architecture (aka AArch64, not to be confused with AMD64). There are known issues with upstream dependencies for this architecture, for example numba. Depending on the extras you need, you may not always run into these issues. See also #1014 and #1765.
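To check which architecture your Python process actually sees (useful for telling a native macOS interpreter apart from one running inside a Linux container), a quick sketch:

```python
import platform

# Apple Silicon reports "arm64" under macOS and "aarch64" inside a
# Linux container or VM; both are ARM64 variants.
arch = platform.machine()
is_arm64 = arch.lower() in ("arm64", "aarch64")
print(arch, "-> ARM64?", is_arm64)
```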
Solution: Use VS Code Dev Containers
Using VS Code Dev Containers allows you to set up a Linux-based environment. To run BERTopic effectively, you need to be aware of two things:
- Make sure to use a Docker image specifically compiled for ARM64
- Make sure to use a `volume` instead of a bind mount, since the latter significantly reduces I/O speeds to disk
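As a sketch of the volume recommendation, a `devcontainer.json` can declare a named volume explicitly (the volume and target names below are illustrative, not prescribed by the images):

```json
{
  "mounts": [
    "source=bertopic-workspace,target=/workspaces,type=volume"
  ]
}
```

Alternatively, the VS Code command "Dev Containers: Clone Repository in Container Volume" places the workspace in a volume without any manual configuration.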
Using https://github.com/b-data/data-science-devcontainers addresses these issues out of the box (kudos to @benz0li). The proposed workflow is as follows:
- Install the `python-base` or `python-scipy` devcontainer
- Open VS Code, build the container
- Work in `/home/vscode` or `/workspaces`, as these locations are persisted. For example, using `pip install --user bertopic` installs all Python packages persistently.
Issues
Using Python 3.12 with the data-science-devcontainer, running the following code
```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Prepare embeddings
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Train BERTopic
topic_model = BERTopic(verbose=True).fit(docs, embeddings)

# Reduce dimensionality of embeddings, this step is optional
reduced_embeddings = UMAP(
    n_neighbors=10, n_components=2, min_dist=0.0, metric="cosine"
).fit_transform(embeddings)

# Run the visualization with the original embeddings
# topic_model.visualize_document_datamap(docs, embeddings=embeddings)

# Or, if you have reduced the original embeddings already:
fig = topic_model.visualize_document_datamap(
    docs, reduced_embeddings=reduced_embeddings
)
```
yields the following error:
```
2024-02-27 09:08:11,438 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-27 09:08:30,791 - BERTopic - Dimensionality - Completed ✓
2024-02-27 09:08:30,792 - BERTopic - Cluster - Start clustering the reduced embeddings
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[the warning above is repeated four more times]
2024-02-27 09:08:31,927 - BERTopic - Cluster - Completed ✓
2024-02-27 09:08:31,933 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-27 09:08:34,449 - BERTopic - Representation - Completed ✓
```
__The Kernel crashed while executing code in the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure.__
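The repeated `huggingface/tokenizers` warnings in the log are separate from the crash itself and can be silenced in the way the warning suggests: set `TOKENIZERS_PARALLELISM` before the tokenizer libraries are imported. A minimal sketch:

```python
import os

# Disable tokenizer parallelism *before* importing sentence_transformers /
# tokenizers; this suppresses the "process just got forked" warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# ... then import and use the libraries as usual, e.g.:
# from sentence_transformers import SentenceTransformer
```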
ℹ️ Change `PYTHON_VERSION` to `3.11` in the `devcontainer.json` to work with the latest patch release of Python 3.11 (currently 3.11.8).
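A devcontainer.json commonly passes the version as a build argument; a minimal sketch of the change (the exact file layout in the b-data devcontainers may differ):

```json
{
  "build": {
    "args": {
      "PYTHON_VERSION": "3.11"
    }
  }
}
```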
> - Work in `/home/vscode` or `/workspaces` as these locations are persisted.

When using an unmodified `devcontainer.json`: Work in `/home/vscode`.
👉 This is the home directory of user `vscode`.

> For example, using `pip install --user bertopic` installs all Python packages persistently.

Python packages are installed to the home directory by default.
👉 This is due to env variable `PIP_USER=1`.
See also
- Data Science Dev Containers > Notes: Tweaks > Environment variables
- https://github.com/b-data/data-science-devcontainers/issues/3#issuecomment-1609457302
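To see where user-level installs end up, a quick sketch (outside the container, `PIP_USER` will typically be unset):

```python
import os
import site

# With PIP_USER=1 set in the container, pip defaults to --user installs,
# which land in the per-user site-packages under the (persisted) home dir.
print("PIP_USER  =", os.environ.get("PIP_USER"))
print("user site =", site.getusersitepackages())
```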
@dkapitan Thank you for sharing your extensive work on this. It is greatly appreciated!
It's unfortunate that these additional steps are necessary in order to get it working nicely but it's great that there is at least a solution for users to follow.
I might be mistaken here, but I believe you mentioned elsewhere that this issue is also likely to appear when using datamapplot. Is that correct?
Indeed, I ran into these issues whilst working on the documentation for datamapplot. I have got it to work now, and will add a section to the FAQ.