
Running BERTopic on Apple Silicon (M1 & M2 chips, ARM64 architecture)

Open dkapitan opened this issue 11 months ago • 4 comments

Situation

Apple Silicon chips (M1 & M2) are based on the ARM64 architecture (aka AArch64, not to be confused with AMD64). There are known issues with upstream dependencies for this architecture, for example numba. Depending on which extras you need, you may not always run into these issues. See also #1014 and #1765.
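A quick way to confirm which architecture your Python interpreter is actually running on (an x86_64 build under Rosetta 2 reports differently than a native ARM64 build):

```python
import platform

# On Apple Silicon macOS a native build prints 'arm64'; inside an
# ARM64 Linux container it prints 'aarch64'; an x86_64 build
# (e.g. running under Rosetta 2) prints 'x86_64'.
print(platform.machine())
```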

Solution: Use VS Code Dev Containers

Using VS Code Dev Containers allows you to set up a Linux-based environment. To run BERTopic effectively, you need to be aware of two things:

  • Make sure to use a Docker image specifically compiled for ARM64
  • Make sure to use a volume instead of a bind mount, since the latter significantly reduces disk I/O speed

Using https://github.com/b-data/data-science-devcontainers addresses these issues out of the box (kudos to @benz0li). The proposed workflow is as follows:

  • Install the python-base or python-scipy devcontainer
  • Open VS Code and build the container
  • Work in /home/vscode or /workspaces, as these locations are persisted. For example, pip install --user bertopic installs all Python packages persistently.
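For reference, a hypothetical devcontainer.json fragment illustrating the two points above (an ARM64-compatible image plus a named volume rather than a bind mount). The image tag is an assumption; check the b-data repository for the current image names and tags:

```json
{
  "image": "glcr.b-data.ch/python/scipy:latest",
  "workspaceMount": "source=workspaces,target=/workspaces,type=volume",
  "workspaceFolder": "/workspaces"
}
```

The `workspaceMount` string uses the standard Dev Containers mount syntax; `type=volume` is what avoids the slow bind-mount I/O path on macOS.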

Issues

Running the following with Python 3.12 in the data-science-devcontainer

from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Prepare embeddings
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Train BERTopic
topic_model = BERTopic(verbose=True).fit(docs, embeddings)

# Reduce dimensionality of embeddings; this step is optional
reduced_embeddings = UMAP(
    n_neighbors=10, n_components=2, min_dist=0.0, metric="cosine"
).fit_transform(embeddings)

# Run the visualization with the original embeddings
# topic_model.visualize_document_datamap(docs, embeddings=embeddings)

# Or, if you have reduced the original embeddings already:
fig = topic_model.visualize_document_datamap(
    docs, reduced_embeddings=reduced_embeddings
)

yields the following error:

2024-02-27 09:08:11,438 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-27 09:08:30,791 - BERTopic - Dimensionality - Completed ✓
2024-02-27 09:08:30,792 - BERTopic - Cluster - Start clustering the reduced embeddings
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[the huggingface/tokenizers warning above is repeated four more times]
2024-02-27 09:08:31,927 - BERTopic - Cluster - Completed ✓
2024-02-27 09:08:31,933 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-27 09:08:34,449 - BERTopic - Representation - Completed ✓

__The Kernel crashed while executing code in the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure.__
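The tokenizers warning in the log suggests its own workaround (though this silences the warning rather than fixing the kernel crash itself): set TOKENIZERS_PARALLELISM before any forking happens. A minimal sketch:

```python
import os

# Set this at the very top of the notebook/script, before any tokenizer
# is used, as suggested by the huggingface/tokenizers warning above.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```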

dkapitan avatar Feb 27 '24 10:02 dkapitan

ℹ️ Change PYTHON_VERSION to 3.11 in the devcontainer.json to work with the latest patch release of Python 3.11 – currently 3.11.8.
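The exact location of PYTHON_VERSION depends on the template; in the b-data templates the Python version is typically passed as a build argument. A hypothetical devcontainer.json fragment, assuming the template exposes PYTHON_VERSION as a build arg:

```json
{
  "build": {
    "dockerfile": "Dockerfile",
    "args": {
      "PYTHON_VERSION": "3.11"
    }
  }
}
```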

benz0li avatar Feb 27 '24 10:02 benz0li

  • Work in /home/vscode or /workspaces as these locations are persisted.

When using an unmodified devcontainer.json: Work in /home/vscode.
👉 This is the home directory of user vscode.

For example, using pip install --user bertopic installs all Python packages persistently.

Python packages are installed to the home directory by default.
👉 This is due to env variable PIP_USER=1.
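To see where per-user packages land (a quick sanity check, not a required step), you can ask Python for the user site-packages directory; with PIP_USER=1 this is where pip install bertopic installs to:

```shell
# Print the per-user site-packages directory; under /home/vscode
# this location is persisted across container rebuilds.
python3 -m site --user-site
```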


See also

benz0li avatar Feb 27 '24 13:02 benz0li

@dkapitan Thank you for sharing your extensive work on this. It is greatly appreciated!

It's unfortunate that these additional steps are necessary to get it working nicely, but it's great that there is at least a solution for users to follow.

I might be mistaken, but I believe you mentioned elsewhere that this issue is also likely to appear when using datamapplot. Is that correct?

MaartenGr avatar Mar 03 '24 10:03 MaartenGr

Indeed, I ran into these issues whilst working on the documentation for datamapplot. I have got it working now and will add a section to the FAQ.

dkapitan avatar Mar 03 '24 12:03 dkapitan