BERTopic
Running BERTopic on Apple Silicon (M1 & M2 chips, ARM64 architecture)
Situation
Apple Silicon chips (M1 & M2) are based on the ARM64 architecture (aka AArch64, not to be confused with AMD64). There are known issues with upstream dependencies for this architecture, for example numba. Depending on the extras you need, you may not always run into these issues. See also #1014 and #1765.
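To check which architecture your Python process actually sees (useful for telling a native macOS interpreter apart from one running inside a Linux container), a quick sketch:

```python
import platform

# Apple Silicon reports "arm64" under macOS and "aarch64" inside a
# Linux container or VM; both are ARM64 variants.
arch = platform.machine()
is_arm64 = arch.lower() in ("arm64", "aarch64")
print(arch, "-> ARM64?", is_arm64)
```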
Solution: Use VS Code Dev Containers
Using VS Code Dev Containers allows you to set up a Linux-based environment. To run BERTopic effectively, you need to be aware of two things:
- Make sure to use a Docker image specifically compiled for ARM64
- Make sure to use a `volume` instead of a bind mount, since the latter significantly reduces I/O speeds to disk
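As a sketch of the volume recommendation, a `devcontainer.json` can declare a named volume explicitly (the volume and target names below are illustrative, not prescribed by the images):

```json
{
  "mounts": [
    "source=bertopic-workspace,target=/workspaces,type=volume"
  ]
}
```

Alternatively, the VS Code command "Dev Containers: Clone Repository in Container Volume" places the workspace in a volume without any manual configuration.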
Using https://github.com/b-data/data-science-devcontainers addresses these issues out of the box (kudos to @benz0li). The proposed workflow is as follows:
- Install the `python-base` or `python-scipy` devcontainer
- Open VS Code, build the container
- Work in `/home/vscode` or `/workspaces`, as these locations are persisted. For example, using `pip install --user bertopic` installs all Python packages persistently.
Issues
Using Python 3.12 with the data-science-devcontainer, running the following code
```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Prepare embeddings
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Train BERTopic
topic_model = BERTopic(verbose=True).fit(docs, embeddings)

# Reduce dimensionality of embeddings, this step is optional
reduced_embeddings = UMAP(
    n_neighbors=10, n_components=2, min_dist=0.0, metric="cosine"
).fit_transform(embeddings)

# Run the visualization with the original embeddings
# topic_model.visualize_document_datamap(docs, embeddings=embeddings)

# Or, if you have reduced the original embeddings already:
fig = topic_model.visualize_document_datamap(
    docs, reduced_embeddings=reduced_embeddings
)
```
yields the following error:
```
2024-02-27 09:08:11,438 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-27 09:08:30,791 - BERTopic - Dimensionality - Completed ✓
2024-02-27 09:08:30,792 - BERTopic - Cluster - Start clustering the reduced embeddings
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[the warning above is repeated four more times]
2024-02-27 09:08:31,927 - BERTopic - Cluster - Completed ✓
2024-02-27 09:08:31,933 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-27 09:08:34,449 - BERTopic - Representation - Completed ✓
```
__The Kernel crashed while executing code in the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure.__
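The repeated `huggingface/tokenizers` warnings in the log are separate from the crash itself and can be silenced in the way the warning suggests: set `TOKENIZERS_PARALLELISM` before the tokenizer libraries are imported. A minimal sketch:

```python
import os

# Disable tokenizer parallelism *before* importing sentence_transformers /
# tokenizers; this suppresses the "process just got forked" warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# ... then import and use the libraries as usual, e.g.:
# from sentence_transformers import SentenceTransformer
```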
ℹ️ Change `PYTHON_VERSION` to `3.11` in the `devcontainer.json` to work with the latest patch release of Python 3.11 (currently 3.11.8).
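A devcontainer.json commonly passes the version as a build argument; a minimal sketch of the change (the exact file layout in the b-data devcontainers may differ):

```json
{
  "build": {
    "args": {
      "PYTHON_VERSION": "3.11"
    }
  }
}
```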
> - Work in `/home/vscode` or `/workspaces` as these locations are persisted.

When using an unmodified `devcontainer.json`: Work in `/home/vscode`.
👉 This is the home directory of user `vscode`.

> For example, using `pip install --user bertopic` installs all Python packages persistently.

Python packages are installed to the home directory by default.
👉 This is due to env variable `PIP_USER=1`.
See also
- Data Science Dev Containers > Notes: Tweaks > Environment variables
- https://github.com/b-data/data-science-devcontainers/issues/3#issuecomment-1609457302
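To see where user-level installs end up, a quick sketch (outside the container, `PIP_USER` will typically be unset):

```python
import os
import site

# With PIP_USER=1 set in the container, pip defaults to --user installs,
# which land in the per-user site-packages under the (persisted) home dir.
print("PIP_USER  =", os.environ.get("PIP_USER"))
print("user site =", site.getusersitepackages())
```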
@dkapitan Thank you for sharing your extensive work on this. It is greatly appreciated!
It's unfortunate that these additional steps are necessary in order to get it working nicely but it's great that there is at least a solution for users to follow.
I might be mistaken here, but I believe you mentioned elsewhere that this issue is also likely to appear when using datamapplot. Is that correct?
Indeed, I ran into these issues whilst working on the documentation for datamapplot. I have got it to work now, and will add a section to the FAQ.