BERTopic
Throttling of containers
Hey!
I have a more technical/devops question that maybe some of you have encountered before. I have a k8s cluster with ec2 instances where I run clustering code.

First, I noticed that BERTopic is taking all available CPUs in the container (k8s pod). `umap` and `hdbscan` run fine and don't squeeze many resources, but when I use `mmr` / `diversity`, the loop of `_extract_embeddings` and `mmr` takes too long. Once I disable `mmr`, `_create_topic_vectors` takes forever and squeezes all CPUs. I went through the code and haven't seen any multiprocessing, so any idea what is taking all the resources and whether a limit can be set?
The second challenge is when I run such a job in parallel. There are multiple containers on the same ec2 instance with restricted `cpu_limits`, and suddenly the jobs take 10x longer compared to a single run. My guess is that every k8s pod/container is being throttled because it maxes out all CPU resources, which affects the whole ec2 instance and slows all the jobs down dramatically. Any idea if such a thing could happen?

I feel these two challenges are interconnected. When a job runs alone on the ec2 instance, it takes just a few minutes (`cpu_limits` applied), but when there are multiple containers with the same `cpu_limits`, it suddenly takes 10x longer. The number of documents is in the mid 1000's and memory is not an issue. Any idea how to debug this behaviour or how to find the bottleneck?
Thank you for the extensive description.
> First I have noticed that BERTopic is taking all available CPUs from the container (k8s pod). umap and hdbscan goes well and not squeezing much resources but when I use mmr / diversity then loop of _extract_embeddings and mmr taking too long. Once I disable mmr then _create_topic_vectors is taking forever and squeezing all CPUs. I went through the code and I haven't seen any multiprocessing so any idea what it's taking all resources and if limit could be set?
You could disable `_create_topic_vectors` by calculating the embeddings beforehand and passing them to the fit step: `.fit_transform(docs, embeddings)`. It is important, though, that you do not set any `embedding_model` when instantiating BERTopic. I am surprised, as those functions are generally not bottlenecks. Do you have a GPU enabled?
> Second challenge is when I run such a job in parallel. There are multiple containers on same ec2 with restricted cpu_limits and suddenly the jobs are taking 10x longer compared to single run. My guess is that every k8s pod/container is throttling due to maxing all CPU resources thus it got effect on ec2 and all the jobs slowing down dramatically. Any idea if such a thing could happen?
Unfortunately, I have not used BERTopic before in that specific instance so I cannot give recommendations with respect to your environment. I am not entirely sure if this is BERTopic-related or due to the environment you work in.
This is so interesting: I think I have a similar issue (or maybe not), but the clustering step in HDBSCAN seems to take almost 2x longer on railway.app than on Google Cloud Run when both are using 8 vCPUs and 8 GB of RAM.
I made a benchmark repo to test this here: https://github.com/spookyuser/slow-railway-example
I haven't tried disabling mmr, so maybe that would work. I don't really know what railway.io uses to manage their users' containers, but maybe it's k8s, so it could be related 🤔
Due to inactivity, I'll be closing this for now. If you have any questions or want to continue the discussion, I'll make sure to re-open the issue!