
Throttling of containers

Open · marekargalas opened this issue 1 year ago · 1 comment

Hey!

I have a more technical/DevOps question that maybe some of you have encountered before. I have a k8s cluster with EC2 instances where I run the clustering code.

First, I noticed that BERTopic takes all available CPUs in the container (k8s pod). umap and hdbscan behave well and don't squeeze many resources, but when I use mmr / diversity, the loop over _extract_embeddings and mmr takes too long. Once I disable mmr, _create_topic_vectors takes forever and maxes out all CPUs. I went through the code and haven't seen any multiprocessing, so any idea what is taking all the resources and whether a limit can be set?
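
To illustrate the kind of limit I have in mind, here is a rough sketch. It assumes the CPU saturation comes from the implicit BLAS/OpenMP/numba thread pools used by the underlying libraries rather than from BERTopic itself, and the thread count of 4 is just a placeholder:

```python
import os

# Assumption: the load comes from implicit thread pools, so cap them
# before numpy / numba / scikit-learn are imported anywhere.
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["OPENBLAS_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"
os.environ["NUMBA_NUM_THREADS"] = "4"

from bertopic import BERTopic  # import only after the limits are in place
```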

The second challenge is when I run such a job in parallel. There are multiple containers on the same EC2 instance with restricted cpu_limits, and suddenly the jobs take 10x longer compared to a single run. My guess is that every k8s pod/container is being throttled because it maxes out all CPU resources, which affects the whole EC2 instance and slows all the jobs down dramatically. Any idea whether such a thing could happen?

I feel these two challenges are interconnected. When a job runs alone on the EC2 instance, it takes just a few minutes (with cpu_limits applied), but when there are multiple containers with the same cpu_limits, it suddenly takes 10x longer. The number of documents is in the mid-1000s and memory is not an issue. Any idea how to debug this behaviour or how to find the bottleneck?
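
For the debugging part, one check I can run from inside a pod is reading the cgroup CPU throttling counters; a quick sketch (the paths are assumptions and differ between cgroup v1 and v2):

```python
from pathlib import Path

def read_cpu_stat():
    # cgroup v1 path first, cgroup v2 path as fallback; adjust for your nodes
    for path in ("/sys/fs/cgroup/cpu/cpu.stat", "/sys/fs/cgroup/cpu.stat"):
        p = Path(path)
        if p.exists():
            return dict(line.split() for line in p.read_text().splitlines())
    return {}

stats = read_cpu_stat()
# nr_throttled growing quickly relative to nr_periods would point at CPU throttling
print(stats.get("nr_periods"), stats.get("nr_throttled"))
```

If nr_throttled climbs while the jobs run, that would at least confirm the throttling theory.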

marekargalas · Sep 05 '22 09:09

Thank you for the extensive description.

First, I noticed that BERTopic takes all available CPUs in the container (k8s pod). umap and hdbscan behave well and don't squeeze many resources, but when I use mmr / diversity, the loop over _extract_embeddings and mmr takes too long. Once I disable mmr, _create_topic_vectors takes forever and maxes out all CPUs. I went through the code and haven't seen any multiprocessing, so any idea what is taking all the resources and whether a limit can be set?

You could disable _create_topic_vectors by calculating the embeddings beforehand and passing them to the fit step: .fit_transform(docs, embeddings). It is important, though, that you do not set any embedding_model when instantiating BERTopic. I am surprised, as those functions generally are not bottlenecks. Do you have a GPU enabled?
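
As a minimal sketch of that workaround (the SentenceTransformer model name and the docs list are placeholders here):

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

docs = ["first document", "second document"]  # placeholder for your documents

# Compute the embeddings once, outside of BERTopic
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)

# Do not set an embedding_model; pass the precomputed embeddings to fit_transform
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
```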

The second challenge is when I run such a job in parallel. There are multiple containers on the same EC2 instance with restricted cpu_limits, and suddenly the jobs take 10x longer compared to a single run. My guess is that every k8s pod/container is being throttled because it maxes out all CPU resources, which affects the whole EC2 instance and slows all the jobs down dramatically. Any idea whether such a thing could happen?

Unfortunately, I have not used BERTopic in that specific setting before, so I cannot give recommendations with respect to your environment. I am not entirely sure whether this is BERTopic-related or due to the environment you work in.

MaartenGr · Sep 08 '22 07:09

This is so interesting: I think I have a similar issue, or maybe not, but the clustering step in HDBSCAN seems to take almost 2x longer on railway.app than on Google Cloud Run when both are using 8 vCPUs and 8 GB of RAM.

I made a benchmark repo to test this here https://github.com/spookyuser/slow-railway-example

I haven't tried disabling mmr, so maybe that would work. I don't really know what railway.io uses to manage their users' containers, but maybe it's k8s, so it could be related 🤔

spookyuser · Nov 03 '22 07:11

Due to inactivity, I'll be closing this for now. If you have any questions or want to continue the discussion, I'll make sure to re-open the issue!

MaartenGr · Jan 09 '23 12:01