Clustering takes 40-50 GB of memory for a 1B model
SKMPalettizer uses a C++ function called `cluster_impl`. By my estimate, this function needs 40-50 GB of memory to palettize a 1B model, which makes it almost impossible to run. A comment in the code mentions:
"" // TODO: This step requires O(kn) memory usage due to saving the entire // T matrix. However, it can be modified so that the memory usage is O(n). // D and T would not need to be retained in full (D already doesn't need // to be fully retained, although it currently is). // Details are in section 3 of (Grønlund et al., 2017). ""
Could this be implemented?
Thank you for reaching out. Can you please share the configuration you are using with SKMPalettizer?
Yes, the clustering is done using the kmeans1d C++ library. While we do not maintain that external library, we recently added optimizations to PostTrainingPalettizer, which uses the same kmeans1d backend, that significantly speed up clustering and reduce the memory required.
When the weight dtype is fp16, or the weight dtype is fp32 but the data is within fp16 range, we do three things:
- Round the weights (see the `enable_fast_kmeans_mode` option in `ModulePostTrainingPalettizerConfig` for more info).
- Find the unique set of weights in the rounded output.
- Cluster only the unique weights.
This way we significantly reduce the number of values being clustered (an fp16 tensor has at most ~65k distinct values, regardless of its size), and hence the time and memory cost.
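To illustrate the idea, here is a minimal sketch (my own illustration, not the coremltools implementation; it uses numpy, and plain count-weighted Lloyd iterations stand in for the optimal 1-D dynamic-programming k-means that kmeans1d actually implements):

```python
import numpy as np

def fast_kmeans_palettize(weights, n_clusters, n_iters=30):
    """Sketch: round to fp16, dedupe, then run count-weighted k-means
    on the unique values only."""
    # Step 1: round the weights to fp16 precision.
    rounded = weights.astype(np.float16).astype(np.float32)
    # Step 2: find the unique values and how often each occurs.
    uniq, inverse, counts = np.unique(
        rounded.ravel(), return_inverse=True, return_counts=True
    )
    # Step 3: weighted Lloyd iterations on the unique values. Weighting
    # each unique value by its count keeps the result equivalent to
    # clustering the full tensor.
    centroids = np.quantile(uniq, np.linspace(0, 1, n_clusters))  # init
    for _ in range(n_iters):
        labels = np.abs(uniq[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(n_clusters):
            mask = labels == c
            if counts[mask].sum() > 0:
                centroids[c] = np.average(uniq[mask], weights=counts[mask])
    lut = centroids.astype(np.float32)
    indices = labels[inverse].reshape(weights.shape)
    return lut, indices
```

The clustering input is bounded by the number of distinct fp16 values rather than the tensor size, which is where the time and memory savings come from.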
We are working on exposing the same support for SKMPalettizer as well.
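In the meantime, a rough sketch of what enabling the fast path on PostTrainingPalettizer looks like, following coremltools' documented config pattern (verify the exact option names against your installed version):

```python
import torch
from coremltools.optimize.torch.palettization import (
    PostTrainingPalettizer,
    PostTrainingPalettizerConfig,
)

model = torch.nn.Linear(4096, 4096)  # placeholder; use your own model

# 4-bit per-tensor palettization with the fast k-means path enabled.
config = PostTrainingPalettizerConfig.from_dict(
    {
        "global_config": {
            "n_bits": 4,
            "granularity": "per_tensor",
            "enable_fast_kmeans_mode": True,
        }
    }
)
palettizer = PostTrainingPalettizer(model, config)
palettized_model = palettizer.compress()
```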