Clustering takes 40-50 GB of memory for a 1B model
SKMPalettizer uses a C++ function called `cluster_impl`. By my estimate, this function needs 40-50 GB of memory to palettize a 1B model, which makes it almost impossible to run. A comment in the code mentions:
"" // TODO: This step requires O(kn) memory usage due to saving the entire // T matrix. However, it can be modified so that the memory usage is O(n). // D and T would not need to be retained in full (D already doesn't need // to be fully retained, although it currently is). // Details are in section 3 of (Grønlund et al., 2017). ""
Could this be implemented?
Thank you for reaching out. Can you please share the configuration you are using with SKMPalettizer?
Yes, the clustering is done using the kmeans1d C++ library. While we do not maintain that external library, we recently added optimizations to PostTrainingPalettizer, which uses the same kmeans1d backend, that significantly speed up clustering and reduce the memory required.
When the weight dtype is fp16, or the weight dtype is fp32 but the data is within fp16 range, we do three things:
- Round the weights (see the `enable_fast_kmeans_mode` option in `ModulePostTrainingPalettizerConfig` for more info).
- Find the unique set of weights in the rounded output.
- Cluster only the unique weights.
This way we significantly reduce the number of values being clustered (an fp16 tensor has at most ~65k distinct values, regardless of its size), and hence the time and memory cost.
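To illustrate the idea, here is a minimal sketch (my own illustration, not the coremltools implementation; it uses numpy, and plain count-weighted Lloyd iterations stand in for the optimal 1-D dynamic-programming k-means that kmeans1d actually implements):

```python
import numpy as np

def fast_kmeans_palettize(weights, n_clusters, n_iters=30):
    """Sketch: round to fp16, dedupe, then run count-weighted k-means
    on the unique values only."""
    # Step 1: round the weights to fp16 precision.
    rounded = weights.astype(np.float16).astype(np.float32)
    # Step 2: find the unique values and how often each occurs.
    uniq, inverse, counts = np.unique(
        rounded.ravel(), return_inverse=True, return_counts=True
    )
    # Step 3: weighted Lloyd iterations on the unique values. Weighting
    # each unique value by its count keeps the result equivalent to
    # clustering the full tensor.
    centroids = np.quantile(uniq, np.linspace(0, 1, n_clusters))  # init
    for _ in range(n_iters):
        labels = np.abs(uniq[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(n_clusters):
            mask = labels == c
            if counts[mask].sum() > 0:
                centroids[c] = np.average(uniq[mask], weights=counts[mask])
    lut = centroids.astype(np.float32)
    indices = labels[inverse].reshape(weights.shape)
    return lut, indices
```

The clustering input is bounded by the number of distinct fp16 values rather than the tensor size, which is where the time and memory savings come from.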
We are working on exposing the same support for SKMPalettizer as well.
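In the meantime, a rough sketch of what enabling the fast path on PostTrainingPalettizer looks like, following coremltools' documented config pattern (verify the exact option names against your installed version):

```python
import torch
from coremltools.optimize.torch.palettization import (
    PostTrainingPalettizer,
    PostTrainingPalettizerConfig,
)

model = torch.nn.Linear(4096, 4096)  # placeholder; use your own model

# 4-bit per-tensor palettization with the fast k-means path enabled.
config = PostTrainingPalettizerConfig.from_dict(
    {
        "global_config": {
            "n_bits": 4,
            "granularity": "per_tensor",
            "enable_fast_kmeans_mode": True,
        }
    }
)
palettizer = PostTrainingPalettizer(model, config)
palettized_model = palettizer.compress()
```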