[Distributed] Support loading from single checkpoint binary
🚀 The feature, motivation and pitch
This aligns distributed's load behavior with the single-device case.
Today, distributed relies on an index file containing a param->bin
mapping to limit the number of bins each process needs to open. However, not all checkpoint styles come with an index file.
To avoid every process opening a large bin and OOMing the CPU, we can use `torch.load(mmap=True)`. Although each process would map its own virtual address space onto the file, the OS would load only one copy of the faulted pages into physical memory; that copy is shared across processes, and each process can then move its portion to the corresponding device memory.
Alternatives
No response
Additional context
cc: @lessw2020 @mikaylagawarecki