DeepSpeedExamples
[DeepSpeed-Chat] Fix OOM issue in dataloader
Currently, DeepSpeed-Chat saves tokenized tensors directly to disk, which can take hundreds of GB. Each string is converted into input_ids and attention_mask tensors padded to max_seq_len and stored as int32 or int64.
If we assume about 2~3 characters per token, the tokenized tensors for a string take on average hundreds of bytes of storage, far more than the string itself. This becomes a serious problem as the prompt dataset grows: for a dataset of, say, 1GB, the on-disk tokenized dataset can reach hundreds of GB.
What's worse, DeepSpeed-Chat then loads all of this data into memory, which can require hundreds of GB of RAM.
In my experience, a 1.1GB prompt dataset causes an OOM on a machine with 512GB of memory, even with max_seq_len set to only 512. With max_seq_len set to 2048, it would need four times as much memory, i.e. about 2TB :(
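To make the scaling concrete, here is a rough back-of-the-envelope sketch. It assumes each sample stores two tensors (input_ids and attention_mask), both padded to max_seq_len and stored as int64; the function name and defaults are illustrative only and are not taken from the DeepSpeed-Chat code.

```python
def padded_sample_bytes(max_seq_len: int,
                        bytes_per_element: int = 8,
                        tensors_per_sample: int = 2) -> int:
    """Bytes used by one sample whose tensors are padded to max_seq_len.

    Assumptions (not from the DeepSpeed-Chat source): int64 elements
    (8 bytes each) and two tensors per sample (input_ids, attention_mask).
    """
    return max_seq_len * bytes_per_element * tensors_per_sample


# The cost depends only on max_seq_len, not on how short the original text is,
# so it grows linearly with max_seq_len.
for seq_len in (512, 2048):
    print(f"max_seq_len={seq_len}: {padded_sample_bytes(seq_len)} bytes per sample")
```

Because every sample is padded to the same length, going from 512 to 2048 multiplies the footprint by four regardless of how short the prompts are.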
This PR saves only the raw strings and tokenizes them on the fly. The saved data is about the same size as the input dataset.
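The idea can be sketched as a map-style dataset that keeps only strings and tokenizes in `__getitem__`. This is a minimal illustration, not the code in this PR: `LazyPromptDataset` and `toy_tokenize` are hypothetical names, and the toy tokenizer stands in for a real one.

```python
from typing import Callable, Dict, List, Sequence


def toy_tokenize(text: str, max_seq_len: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Hypothetical stand-in for a real tokenizer: one id per whitespace-separated word."""
    # Derive a non-zero id per word so that pad_id (0) is never a real token.
    ids = [(sum(word.encode("utf-8")) % 1000) + 1 for word in text.split()]
    ids = ids[:max_seq_len]  # truncate to max_seq_len
    padding = max_seq_len - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,
    }


class LazyPromptDataset:
    """Stores raw strings only; tokenization happens when a sample is requested."""

    def __init__(self,
                 prompts: Sequence[str],
                 tokenize: Callable[[str, int], Dict[str, List[int]]],
                 max_seq_len: int) -> None:
        self.prompts = prompts
        self.tokenize = tokenize
        self.max_seq_len = max_seq_len

    def __len__(self) -> int:
        return len(self.prompts)

    def __getitem__(self, index: int) -> Dict[str, List[int]]:
        # Padded tensors exist only for the samples currently in use,
        # so resident memory stays close to the size of the raw text.
        return self.tokenize(self.prompts[index], self.max_seq_len)


dataset = LazyPromptDataset(["hello world", "a b c"], toy_tokenize, max_seq_len=4)
print(len(dataset), dataset[0]["attention_mask"])
```

The trade-off is that each sample is re-tokenized every time it is read, exchanging some CPU time per batch for a much smaller disk and memory footprint.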
@microsoft-github-policy-service agree
Hi team, any feedback on this? 👀