[DeepSpeed-Chat] Fix OOM issue in dataloader

Open · youkaichao opened this pull request 1 year ago • 4 comments

Currently, DeepSpeed-Chat saves tokenized tensors directly to disk, which consumes hundreds of GB of storage. Each string is converted into attention_mask and input_ids tensors of length max_seq_len, stored as int32 or int64.

If we assume about 2–3 characters per token, each byte of raw text expands to hundreds of bytes of tokenized storage on average, largely because every example is padded out to max_seq_len. This is very problematic: when the prompt dataset grows (say to 1 GB), the on-disk dataset can reach hundreds of GB.

What's worse, DeepSpeed-Chat then loads all of this data into memory, which can require hundreds of GB of RAM.

In my personal experience, a 1.1 GB prompt dataset causes an OOM on a 512 GB machine, even with max_seq_len of only 512. If I wanted to use 2048 as max_seq_len, that would require four times as much memory, i.e. about 2 TB :(
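
As a back-of-envelope illustration (assuming two int64 tensors per example, as described above; the numbers are mine, not from the diff):

```python
# Rough estimate of the padded-tensor blow-up (illustrative only).
max_seq_len = 512          # tokens per example after padding
bytes_per_elem = 8         # int64
tensors_per_example = 2    # input_ids + attention_mask

bytes_per_example = max_seq_len * bytes_per_elem * tensors_per_example
print(bytes_per_example)   # 8192 bytes, regardless of the actual prompt length

# A 20-char prompt (~8 tokens at 2-3 chars/token) thus expands ~400x on disk;
# the cost scales linearly with max_seq_len, hence 2048 -> 4x the memory of 512.
```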

This PR saves only the raw strings and tokenizes them on the fly. The saved data is about the same size as the input dataset.
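
A minimal sketch of the on-the-fly idea (class and parameter names here are illustrative, not the actual diff in this PR):

```python
from torch.utils.data import Dataset

class OnTheFlyPromptDataset(Dataset):
    """Stores raw prompt strings and tokenizes lazily in __getitem__,
    so stored data stays proportional to the raw text size."""

    def __init__(self, prompts, tokenizer, max_seq_len=512):
        self.prompts = prompts          # plain Python strings
        self.tokenizer = tokenizer      # e.g. a HuggingFace tokenizer
        self.max_seq_len = max_seq_len

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        # Tokenize only when the sample is actually requested.
        enc = self.tokenizer(
            self.prompts[idx],
            max_length=self.max_seq_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
        }

# Usage (assuming a HuggingFace tokenizer):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
#   ds = OnTheFlyPromptDataset(["Hello world"], tok)
#   sample = ds[0]  # tensors are created only at access time
```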

youkaichao avatar Jan 01 '24 07:01 youkaichao

@microsoft-github-policy-service agree

youkaichao avatar Jan 01 '24 07:01 youkaichao

Hi team, any feedback on this? 👀

youkaichao avatar Jan 03 '24 12:01 youkaichao