DeepSpeedExamples
create_dataset_split function: when the dataset is large, this function can run out of memory. In that case, it should use the map function from the datasets library instead.
Below is the original code:
https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/utils/data/data_utils.py#L157
In my experiments, it runs out of memory (OOM) when the dataset size is 500000.