NeMo-Curator
Running into OOM with add id
Describe the bug
Running the AddId module of Curator runs into OOMs even with a small batch size, e.g., 32. The dataset being processed is a single snapshot of the Red Pajama v2 dataset, which is about 4 TB in size. The job was run on 10 CPU nodes, each with 96 cores and 176 GB of memory.
```
Jun 25 13:48:18.323459 942129 slurmstepd 0x155552de2d40: error: Detected 7 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00046: task 5: Out Of Memory
srun: Terminating StepId=1127164.0
Jun 25 13:48:19.899767 2590557 slurmstepd 0x155552de2d40: error: Detected 1 oom_kill event in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00017: task 2: Terminated
srun: error: cpu-00038: task 3: Terminated
srun: error: cpu-00050: task 6: Terminated
Jun 25 13:48:20.991455 2567860 slurmstepd 0x155552de2d40: error: Detected 2 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: Force Terminated StepId=1127164.0
```
Some observations:
- Memory usage is extremely unbalanced across the nodes (per-node `free` output, values in GB; see the memory-polling sketch after this list):
```
cpu-00009  total  used  free  shared  buff/cache  available
Mem:         176    14   144       0          18        160
cpu-00042  total  used  free  shared  buff/cache  available
Mem:         176    79    88       0           8         94
cpu-00046  total  used  free  shared  buff/cache  available
Mem:         176   113    61       0           2         61
cpu-00082  total  used  free  shared  buff/cache  available
Mem:         176    13   145       0          17        160
cpu-00050  total  used  free  shared  buff/cache  available
Mem:         176    74    78       0          23         99
cpu-00019  total  used  free  shared  buff/cache  available
Mem:         176    72    38       0          65        101
cpu-00087  total  used  free  shared  buff/cache  available
Mem:         176    55   106       0          15        119
cpu-00086  total  used  free  shared  buff/cache  available
Mem:         176    90    80       0           6         84
cpu-00020  total  used  free  shared  buff/cache  available
Mem:         176    36   101       0          39        138
cpu-00002  total  used  free  shared  buff/cache  available
Mem:         176   156     2       0          17         18
```
- Some nodes have very little memory left while others show almost no memory usage at all (available memory shown in the last column). Repeated samples for cpu-00002 vs. cpu-00082:
```
$ grep -A 1 'cpu-00002' ./log.txt | grep 'Mem:'
Mem: 176 85 64 0 26 88
Mem: 176 171 3 0 1 3
Mem: 176 141 28 0 6 32
Mem: 176 113 45 0 17 60
Mem: 176 113 45 0 17 60
Mem: 176 113 45 0 17 60
```

```
$ grep -A 1 'cpu-00082' ./log.txt | grep 'Mem:'
Mem: 176 13 145 0 17 160
Mem: 176 13 145 0 17 160
Mem: 176 13 145 0 17 160
Mem: 176 13 145 0 17 160
Mem: 176 13 145 0 17 160
Mem: 176 13 145 0 17 160
Mem: 176 13 145 0 17 160
Mem: 176 13 144 0 17 160
Mem: 176 14 144 0 17 160
Mem: 176 14 144 0 17 160
Mem: 176 14 144 0 17 160
```
- CPU utilization is very low.
- Setting the `start_index` argument slows down the code (see the AddId sketch below).
- IO speed decreases over time.
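
The per-node numbers above were collected by running `free` on each node by hand. As a minimal sketch of polling the same numbers from one place, assuming the job is driven by a `dask.distributed` cluster (the scheduler address and the use of `psutil` are assumptions, not from this report):

```python
import psutil
from dask.distributed import Client

def worker_memory_gb():
    # Runs on a worker; reports used/available memory in GB.
    vm = psutil.virtual_memory()
    return vm.used / 2**30, vm.available / 2**30

client = Client("tcp://cpu-00009:8786")  # assumption: scheduler host/port
# client.run executes the function on every worker and returns a
# {worker_address: result} mapping, so the imbalance across nodes is
# visible at a glance without ssh-ing into each one.
for worker, (used, avail) in sorted(client.run(worker_memory_gb).items()):
    print(f"{worker}: used={used:.0f} GB, available={avail:.0f} GB")
```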
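
On the `start_index` slowdown, a sketch of the two ways AddId can be invoked; the explanation in the comments is a guess at the cause, not something confirmed in this report:

```python
from nemo_curator import AddId

# Default: start_index=None, so IDs can be derived per partition with no
# global coordination.
add_id_fast = AddId(id_field="id", id_prefix="rpv2-0")

# With a start index, sequential IDs plausibly require each partition to
# know how many rows precede it (an extra pass over the data), which would
# explain the observed slowdown.
add_id_ordered = AddId(id_field="id", id_prefix="rpv2-0", start_index=0)
```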
Steps/Code to reproduce bug
```python
from nemo_curator import AddId
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_batched_files

# data_path and id_data_path are defined elsewhere (input and output dirs).
batch_index = 0
for files in get_batched_files(data_path, id_data_path, "jsonl", batch_size=128):
    dataset = DocumentDataset.read_json(files, add_filename=True)
    print("Done reading dataset")
    add_id = AddId(
        id_field="id",
        id_prefix=f"rpv2-{batch_index}",
    )
    print("Start adding id")
    id_dataset = add_id(dataset)
    print("Done adding id")
    id_dataset.to_json(id_data_path, write_to_filename=True)
    batch_index += 1
```
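
Not part of the original report, but one knob that may even out per-worker memory: `DocumentDataset.read_json` accepts a `files_per_partition` argument, so each Dask partition materializes a bounded number of input files. The value below is a guess and this is untested against this dataset:

```python
# Untested mitigation sketch: smaller partitions should lower the peak
# memory any single worker needs while adding IDs.
dataset = DocumentDataset.read_json(
    files,
    add_filename=True,
    files_per_partition=1,  # assumption: one file per partition
)
```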