
Running into OOM with add id

Open yyu22 opened this issue 1 year ago • 2 comments

Describe the bug

Running the AddId module of Curator runs into OOMs even with a small batch size, e.g., 32. The dataset being processed is a single snapshot of the RedPajama-V2 dataset, which is about 4 TB in size. The job was run on 10 CPU nodes, each with 96 cores and 176 GB of memory.

Jun 25 13:48:18.323459 942129 slurmstepd   0x155552de2d40: error: Detected 7 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00046: task 5: Out Of Memory
srun: Terminating StepId=1127164.0
Jun 25 13:48:19.899767 2590557 slurmstepd   0x155552de2d40: error: Detected 1 oom_kill event in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00017: task 2: Terminated
srun: error: cpu-00038: task 3: Terminated
srun: error: cpu-00050: task 6: Terminated
Jun 25 13:48:20.991455 2567860 slurmstepd   0x155552de2d40: error: Detected 2 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: Force Terminated StepId=1127164.0

Some observations:

  • Memory usage is extremely unbalanced across the nodes:
cpu-00009              total        used        free      shared  buff/cache   available
Mem:            176          14         144           0          18         160
cpu-00042              total        used        free      shared  buff/cache   available
Mem:            176          79          88           0           8          94
cpu-00046              total        used        free      shared  buff/cache   available
Mem:            176         113          61           0           2          61
cpu-00082              total        used        free      shared  buff/cache   available
Mem:            176          13         145           0          17         160
cpu-00050              total        used        free      shared  buff/cache   available
Mem:            176          74          78           0          23          99
cpu-00019              total        used        free      shared  buff/cache   available
Mem:            176          72          38           0          65         101
cpu-00087              total        used        free      shared  buff/cache   available
Mem:            176          55         106           0          15         119
cpu-00086              total        used        free      shared  buff/cache   available
Mem:            176          90          80           0           6          84
cpu-00020              total        used        free      shared  buff/cache   available
Mem:            176          36         101           0          39         138
cpu-00002              total        used        free      shared  buff/cache   available
Mem:            176         156           2           0          17          18
  • Some nodes have very little memory left, while others show almost no memory usage at all (available memory is shown in the last column):
$ grep -A 1 'cpu-00002' ./log.txt  | grep 'Mem:'
Mem:            176          85          64           0          26          88
Mem:            176         171           3           0           1           3
Mem:            176         141          28           0           6          32
Mem:            176         113          45           0          17          60
Mem:            176         113          45           0          17          60
Mem:            176         113          45           0          17          60

$ grep -A 1 'cpu-00082' ./log.txt  | grep 'Mem:'
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         144           0          17         160
Mem:            176          14         144           0          17         160
Mem:            176          14         144           0          17         160
Mem:            176          14         144           0          17         160

  • CPU utilization is very low.

  • Setting the start-index argument slows the code down.

  • IO speed decreases over time.
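The imbalance in the first observation can be quantified by parsing the `free -g` snapshots. A minimal sketch, using three of the node readings shown above (the parsing helper is illustrative and not part of NeMo-Curator):

```python
# Quantify memory imbalance from `free -g` style snapshots.
# Values below are copied from the per-node readings in this report.
snapshots = {
    "cpu-00009": "Mem:            176          14         144           0          18         160",
    "cpu-00046": "Mem:            176         113          61           0           2          61",
    "cpu-00002": "Mem:            176         156           2           0          17          18",
}

def available_gb(mem_line: str) -> int:
    # `free -g` columns: total, used, free, shared, buff/cache, available;
    # the last column is the memory actually available to new allocations.
    return int(mem_line.split()[-1])

avail = {node: available_gb(line) for node, line in snapshots.items()}
spread = max(avail.values()) - min(avail.values())
print(avail)   # per-node available memory in GB
print(spread)  # → 142, a ~142 GB gap between the idlest and busiest node
```

A spread this large on nominally identical nodes suggests work is not being distributed evenly across the Dask workers.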

Steps/Code to reproduce bug

    # Imports assumed from NeMo-Curator (paths as of mid-2024):
    from nemo_curator import AddId
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.utils.file_utils import get_batched_files

    batch_index = 0
    # Process the corpus in batches of 128 JSONL files at a time
    for files in get_batched_files(data_path, id_data_path, "jsonl", batch_size=128):
        dataset = DocumentDataset.read_json(files, add_filename=True)
        print("Done reading dataset")
        add_id = AddId(
            id_field="id",
            id_prefix=f"rpv2-{batch_index}",
        )
        print("Start adding id")
        id_dataset = add_id(dataset)
        print("Done adding id")
        id_dataset.to_json(id_data_path, write_to_filename=True)
        batch_index += 1
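For comparison, attaching IDs to a single JSONL file can be done in a constant-memory streaming fashion with plain Python, one line at a time. This is only a sketch of the expected memory behavior, not Curator's AddId (the function name and the `<prefix>-<line number>` ID scheme are illustrative; Curator's global sequential IDs across files are what make a distributed pass necessary):

```python
import json

def add_ids_streaming(in_path: str, out_path: str, prefix: str) -> int:
    """Stream one JSONL file, attaching id = '<prefix>-<line number>' to each record.

    Only one line is held in memory at a time, so peak memory stays flat
    regardless of file size. Illustrative sketch, not NeMo-Curator's AddId.
    """
    count = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for i, line in enumerate(fin):
            doc = json.loads(line)
            doc["id"] = f"{prefix}-{i}"
            fout.write(json.dumps(doc) + "\n")
            count += 1
    return count
```

If a batch of 128 files needs far more memory than the files themselves occupy, the overhead is coming from how the distributed dataframe partitions and shuffles the data, not from the ID computation itself.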

yyu22 · Jul 08 '24 16:07