Python stack trace while executing datagen and run

Open harisphnx opened this issue 1 year ago • 0 comments

I get the following stack trace while executing datagen. After the errors, datagen continues normally, but it does not finish quickly: I left it running overnight, and by morning it had finished.

[INFO] 2024-09-26T23:13:56.862045 Starting data generation [/root/storage/dlio_benchmark/dlio_benchmark/main.py:157]
[INFO] 2024-09-26T23:13:56.862436 Generating dataset in unet3d_data/train and unet3d_data/valid [/root/storage/dlio_benchmark/dlio_benchmark/data_generator/data_generator.py:77]
[INFO] 2024-09-26T23:13:56.862501 Number of files for training dataset: 7000 [/root/storage/dlio_benchmark/dlio_benchmark/data_generator/data_generator.py:78]
[INFO] 2024-09-26T23:13:56.862548 Number of files for validation dataset: 0 [/root/storage/dlio_benchmark/dlio_benchmark/data_generator/data_generator.py:79]
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
    benchmark.initialize()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
    self.data_generator.generate()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
    np.savez(out_path_spec, x=records, y=record_labels)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
    _savez(file, args, kwds, False)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
    zipf = zipfile_factory(file, mode="w", compression=compression)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
    return zipfile.ZipFile(file, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
    self.fp = io.open(file, filemode)
              ^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0001_of_7000.npz'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
    benchmark.initialize()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
    self.data_generator.generate()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
    np.savez(out_path_spec, x=records, y=record_labels)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
    _savez(file, args, kwds, False)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
    zipf = zipfile_factory(file, mode="w", compression=compression)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
    return zipfile.ZipFile(file, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
    self.fp = io.open(file, filemode)
              ^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0003_of_7000.npz'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
    benchmark.initialize()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
    self.data_generator.generate()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
    np.savez(out_path_spec, x=records, y=record_labels)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
    _savez(file, args, kwds, False)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
    zipf = zipfile_factory(file, mode="w", compression=compression)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
    return zipfile.ZipFile(file, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
    self.fp = io.open(file, filemode)
              ^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0007_of_7000.npz'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
    benchmark.initialize()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
    self.data_generator.generate()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
    np.savez(out_path_spec, x=records, y=record_labels)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
    _savez(file, args, kwds, False)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
    zipf = zipfile_factory(file, mode="w", compression=compression)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
    return zipfile.ZipFile(file, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
    self.fp = io.open(file, filemode)
              ^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0005_of_7000.npz'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[INFO] 2024-09-26T23:13:57.200507 Generating NPZ Data: [>------------------------------------------------------------] 0.0% 1 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:57.675216 Generating NPZ Data: [>------------------------------------------------------------] 0.1% 9 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:58.043502 Generating NPZ Data: [>------------------------------------------------------------] 0.2% 17 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:58.625244 Generating NPZ Data: [>------------------------------------------------------------] 0.4% 25 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:59.369874 Generating NPZ Data: [>------------------------------------------------------------] 0.5% 33 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:00.568655 Generating NPZ Data: [>------------------------------------------------------------] 0.6% 41 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:02.320880 Generating NPZ Data: [>------------------------------------------------------------] 0.7% 49 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:03.230448 Generating NPZ Data: [>------------------------------------------------------------] 0.8% 57 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:04.555247 Generating NPZ Data: [=>-----------------------------------------------------------] 0.9% 65 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:06.220732 Generating NPZ Data: [=>-----------------------------------------------------------] 1.0% 73 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
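
The tracebacks above all end in `zipfile.ZipFile` failing to open a file under `./unet3d_data/train/`, which is the error Python raises when the parent directory does not exist at the moment of the write. One possible explanation, which is only a guess and not confirmed by the logs, is that some processes tried to write before the directory had been created. The sketch below reproduces that failure mode with the standard library alone and shows a defensive workaround; the helper name `write_archive_creating_parent` is hypothetical and is not part of dlio_benchmark.

```python
import os
import tempfile
import zipfile


def write_archive(path):
    """Open a zip archive for writing, as np.savez does internally per the traceback."""
    with zipfile.ZipFile(path, mode="w") as zf:
        zf.writestr("x.txt", "placeholder payload")


def write_archive_creating_parent(path):
    """Hypothetical helper: create the parent directory before writing."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes this safe if several processes create it at once.
        os.makedirs(parent, exist_ok=True)
    write_archive(path)


def demo():
    """Return (reproduced_error, saved_ok) for a path whose parent is missing."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "unet3d_data", "train", "img_0001_of_7000.npz")
        try:
            write_archive(target)  # parent directory does not exist yet
            reproduced_error = False
        except FileNotFoundError:
            reproduced_error = True
        write_archive_creating_parent(target)
        saved_ok = os.path.isfile(target)
    return reproduced_error, saved_ok


if __name__ == "__main__":
    print(demo())
```

If this is the cause, creating `unet3d_data/train` by hand before starting datagen would be a quick way to test the hypothesis.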

Later, when I execute run, it fails.

[INFO] 2024-09-27T08:00:05.546065 Profiling DLIO /root/storage/resultsdir/trace-0-of-2.pfw [/root/storage/dlio_benchmark/dlio_benchmark/utils/config.py:189]
[INFO] 2024-09-27T08:00:05.546386 Running DLIO with 2 process(es) [/root/storage/dlio_benchmark/dlio_benchmark/main.py:100]
[INFO] Total amount of data each host will consume is 477.86366008222103 GB; each host has [9.694530487060547, 9.694538116455078] GB memory [/root/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:121]
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
    benchmark.initialize()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 179, in initialize
    filenames = self.storage.walk_node(os.path.join(self.args.data_folder, f"{dataset_type}"))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/storage/file_storage.py", line 75, in walk_node
    return os.listdir(self.get_uri(id))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
    benchmark.initialize()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 203, in initialize
    raise Exception(
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
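
The run fails because `./unet3d_data/train` is resolved relative to the current working directory, and `os.listdir` cannot find it there. Whether the run was started from a different directory or host than datagen is an assumption; the logs do not say. A minimal sketch for checking what a process would actually see before launching the run (the function name `count_dataset_files` is hypothetical, not part of dlio_benchmark):

```python
import os
import tempfile


def count_dataset_files(data_folder, dataset_type="train", suffix=".npz"):
    """Return (absolute_dir, count); count is -1 when the directory is missing."""
    directory = os.path.abspath(os.path.join(data_folder, dataset_type))
    if not os.path.isdir(directory):
        return directory, -1
    count = sum(1 for name in os.listdir(directory) if name.endswith(suffix))
    return directory, count


def demo():
    """Exercise the check against a missing and a populated directory."""
    with tempfile.TemporaryDirectory() as tmp:
        data_folder = os.path.join(tmp, "unet3d_data")
        _, missing = count_dataset_files(data_folder)
        train_dir = os.path.join(data_folder, "train")
        os.makedirs(train_dir)
        for index in range(3):
            # Empty placeholder files stand in for generated samples.
            open(os.path.join(train_dir, f"img_{index}.npz"), "wb").close()
        _, found = count_dataset_files(data_folder)
    return missing, found


if __name__ == "__main__":
    # Prints the absolute path this process resolves, plus the file count.
    print(count_dataset_files("unet3d_data"))
```

Running this from the same directory, and on the same hosts, as the benchmark shows the absolute path being searched and how many `.npz` files are present there.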

harisphnx, Sep 27 '24 08:09