Python stack trace while executing datagen and run
I get the following stack trace while executing datagen. After the trace, datagen appears to continue normally, but it did not finish while I was watching. I left it running overnight, and by morning it had finished.
[INFO] 2024-09-26T23:13:56.862045 Starting data generation [/root/storage/dlio_benchmark/dlio_benchmark/main.py:157]
[INFO] 2024-09-26T23:13:56.862436 Generating dataset in unet3d_data/train and unet3d_data/valid [/root/storage/dlio_benchmark/dlio_benchmark/data_generator/data_generator.py:77]
[INFO] 2024-09-26T23:13:56.862501 Number of files for training dataset: 7000 [/root/storage/dlio_benchmark/dlio_benchmark/data_generator/data_generator.py:78]
[INFO] 2024-09-26T23:13:56.862548 Number of files for validation dataset: 0 [/root/storage/dlio_benchmark/dlio_benchmark/data_generator/data_generator.py:79]
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
benchmark.initialize()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
self.data_generator.generate()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
np.savez(out_path_spec, x=records, y=record_labels)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
_savez(file, args, kwds, False)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
zipf = zipfile_factory(file, mode="w", compression=compression)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
return zipfile.ZipFile(file, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
self.fp = io.open(file, filemode)
^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0001_of_7000.npz'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
benchmark.initialize()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
self.data_generator.generate()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
np.savez(out_path_spec, x=records, y=record_labels)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
_savez(file, args, kwds, False)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
zipf = zipfile_factory(file, mode="w", compression=compression)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
return zipfile.ZipFile(file, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
self.fp = io.open(file, filemode)
^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0003_of_7000.npz'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
benchmark.initialize()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
self.data_generator.generate()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
np.savez(out_path_spec, x=records, y=record_labels)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
_savez(file, args, kwds, False)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
zipf = zipfile_factory(file, mode="w", compression=compression)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
return zipfile.ZipFile(file, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
self.fp = io.open(file, filemode)
^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0007_of_7000.npz'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
benchmark.initialize()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
self.data_generator.generate()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
np.savez(out_path_spec, x=records, y=record_labels)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
_savez(file, args, kwds, False)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
zipf = zipfile_factory(file, mode="w", compression=compression)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
return zipfile.ZipFile(file, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
self.fp = io.open(file, filemode)
^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0005_of_7000.npz'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[INFO] 2024-09-26T23:13:57.200507 Generating NPZ Data: [>------------------------------------------------------------] 0.0% 1 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:57.675216 Generating NPZ Data: [>------------------------------------------------------------] 0.1% 9 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:58.043502 Generating NPZ Data: [>------------------------------------------------------------] 0.2% 17 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:58.625244 Generating NPZ Data: [>------------------------------------------------------------] 0.4% 25 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:59.369874 Generating NPZ Data: [>------------------------------------------------------------] 0.5% 33 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:00.568655 Generating NPZ Data: [>------------------------------------------------------------] 0.6% 41 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:02.320880 Generating NPZ Data: [>------------------------------------------------------------] 0.7% 49 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:03.230448 Generating NPZ Data: [>------------------------------------------------------------] 0.8% 57 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:04.555247 Generating NPZ Data: [=>-----------------------------------------------------------] 0.9% 65 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:06.220732 Generating NPZ Data: [=>-----------------------------------------------------------] 1.0% 73 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
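All four tracebacks above end in zipfile.ZipFile failing to open a file under ./unet3d_data/train. That is consistent with the target directory not existing at the moment of the write, since opening a file for writing does not create missing parent directories. The sketch below reproduces that behavior with the standard library only. It is not DLIO's code; the helper name write_archive and the temporary paths are hypothetical and only mirror the paths in the log.

```python
import os
import tempfile
import zipfile


def write_archive(path, ensure_parent):
    """Write a small zip archive to `path` (hypothetical helper, not part of DLIO)."""
    if ensure_parent:
        # Create the parent directory first; exist_ok avoids an error
        # if another process has already created it.
        os.makedirs(os.path.dirname(path), exist_ok=True)
    with zipfile.ZipFile(path, mode="w") as zf:
        zf.writestr("x.txt", "placeholder")


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "unet3d_data", "train", "img_0001_of_7000.npz")

    # Without the parent directory, opening the archive fails.
    try:
        write_archive(target, ensure_parent=False)
        failed = False
    except FileNotFoundError:
        failed = True

    # With the parent directory created first, the write succeeds.
    write_archive(target, ensure_parent=True)
    created = os.path.exists(target)

print("failed without parent dir:", failed)
print("created with parent dir:", created)
```

Whether this is what actually happens inside the generator (for example, some processes writing before the directory has been created) is an assumption that the log alone does not confirm.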
Later, when I execute run, it fails:
[INFO] 2024-09-27T08:00:05.546065 Profiling DLIO /root/storage/resultsdir/trace-0-of-2.pfw [/root/storage/dlio_benchmark/dlio_benchmark/utils/config.py:189]
[INFO] 2024-09-27T08:00:05.546386 Running DLIO with 2 process(es) [/root/storage/dlio_benchmark/dlio_benchmark/main.py:100]
[INFO] Total amount of data each host will consume is 477.86366008222103 GB; each host has [9.694530487060547, 9.694538116455078] GB memory [/root/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:121]
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
benchmark.initialize()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 179, in initialize
filenames = self.storage.walk_node(os.path.join(self.args.data_folder, f"{dataset_type}"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/storage/file_storage.py", line 75, in walk_node
return os.listdir(self.get_uri(id))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
benchmark.initialize()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 203, in initialize
raise Exception(
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
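Both the datagen and the run errors refer to the relative path ./unet3d_data/train, so the result depends on the working directory each command is started from. A small check along the following lines can show whether the directory is visible from the current working directory and how many .npz files it holds. This is only a sketch: count_npz_files is a hypothetical helper, not a DLIO function, and the folder name is taken from the overrides in the log.

```python
import os


def count_npz_files(data_folder, dataset_type="train"):
    """Count .npz files in <data_folder>/<dataset_type>.

    Returns None when the directory does not exist (hypothetical helper).
    """
    path = os.path.join(data_folder, dataset_type)
    if not os.path.isdir(path):
        return None
    return sum(1 for name in os.listdir(path) if name.endswith(".npz"))


if __name__ == "__main__":
    # "unet3d_data" matches ++workload.dataset.data_folder in the log above.
    found = count_npz_files("unet3d_data")
    print("working directory:", os.getcwd())
    if found is None:
        print("unet3d_data/train does not exist relative to this directory")
    else:
        print("npz files found:", found, "(expected 7000)")
```

Running it from the directory where datagen was started and again from the directory where run is started would show whether the two commands resolve the relative path to the same location.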