
Failure to wait for the data_object creation

Open SNMS95 opened this issue 8 months ago • 3 comments

When I run a design of experiments (DOE) with f3dasm, a few nodes occasionally fail with the following error and exit.

Error executing job with overrides: ['++hpc.jobid=4', 'hp_tune.model=baseline', 'hp_tune.model_seed=-1']
Traceback (most recent call last):
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/experimentdata.py", line 271, in _from_file_attempt
    domain = Domain.from_file(Path(f"{filename}_domain"))
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/domain.py", line 71, in from_file
    raise FileNotFoundError(f"Domain file {filename} does not exist.")
FileNotFoundError: Domain file exp_data_baseline_domain does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/experimentdata.py", line 145, in from_file
    return cls._from_file_attempt(filename, text_io)
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/experimentdata.py", line 283, in _from_file_attempt
    raise FileNotFoundError(f"Cannot find the file {filename}_data.csv.")
FileNotFoundError: Cannot find the file exp_data_baseline_data.csv.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/experimentdata.py", line 271, in _from_file_attempt
    domain = Domain.from_file(Path(f"{filename}_domain"))
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/domain.py", line 71, in from_file
    raise FileNotFoundError(f"Domain file {filename} does not exist.")
FileNotFoundError: Domain file /gpfs/home5/sanusm/phd/TO-JAX/experiments/benchmarking/hp_tuning_b/exp_data_baseline_domain does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gpfs/home5/sanusm/phd/TO-JAX/experiments/benchmarking/hp_tuning_b/main.py", line 80, in main_func
    process(config)
  File "/gpfs/home5/sanusm/phd/TO-JAX/experiments/benchmarking/hp_tuning_b/main.py", line 62, in process
    data = f3dasm.ExperimentData.from_file(filename='exp_data_{}'.format(
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/experimentdata.py", line 152, in from_file
    return cls._from_file_attempt(filename_with_path, text_io)
  File "/home/sanusm/.conda/envs/to_jax_env/lib/python3.9/site-packages/f3dasm/design/experimentdata.py", line 283, in _from_file_attempt
    raise FileNotFoundError(f"Cannot find the file {filename}_data.csv.")
FileNotFoundError: Cannot find the file /gpfs/home5/sanusm/phd/TO-JAX/experiments/benchmarking/hp_tuning_b/exp_data_baseline_data.csv.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I am using version 1.3.0.
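
If the cause is that some nodes try to read the experiment files before another node has finished creating them (which the issue title suggests, but the traceback alone does not prove), one possible user-side workaround is to poll for the files before loading. The sketch below uses only the standard library; `wait_for_files` and its `timeout` and `poll_interval` parameters are hypothetical names and are not part of f3dasm.

```python
import time
from pathlib import Path


def wait_for_files(paths, timeout=60.0, poll_interval=0.5):
    """Block until every path exists; raise FileNotFoundError on timeout.

    Hypothetical helper for illustration, not an f3dasm function.
    """
    deadline = time.monotonic() + timeout
    pending = [Path(p) for p in paths]
    while True:
        # Keep only the paths that are still missing.
        pending = [p for p in pending if not p.exists()]
        if not pending:
            return
        if time.monotonic() >= deadline:
            missing = ", ".join(str(p) for p in pending)
            raise FileNotFoundError(f"Timed out waiting for: {missing}")
        time.sleep(poll_interval)
```

It would be called just before `f3dasm.ExperimentData.from_file(...)` in `main.py`, with the files named in the traceback (for example `exp_data_baseline_data.csv`). Note that a file existing does not guarantee it has been fully written, so this only narrows the window rather than closing it.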

SNMS95 · Nov 03 '23 14:11