ValueError: The filepath provided must end in `.keras` (Keras model format)
Hi!
I have been trying to use AiZynthTrain to train AiZynthFinder with some personal reactions and reaction templates. I have mapped and cleaned the reaction and template files with my own protocols, and my goal is to retrain AiZynthFinder without running any additional cleaning/preparation steps.
I have used the expansion pipeline with the following config file:
expansion_model_pipeline:
  python_kernel: aizynthtrain
  file_prefix: test
  nbatches: 200
  training_fraction: 0.9
  random_seed: 1689
  selected_ids_path: "lookup_templates.json"
And I got the following errors during training:
2024-09-16 13:44:07.814 [1726483142464316/model_training/206 (pid 3123416)] Task is starting.
2024-09-16 13:44:08.591 [1726483142464316/model_training/206 (pid 3123416)] 2024-09-16 13:44:08.591729: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-16 13:44:08.604 [1726483142464316/model_training/206 (pid 3123416)] 2024-09-16 13:44:08.604123: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-16 13:44:08.607 [1726483142464316/model_training/206 (pid 3123416)] 2024-09-16 13:44:08.607848: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-16 13:44:13.767 [1726483142464316/model_training/206 (pid 3123416)] <flow ExpansionModelFlow step model_training> failed:
2024-09-16 13:44:13.873 [1726483142464316/model_training/206 (pid 3123416)] Internal error
2024-09-16 13:44:13.875 [1726483142464316/model_training/206 (pid 3123416)] Traceback (most recent call last):
2024-09-16 13:44:13.875 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/cli.py", line 1134, in main
2024-09-16 13:44:13.875 [1726483142464316/model_training/206 (pid 3123416)] start(auto_envvar_prefix="METAFLOW", obj=state)
2024-09-16 13:44:13.875 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/tracing/__init__.py", line 27, in wrapper_func
2024-09-16 13:44:13.875 [1726483142464316/model_training/206 (pid 3123416)] return func(*args, **kwargs)
2024-09-16 13:44:14.668 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 829, in __call__
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] return self.main(*args, **kwargs)
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 782, in main
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] rv = self.invoke(ctx)
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 1259, in invoke
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 1066, in invoke
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] return ctx.invoke(self.callback, **ctx.params)
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 610, in invoke
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] return callback(*args, **kwargs)
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/_vendor/click/decorators.py", line 21, in new_func
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] return f(get_current_context(), *args, **kwargs)
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/cli.py", line 468, in step
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] task.run_step(
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/task.py", line 650, in run_step
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] self._exec_step_function(step_func)
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/metaflow/task.py", line 62, in _exec_step_function
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] step_function()
2024-09-16 13:44:14.669 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/aizynthtrain/pipelines/expansion_model_pipeline.py", line 83, in model_training
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] training_runner([self.config_path])
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/aizynthtrain/modelling/expansion_policy/training.py", line 83, in main
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] callbacks = setup_callbacks(
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/aizynthtrain/utils/keras_utils.py", line 76, in setup_callbacks
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] checkpoint = ModelCheckpoint(
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] File "/data/users/carespos/conda/envs/aizynthtrain/lib/python3.10/site-packages/keras/src/callbacks/model_checkpoint.py", line 191, in __init__
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] raise ValueError(
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] ValueError: The filepath provided must end in `.keras` (Keras model format). Received: filepath=test_keras_model.hdf5
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)]
2024-09-16 13:44:14.674 [1726483142464316/model_training/206 (pid 3123416)] Task failed.
2024-09-16 13:44:14.679 Workflow failed.
2024-09-16 13:44:14.679 Terminating 0 active tasks...
2024-09-16 13:44:14.679 Flushing logs...
Step failure:
Step model_training (task-id 206) failed.
where the final error is:
2024-09-16 13:44:14.670 [1726483142464316/model_training/206 (pid 3123416)] ValueError: The filepath provided must end in `.keras` (Keras model format). Received: filepath=test_keras_model.hdf5
Could you please let me know how to fix/debug this?
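For context on what goes wrong here: the error comes from Keras itself, whose newer (Keras 3) `ModelCheckpoint` callback rejects legacy `.hdf5`/`.h5` checkpoint paths and requires the native `.keras` extension, while AiZynthTrain still constructs a `.hdf5` filename. If pinning older versions is not an option, the checkpoint filename would need the new suffix before being handed to the callback. A minimal sketch of that rename (`as_keras3_checkpoint_path` is a hypothetical helper name, not part of AiZynthTrain):

```python
from pathlib import Path

def as_keras3_checkpoint_path(filepath: str) -> str:
    """Swap a legacy HDF5 checkpoint suffix for the .keras format
    that Keras 3's ModelCheckpoint requires; leave other paths as-is."""
    path = Path(filepath)
    if path.suffix in {".hdf5", ".h5"}:
        return str(path.with_suffix(".keras"))
    return filepath

# e.g. "test_keras_model.hdf5" -> "test_keras_model.keras"
```

This only addresses the filename check; downstream code that expects to load an HDF5 model would of course need the same adjustment.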
And is this the right pipeline to use when wanting to train AiZynthFinder without running any preparation step?
Many thanks!
Carmen
I fixed it by installing specific keras and tensorflow versions:
pip install keras==2.8.0
pip install tensorflow==2.8.0
pip install tensorboard==2.8.0
pip install tensorflow-serving-api==2.8.0
To avoid this issue from occurring in the future, these pinned dependencies could be added to the pyproject.toml.
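For illustration, the pins could look roughly like this in a PEP 621-style `[project]` table (assuming the project uses one; the exact dependency section in AiZynthTrain's pyproject.toml may differ):

```toml
[project]
dependencies = [
    "tensorflow==2.8.0",
    "keras==2.8.0",
    "tensorboard==2.8.0",
    "tensorflow-serving-api==2.8.0",
]
```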
However, I now get another error during validation:
FileNotFoundError: [Errno 2] No such file or directory: 'testing_template_library.csv'
Even though I did not configure the validation pipeline, it seems it is still being run.
Best, Carmen
Thanks for this. We are aware of these issues with some versions of tensorflow and/or keras. We are working on a refactored codebase that will address this.
Hi @SGenheden, are there any timelines for refactoring the codebase? I am able to run the expansion policy pipeline, but only on CPU, which takes around a day. Thank you!
FYI for anyone interested in setting this up: you need to install CUDA 11 in your conda env.
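For the tensorflow==2.8.0 pin above, the tested GPU toolchain is CUDA 11.2 with cuDNN 8.1, which can be installed into the conda environment along these lines (exact patch versions may vary with your channel):

```shell
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1
```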