Problem when running train.py
Hi Jasper,
I am at the step of running the train.py script, but I hit a problem in _batch_normalization.py. Do you by chance know what is causing this?
I ran the following command:
python3 /keras-retinanet/keras_retinanet/bin/train.py --tensorboard-dir ~/Garden/TOB/TrainingOutput --snapshot-path ~/Garden/TOB/TrainingOutput/snapshots --random-transform --steps 100 pascal ~/Garden/TOB/TouristVOC
Here is the error:
2020-08-13 11:57:48.309407: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Creating model, this may take a second...
2020-08-13 11:57:49.339186: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-08-13 11:57:49.348649: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-13 11:57:49.349025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GT 630 computeCapability: 3.0
coreClock: 0.8755GHz coreCount: 1 deviceMemorySize: 1.94GiB deviceMemoryBandwidth: 26.55GiB/s
2020-08-13 11:57:49.349053: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-13 11:57:49.350303: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-08-13 11:57:49.351523: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-08-13 11:57:49.351754: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-08-13 11:57:49.353637: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-08-13 11:57:49.354443: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-08-13 11:57:49.354591: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2020-08-13 11:57:49.354603: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-08-13 11:57:49.354849: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-08-13 11:57:49.379606: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3392355000 Hz
2020-08-13 11:57:49.380015: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x530f700 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-08-13 11:57:49.380035: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-08-13 11:57:49.381953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-13 11:57:49.381971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]
Traceback (most recent call last):
File "/home/user/Garden/TOB/Repos/keras-retinanet/keras_retinanet/bin/train.py", line 547, in <module>
/home/user/.local/lib/python3.8/site-packages/keras_resnet/layers/_batch_normalization.py:17 call *
return super(BatchNormalization, self).call(training=(not self.freeze), *args, **kwargs)
TypeError: type object got multiple values for keyword argument 'training'
Thank you in advance for your help man
Sounds like Keras may have refactored this code in a version change. I think this describes the issue: https://stackoverflow.com/questions/62629864/typeerror-type-object-got-multiple-values-for-keyword-argument-training
The fix is to use Keras version 2.3.1 instead of the newest one.
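The TypeError in the traceback can be reproduced without Keras. Below is a minimal sketch, assuming the caller also passes training as a keyword argument: the wrapper then forwards training once explicitly and once more through **kwargs, so Python sees the keyword twice. The classes Base, FrozenLayer, and PatchedLayer are hypothetical stand-ins, not the actual keras_resnet code, and PatchedLayer shows only one possible workaround, not necessarily the upstream fix.

```python
class Base:
    """Hypothetical stand-in for the parent BatchNormalization layer."""

    def call(self, inputs, training=None):
        return inputs, training


class FrozenLayer(Base):
    """Mirrors the call pattern shown in the traceback."""

    def __init__(self, freeze):
        self.freeze = freeze

    def call(self, *args, **kwargs):
        # If kwargs already holds 'training', the keyword is passed twice.
        return super().call(training=(not self.freeze), *args, **kwargs)


class PatchedLayer(FrozenLayer):
    """One possible workaround: discard the caller's 'training' value."""

    def call(self, *args, **kwargs):
        kwargs.pop("training", None)
        return super().call(*args, **kwargs)


def try_call(layer):
    """Return the call result, or the TypeError message if the call fails."""
    try:
        return layer.call("x", training=True)
    except TypeError as exc:
        return str(exc)


broken = try_call(FrozenLayer(freeze=True))
patched = try_call(PatchedLayer(freeze=True))
print(broken)   # the "multiple values for keyword argument" message
print(patched)  # the inputs plus the training flag chosen by the layer
```

Pinning Keras to 2.3.1, as suggested above, avoids the problem without editing the installed package.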
Thank you for your response. I was able to run the script after downgrading Keras to 2.3.1. However, I am now unable to train the model because of an "UnreadVariable" error, shown below:
None
/home/user/.local/lib/python3.8/site-packages/keras/callbacks/tensorboard_v2.py:92: UserWarning: The TensorBoard callback batch_size argument (for histogram computation) is deprecated with TensorFlow 2.0. It will be ignored.
warnings.warn('The TensorBoard callback batch_size argument '
2020-08-14 11:36:10.041933: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-08-14 11:36:10.043050: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1391] Profiler found 1 GPUs
2020-08-14 11:36:10.043239: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcupti.so.10.1'; dlerror: libcupti.so.10.1: cannot open shared object file: No such file or directory
2020-08-14 11:36:10.043314: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcupti.so'; dlerror: libcupti.so: cannot open shared object file: No such file or directory
2020-08-14 11:36:10.043325: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
Traceback (most recent call last):
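File "keras_retinanet/bin/train.py", line 547, in <module>
main()
File "keras_retinanet/bin/train.py", line 532, in main
return training_model.fit_generator(
File "/home/user/.local/lib/python3.8/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/keras/engine/training.py", line 1718, in fit_generator
return training_generator.fit_generator(
File "/home/user/.local/lib/python3.8/site-packages/keras/engine/training_generator.py", line 42, in fit_generator
model._make_train_function()
File "/home/user/.local/lib/python3.8/site-packages/keras/engine/training.py", line 328, in _make_train_function
self.train_function = K.function(
File "/home/user/.local/lib/python3.8/site-packages/keras/backend/tensorflow_backend.py", line 3007, in function
return tf_keras_backend.function(inputs, outputs,
File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/keras/backend.py", line 3932, in function
raise ValueError('updates argument is not supported during '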
ValueError: updates argument is not supported during eager execution. You passed: [<tf.Variable 'UnreadVariable' shape=() dtype=int64, numpy=0>, <tf.Variable 'UnreadVariable' shape=(7, 7, 3, 64) dtype=float32, numpy=
array([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
Do you think this is because my CUDA and cuDNN versions are not compatible with keras-retinanet? I am currently using Ubuntu 20.04, CUDA 11.0.2, and cuDNN 8.0.2. Thank you in advance for your help :D
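The warnings in both logs name the exact shared libraries this TensorFlow build tries to open (libcudart.so.10.1, libcudnn.so.7, libcupti.so.10.1), which suggests it expects CUDA 10.1 and cuDNN 7 rather than the installed 11.0.2 and 8.0.2. A small sketch for checking which of those libraries the dynamic loader can actually find, assuming a Linux system; the helper names can_load and check_libs are hypothetical:

```python
import ctypes

# Library names copied from the TensorFlow log output above.
REQUIRED_LIBS = [
    "libcudart.so.10.1",
    "libcublas.so.10",
    "libcudnn.so.7",
    "libcupti.so.10.1",
]


def can_load(lib_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        return False


def check_libs(lib_names=REQUIRED_LIBS):
    """Map each library name to whether it could be loaded."""
    return {name: can_load(name) for name in lib_names}


if __name__ == "__main__":
    for name, ok in check_libs().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

A missing libcudnn.so.7 only makes TensorFlow skip the GPU and fall back to the CPU, so it is probably not the cause of the ValueError itself.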
It's probably also a versioning issue. It might be CUDA, but some people report that eager execution issue across different TensorFlow versions: https://github.com/tensorflow/tensorflow/issues/20372
Can you install tensorflow 1.11?
Otherwise, you may need to look at eager/non-eager execution https://github.com/tensorflow/tensorflow/issues/41239
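To make the eager/non-eager suggestion concrete, here is a minimal sketch, assuming TensorFlow 2.x is installed under the distribution name "tensorflow" (a tensorflow-gpu install would need that name instead). It reports the installed version and switches to graph mode with tf.compat.v1.disable_eager_execution(). The helper names are hypothetical, and whether this resolves the "updates argument" error depends on the Keras/TensorFlow combination in use.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def disable_eager_if_tf2():
    """Switch TensorFlow 2.x to graph mode; return True if the switch was made."""
    version = installed_version("tensorflow")
    if version is None or not version.startswith("2."):
        # TensorFlow is absent, or it is 1.x where graph mode is the default.
        return False
    import tensorflow as tf

    # Must run before any model or tensor is created.
    tf.compat.v1.disable_eager_execution()
    return True


if __name__ == "__main__":
    for name in ("tensorflow", "keras"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
    print("eager execution disabled:", disable_eager_if_tf2())
```

If this route is taken, the call would have to happen at the very start of the training script, before the model is built.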
Thank you for your suggestion. I am able to run the train.py script now. However, I am running into a new problem when trying to open TensorBoard. Do you by chance know what this problem is about?
tensorboard --logdir ~/Garden/TOB/TrainingOutput
W0817 10:59:11.736817 140711268804352 plugin_event_accumulator.py:321] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0817 10:59:11.761196 140711268804352 plugin_event_accumulator.py:321] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0817 10:59:11.764438 140711268804352 plugin_event_accumulator.py:321] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0817 10:59:11.768529 140711268804352 plugin_event_accumulator.py:321] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0817 10:59:11.772261 140711268804352 plugin_event_accumulator.py:321] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Traceback (most recent call last):
File "/home/user/.local/bin/tensorboard", line 8, in <module>
https://github.com/tensorflow/tensorboard/issues/3443
@hiennguyentum Hi, I am facing the same error as the one you described below. Could you please elaborate on how you were able to mitigate it? I could not understand the solution quoted above.
Thanks in advance, Vishal
Thank you for your response. I was able to run the script after downgrading Keras to 2.3.1. However, I am now unable to train the model because of an "UnreadVariable" error, shown below:
None
/home/user/.local/lib/python3.8/site-packages/keras/callbacks/tensorboard_v2.py:92: UserWarning: The TensorBoard callback batch_size argument (for histogram computation) is deprecated with TensorFlow 2.0. It will be ignored.
warnings.warn('The TensorBoard callback batch_size argument '
2020-08-14 11:36:10.041933: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-08-14 11:36:10.043050: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1391] Profiler found 1 GPUs
2020-08-14 11:36:10.043239: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcupti.so.10.1'; dlerror: libcupti.so.10.1: cannot open shared object file: No such file or directory
2020-08-14 11:36:10.043314: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcupti.so'; dlerror: libcupti.so: cannot open shared object file: No such file or directory
2020-08-14 11:36:10.043325: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
Traceback (most recent call last):
File "keras_retinanet/bin/train.py", line 547, in <module>
main()
File "keras_retinanet/bin/train.py", line 532, in main
return training_model.fit_generator(
File "/home/user/.local/lib/python3.8/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/keras/engine/training.py", line 1718, in fit_generator
return training_generator.fit_generator(
File "/home/user/.local/lib/python3.8/site-packages/keras/engine/training_generator.py", line 42, in fit_generator
model._make_train_function()
File "/home/user/.local/lib/python3.8/site-packages/keras/engine/training.py", line 328, in _make_train_function
self.train_function = K.function(
File "/home/user/.local/lib/python3.8/site-packages/keras/backend/tensorflow_backend.py", line 3007, in function
return tf_keras_backend.function(inputs, outputs,
File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/keras/backend.py", line 3932, in function
raise ValueError('updates argument is not supported during '
ValueError: updates argument is not supported during eager execution. You passed: [<tf.Variable 'UnreadVariable' shape=() dtype=int64, numpy=0>, <tf.Variable 'UnreadVariable' shape=(7, 7, 3, 64) dtype=float32, numpy=
array([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
Do you think this is because my CUDA and cuDNN versions are not compatible with keras-retinanet? I am currently using Ubuntu 20.04, CUDA 11.0.2, and cuDNN 8.0.2. Thank you in advance for your help :D