Retinanet-Tutorial

Problem when running train.py

Open ghost opened this issue 5 years ago • 6 comments

Hi Jasper,

I am at the step of running the train.py script, but I get an error raised from _batch_normalization.py. Do you by chance know what is causing this?

I ran the following command:

python3 /keras-retinanet/keras_retinanet/bin/train.py --tensorboard-dir ~/Garden/TOB/TrainingOutput --snapshot-path ~/Garden/TOB/TrainingOutput/snapshots --random-transform --steps 100 pascal ~/Garden/TOB/TouristVOC


Here is the error:

2020-08-13 11:57:48.309407: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Creating model, this may take a second...
2020-08-13 11:57:49.339186: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-08-13 11:57:49.348649: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-13 11:57:49.349025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GT 630 computeCapability: 3.0
coreClock: 0.8755GHz coreCount: 1 deviceMemorySize: 1.94GiB deviceMemoryBandwidth: 26.55GiB/s
2020-08-13 11:57:49.349053: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-13 11:57:49.350303: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-08-13 11:57:49.351523: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-08-13 11:57:49.351754: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-08-13 11:57:49.353637: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-08-13 11:57:49.354443: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-08-13 11:57:49.354591: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2020-08-13 11:57:49.354603: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-08-13 11:57:49.354849: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-08-13 11:57:49.379606: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3392355000 Hz
2020-08-13 11:57:49.380015: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x530f700 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-08-13 11:57:49.380035: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-08-13 11:57:49.381953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-13 11:57:49.381971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]
Traceback (most recent call last):
  File "/home/user/Garden/TOB/Repos/keras-retinanet/keras_retinanet/bin/train.py", line 547, in <module>
    main()
  File "/home/user/Garden/TOB/Repos/keras-retinanet/keras_retinanet/bin/train.py", line 500, in main
    model, training_model, prediction_model = create_models(
  File "/home/user/Garden/TOB/Repos/keras-retinanet/keras_retinanet/bin/train.py", line 117, in create_models
    model = model_with_weights(backbone_retinanet(num_classes, num_anchors=num_anchors, modifier=modifier, pyramid_levels=pyramid_levels), weights=weights, skip_mismatch=True)
  File "/home/user/Garden/TOB/Repos/keras-retinanet/keras_retinanet/bin/../../keras_retinanet/models/resnet.py", line 38, in retinanet
    return resnet_retinanet(*args, backbone=self.backbone, **kwargs)
  File "/home/user/Garden/TOB/Repos/keras-retinanet/keras_retinanet/bin/../../keras_retinanet/models/resnet.py", line 99, in resnet_retinanet
    resnet = keras_resnet.models.ResNet50(inputs, include_top=False, freeze_bn=True)
  File "/home/user/.local/lib/python3.8/site-packages/keras_resnet/models/_2d.py", line 188, in ResNet50
    return ResNet(inputs, blocks, numerical_names=numerical_names, block=keras_resnet.blocks.bottleneck_2d, include_top=include_top, classes=classes, *args, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/keras_resnet/models/_2d.py", line 66, in ResNet
    x = keras_resnet.layers.BatchNormalization(axis=axis, epsilon=1e-5, freeze=freeze_bn, name="bn_conv1")(x)
  File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py", line 925, in __call__
    return self._functional_construction_call(inputs, args, kwargs,
  File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1117, in _functional_construction_call
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 258, in wrapper
    raise e.ag_error_metadata.to_exception(e)
TypeError: in user code:

/home/user/.local/lib/python3.8/site-packages/keras_resnet/layers/_batch_normalization.py:17 call  *
    return super(BatchNormalization, self).call(training=(not self.freeze), *args, **kwargs)

TypeError: type object got multiple values for keyword argument 'training'

Thank you in advance for your help man

ghost • Aug 13 '20 10:08

Sounds like Keras may have refactored this code in a version change. I think this describes the issue: https://stackoverflow.com/questions/62629864/typeerror-type-object-got-multiple-values-for-keyword-argument-training

The fix is to use Keras version 2.3.1 instead of the newest one.
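
For anyone landing here with the same TypeError, here is a minimal sketch of what appears to be happening, written without TensorFlow. The traceback shows keras_resnet's BatchNormalization.call passing training=... explicitly while also forwarding **kwargs; if the caller also supplies training as a keyword (which seems to be what the newer Keras does), Python receives the argument twice. The classes Base, Frozen and FrozenFixed are hypothetical stand-ins, not the library's code, and FrozenFixed is only one possible way to avoid the clash, not an upstream fix.

```python
class Base:
    """Stand-in for a Keras layer whose call() accepts a training flag."""

    def call(self, inputs, training=None):
        return inputs, training


class Frozen(Base):
    """Mirrors the pattern in the traceback: training is passed explicitly
    and **kwargs is forwarded as well."""

    def __init__(self, freeze=True):
        self.freeze = freeze

    def call(self, *args, **kwargs):
        # If kwargs already contains "training", this call receives it twice.
        return super().call(training=(not self.freeze), *args, **kwargs)


class FrozenFixed(Frozen):
    """Hypothetical workaround: accept training as a named parameter so it
    can never end up inside **kwargs."""

    def call(self, inputs, training=None, **kwargs):
        if self.freeze:
            training = False  # a frozen layer always runs in inference mode
        return Base.call(self, inputs, training=training, **kwargs)


if __name__ == "__main__":
    try:
        Frozen().call("x", training=True)  # caller also passes training
    except TypeError as err:
        print("TypeError:", err)
    print(FrozenFixed().call("x", training=True))
```

Pinning Keras to 2.3.1 as suggested above sidesteps the problem without touching the installed keras_resnet package.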

jaspereb • Aug 14 '20 00:08

Thank you for your response. I was able to run the script after downgrading Keras to version 2.3.1. However, I am now unable to train the model because of an "UnreadVariable" error, shown below:


None
/home/user/.local/lib/python3.8/site-packages/keras/callbacks/tensorboard_v2.py:92: UserWarning: The TensorBoard callback batch_size argument (for histogram computation) is deprecated with TensorFlow 2.0. It will be ignored.
  warnings.warn('The TensorBoard callback batch_size argument '
2020-08-14 11:36:10.041933: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-08-14 11:36:10.043050: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1391] Profiler found 1 GPUs
2020-08-14 11:36:10.043239: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcupti.so.10.1'; dlerror: libcupti.so.10.1: cannot open shared object file: No such file or directory
2020-08-14 11:36:10.043314: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcupti.so'; dlerror: libcupti.so: cannot open shared object file: No such file or directory
2020-08-14 11:36:10.043325: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1441] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
Traceback (most recent call last):
  File "keras_retinanet/bin/train.py", line 547, in <module>
    main()
  File "keras_retinanet/bin/train.py", line 532, in main
    return training_model.fit_generator(
  File "/home/user/.local/lib/python3.8/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/keras/engine/training.py", line 1718, in fit_generator
    return training_generator.fit_generator(
  File "/home/user/.local/lib/python3.8/site-packages/keras/engine/training_generator.py", line 42, in fit_generator
    model._make_train_function()
  File "/home/user/.local/lib/python3.8/site-packages/keras/engine/training.py", line 328, in _make_train_function
    self.train_function = K.function(
  File "/home/user/.local/lib/python3.8/site-packages/keras/backend/tensorflow_backend.py", line 3007, in function
    return tf_keras_backend.function(inputs, outputs,
  File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/keras/backend.py", line 3932, in function
    raise ValueError('updates argument is not supported during '
ValueError: updates argument is not supported during eager execution. You passed: [<tf.Variable 'UnreadVariable' shape=() dtype=int64, numpy=0>, <tf.Variable 'UnreadVariable' shape=(7, 7, 3, 64) dtype=float32, numpy=
array([[[[0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.]],

        [[0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.]],

Do you think this is because my CUDA and cuDNN versions are not compatible with keras-retinanet? I am currently using Ubuntu 20.04, CUDA 11.0.2, and cuDNN 8.0.2. Thank you in advance for your help :D

ghost • Aug 14 '20 09:08

It's probably also a versioning issue. Might be CUDA but there are some people reporting that eager execution issue for different tensorflow versions. https://github.com/tensorflow/tensorflow/issues/20372

Can you install tensorflow 1.11?

Otherwise, you may need to look at eager/non-eager execution https://github.com/tensorflow/tensorflow/issues/41239
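
Since every problem in this thread so far has come down to version mismatches, it may help to post the exact installed versions when asking for help. A small sketch using only the standard library (Python 3.8+); the package list is an assumption about which distributions matter here, so adjust it to your setup.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list of relevant distributions; not taken from the tutorial.
    wanted = ["tensorflow", "keras", "keras-resnet", "tensorboard"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the same interpreter that runs train.py (python3 here), otherwise it may report a different environment.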

jaspereb • Aug 17 '20 04:08

Thank you for your suggestion; I am able to run the train.py script now. However, I am running into a new problem when trying to open TensorBoard. Do you by chance know what this problem is about?


tensorboard --logdir ~/Garden/TOB/TrainingOutput
W0817 10:59:11.736817 140711268804352 plugin_event_accumulator.py:321] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0817 10:59:11.761196 140711268804352 plugin_event_accumulator.py:321] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0817 10:59:11.764438 140711268804352 plugin_event_accumulator.py:321] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0817 10:59:11.768529 140711268804352 plugin_event_accumulator.py:321] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
W0817 10:59:11.772261 140711268804352 plugin_event_accumulator.py:321] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Traceback (most recent call last):
  File "/home/user/.local/bin/tensorboard", line 8, in <module>
    sys.exit(run_main())
  File "/home/user/.local/lib/python3.8/site-packages/tensorboard/main.py", line 75, in run_main
    app.run(tensorboard.main, flags_parser=tensorboard.configure)
  File "/home/user/.local/lib/python3.8/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/user/.local/lib/python3.8/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/user/.local/lib/python3.8/site-packages/tensorboard/program.py", line 289, in main
    return runner(self.flags) or 0
  File "/home/user/.local/lib/python3.8/site-packages/tensorboard/program.py", line 305, in _run_serve_subcommand
    server = self._make_server()
  File "/home/user/.local/lib/python3.8/site-packages/tensorboard/program.py", line 408, in _make_server
    app = application.standard_tensorboard_wsgi(
  File "/home/user/.local/lib/python3.8/site-packages/tensorboard/backend/application.py", line 146, in standard_tensorboard_wsgi
    return TensorBoardWSGIApp(
  File "/home/user/.local/lib/python3.8/site-packages/tensorboard/backend/application.py", line 225, in TensorBoardWSGIApp
    return TensorBoardWSGI(
  File "/home/user/.local/lib/python3.8/site-packages/tensorboard/backend/application.py", line 298, in __init__
    raise ValueError(
ValueError: Duplicate plugins for name projector


ghost • Aug 17 '20 09:08

https://github.com/tensorflow/tensorboard/issues/3443
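
In case the link goes stale: a commonly reported cause of "Duplicate plugins for name projector" is more than one TensorBoard distribution installed side by side (for example tensorboard together with tb-nightly). Whether that is the cause in this particular setup is an assumption. A rough standard-library sketch to list what is installed; find_distributions is a hypothetical helper name, not part of TensorBoard.

```python
from importlib import metadata


def find_distributions(keywords):
    """Return sorted (name, version) pairs for installed distributions
    whose name contains any of the given lowercase keywords."""
    matches = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
            version = dist.version
        except Exception:
            continue  # skip distributions with unreadable metadata
        if name and any(keyword in name.lower() for keyword in keywords):
            matches.add((name, version))
    return sorted(matches)


if __name__ == "__main__":
    # More than one core TensorBoard package in this list would be suspicious.
    for name, version in find_distributions(["tensorboard", "tb-nightly"]):
        print(name, version)
```

If several TensorBoard packages show up, uninstalling all of them and reinstalling a single one is a reasonable thing to try.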

jaspereb • Aug 18 '20 23:08

@hiennguyentum Hi, I am also facing the same error as the one you described above: the "updates argument is not supported during eager execution" ValueError (with the "UnreadVariable" entries) that appeared after downgrading Keras to 2.3.1. Could you please elaborate on how you were able to resolve it? I could not understand the solution suggested above.

Thanks in advance, Vishal

visriv • Dec 09 '20 09:12