Keras-SRGAN

Error in training

Open lchunleo opened this issue 5 years ago • 4 comments

Hi

I am trying to run the training but encountered the following issue.

UnboundLocalError: local variable 'discriminator_loss' referenced before assignment

It is raised when the script reaches print("discriminator_loss : %f" % discriminator_loss).

lchunleo avatar Apr 27 '20 11:04 lchunleo

Are you still facing this issue?

deepak112 avatar May 04 '20 20:05 deepak112

Are you still facing this issue?

Thanks for checking. I managed to resolve the above issue; it was caused by some path problems.
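A minimal sketch of how a wrong input path can lead to this kind of UnboundLocalError, assuming the training loop assigns discriminator_loss only inside the per-batch loop. The function names (train_one_epoch, train_one_epoch_checked) are hypothetical and are not taken from train.py; the loss computation is a stand-in.

```python
def train_one_epoch(batches):
    """Mimic a loop that binds the loss only inside the batch loop."""
    for batch in batches:
        # Stand-in for the real discriminator update; not the repo's code.
        discriminator_loss = sum(batch) / len(batch)
    # With zero batches the name was never bound, so this line raises
    # UnboundLocalError.
    return "discriminator_loss : %f" % discriminator_loss


def train_one_epoch_checked(batches):
    """Same loop, but fail early with a clearer message when no data loaded."""
    batches = list(batches)
    if not batches:
        raise ValueError("no training batches; check the input directory path")
    return train_one_epoch(batches)
```

If the image directory is wrong and no batches are produced, the first version fails at the print-style line, while the checked version points at the path instead.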

I tried to train on my own dataset on Google Colab, but I am still unable to get it running. Is it very resource intensive? I tried reducing the batch size to almost the minimum, but it still does not run. I resized my images to 384x384.

--batch_size=2 --epochs=3 --number_of_images=805 --train_test_ratio=0.8

2020-05-05 02:12:47.508295: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-05 02:12:50.039906: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2300000000 Hz
2020-05-05 02:12:50.041986: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2b5d480 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-05 02:12:50.042028: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-05-05 02:12:50.044879: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-05 02:12:50.046654: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-05-05 02:12:50.046687: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (d1ec168119ba): /proc/driver/nvidia/version does not exist
tcmalloc: large alloc 1139539968 bytes == 0x42f50000 @ 0x7fe4800d11e7 0x7fe47dc395e1 0x7fe47dc9e8e0 0x7fe47dd2c447 0x50ac25 0x50c5b9 0x509d48 0x50aa7d 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x509d48 0x50aa7d 0x50c5b9 0x508245 0x50b403 0x635222 0x6352d7 0x638a8f 0x639631 0x4b0f40 0x7fe47fcceb97 0x5b2fda
tcmalloc: large alloc 1207959552 bytes == 0x97eaa000 @ 0x7fe4800b3b6b 0x7fe4800d3379 0x7fe3ee4b01f7 0x7fe3e260be4f 0x7fe3e2692e6b 0x7fe3e2501996 0x7fe3e250237b 0x7fe3e25024a7 0x7fe3ec966113 0x7fe3ec969fe7 0x7fe3e6d11544 0x7fe3e6d11d7f 0x7fe3e6ce4a9b 0x7fe3e6ce5670 0x7fe3e6d0d1c4 0x7fe3e6cdf64c 0x7fe3e6ce2c42 0x7fe3e68b616b 0x7fe3e68a4e11 0x7fe3e6556b71 0x7fe4703ba817 0x7fe4703dc4f4 0x50ac25 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50d390 0x508245 0x50a080 0x50aa7d
--------------- Epoch 1 ---------------
0% 0/322 [00:00<?, ?it/s]
2020-05-05 02:13:34.929150: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 12230590464 exceeds 10% of free system memory.
2020-05-05 02:13:34.929150: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 12230590464 exceeds 10% of free system memory.
tcmalloc: large alloc 12230590464 bytes == 0x35285c000 @ 0x7fe4800b3b6b 0x7fe4800d3379 0x7fe3ee4b01f7 0x7fe3e260be4f 0x7fe3e2692e6b 0x7fe3e2501996 0x7fe3e2504a7d 0x7fe3ebdbd116 0x7fe3e271db42 0x7fe3e270fe85 0x7fe3e280c4e1 0x7fe3e28091d3 0x7fe47e9b36df 0x7fe47fa956db 0x7fe47fdce88f
tcmalloc: large alloc 12230590464 bytes == 0x62c05c000 @ 0x7fe4800b3b6b 0x7fe4800d3379 0x7fe3ee4b01f7 0x7fe3e260be4f 0x7fe3e2692e6b 0x7fe3e2501996 0x7fe3e2504a7d 0x7fe3ebf21847 0x7fe3e271db42 0x7fe3e270fe85 0x7fe3e280c4e1 0x7fe3e28091d3 0x7fe47e9b36df 0x7fe47fa956db 0x7fe47fdce88f
0% 1/322 [00:42<3:47:39, 42.55s/it]
Traceback (most recent call last):
  File "train.py", line 138, in <module>
    train(values.epochs, values.batch_size, values.input_dir, values.output_dir, values.model_save_dir, values.number_of_images, values.train_test_ratio)
  File "train.py", line 83, in train
    d_loss_real = discriminator.train_on_batch(image_batch_hr, real_data_Y)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py", line 3792, in call
    outputs = self._graph_fn(*converted_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1605, in call
    return self._call_impl(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1645, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 598, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Error while reading resource variable _AnonymousVar348 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/_AnonymousVar348/N10tensorflow3VarE does not exist.
  [[node mul_21/ReadVariableOp (defined at /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3009) ]] [Op:__inference_keras_scratch_graph_18053]

Function call stack: keras_scratch_graph
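Two things in the log above bear on the resource question: the CUDA_ERROR_NO_DEVICE line, which suggests the run did not see a GPU, and the size of the largest requested allocation. The sketch below only reads log text and does the unit conversion; summarize_log is a hypothetical helper, not part of the repository, and it does not diagnose the FailedPreconditionError itself.

```python
import re


def summarize_log(log_text):
    """Flag a missing CUDA device and report the largest allocation in GiB."""
    # This string is printed by TensorFlow when cuInit finds no GPU.
    no_gpu = "CUDA_ERROR_NO_DEVICE" in log_text
    # Sizes come from the "Allocation of N exceeds ..." warnings, in bytes.
    sizes = [int(n) for n in re.findall(r"Allocation of (\d+) exceeds", log_text)]
    largest_gib = max(sizes) / 2**30 if sizes else 0.0
    return {"no_gpu": no_gpu, "largest_alloc_gib": largest_gib}


sample = (
    "failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n"
    "Allocation of 12230590464 exceeds 10% of free system memory.\n"
)
summary = summarize_log(sample)
print(summary)  # {'no_gpu': True, 'largest_alloc_gib': 11.390625}
```

A single request of roughly 11.4 GiB on a CPU-only runtime is consistent with the slow first step (42.55 s/it) shown in the progress bar.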

lchunleo avatar May 05 '20 02:05 lchunleo

Hello, I can run the network and utils files without any error, but when I try to run the training script I get an error. I do not understand how to specify the input directory, or what modifications are needed to use a dataset of images as training input. Can you please help me run this code?

Error:
usage: ipykernel_launcher.py [-h] [-i INPUT_DIR] [-o OUTPUT_DIR] [-m MODEL_SAVE_DIR] [-b BATCH_SIZE] [-e EPOCHS] [-n NUMBER_OF_IMAGES] [-r TRAIN_TEST_RATIO]
ipykernel_launcher.py: error: unrecognized arguments: -f /root/.local/share/jupyter/runtime/kernel-e1242cbc-2f55-4f63-aae5-b18b2fbfa737.json
An exception has occurred, use %tb to see the full traceback.

SystemExit: 2
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:2890: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

This is the error I encountered while running the training code.
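The "unrecognized arguments: -f ..." message typically appears when an argparse-based script is run inside a notebook cell, because the kernel's own command line is passed to the parser. A minimal sketch of two common workarounds follows. The short flags are taken from the usage message above; the long option names and all default values are assumptions for illustration, and the kernel path is a made-up placeholder.

```python
import argparse


def build_parser():
    """Build a reduced parser; long names and defaults are assumed, not the repo's."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input_dir", default="./data/")
    parser.add_argument("-b", "--batch_size", type=int, default=64)
    parser.add_argument("-e", "--epochs", type=int, default=1000)
    return parser


# Roughly what a notebook kernel passes on its command line (placeholder path).
kernel_argv = ["-f", "/tmp/kernel-example.json"]

# Workaround 1: parse_known_args() collects unknown arguments instead of exiting.
values, unknown = build_parser().parse_known_args(kernel_argv)

# Workaround 2: pass the intended arguments explicitly as a list.
explicit = build_parser().parse_args(["-i", "./my_images/", "-b", "2", "-e", "3"])
print(explicit.input_dir, explicit.batch_size, explicit.epochs)
```

Running the script from a shell cell instead (for example with a leading ! in Colab) avoids the problem as well, since the kernel arguments are then not involved.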

RAKSHIT0406 avatar Oct 08 '20 11:10 RAKSHIT0406

Hello all, I need code that implements the following part:

"The SRResNet networks were trained with a learning rate of 10^-4 and 10^6 update iterations. We employed the trained MSE-based SRResNet network as initialization for the generator when training the actual GAN to avoid undesired local optima."
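One possible shape for the two-stage procedure in the quoted passage, as a sketch only. It assumes generator is a Keras model that was already compiled with an MSE loss and an Adam optimizer at learning rate 1e-4, and that batches yields (low_res, high_res) pairs. Both function names and the weights_path argument are hypothetical; train_on_batch, save_weights and load_weights are standard Keras model methods.

```python
def pretrain_generator(generator, batches, weights_path):
    """Stage 1: train the generator alone with its compiled (MSE) loss.

    Assumes `generator` was compiled beforehand with loss="mse" and Adam
    at learning rate 1e-4; the quoted passage uses 10^6 update iterations.
    """
    last_loss = None
    for low_res, high_res in batches:
        # One update step on a (low-resolution, high-resolution) batch pair.
        last_loss = generator.train_on_batch(low_res, high_res)
    # Keep the pretrained weights so the GAN stage can start from them.
    generator.save_weights(weights_path)
    return last_loss


def init_gan_generator(generator, weights_path):
    """Stage 2 setup: start GAN training from the MSE-pretrained weights."""
    generator.load_weights(weights_path)
    return generator
```

After init_gan_generator, the usual adversarial loop would run with this generator in place of a randomly initialized one.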

BassantTolba1234 avatar Nov 28 '20 09:11 BassantTolba1234