dl-keras-tf
Error running the cnn-train code chunk in 02-cats-vs-dogs.Rmd
When running the code chunk below from a fresh session, I get the following error:
```r
history <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 100,
  epochs = 30,
  validation_data = validation_generator,
  validation_steps = 50,
  callbacks = callback_early_stopping(patience = 5)
)
```
```
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to
  ['...']
2020-01-28 00:41:21.650047: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-28 00:41:21.887585: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-28 00:41:22.620623: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 16.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Error in py_call_impl(callable, dots$args, dots$keywords) :
  ResourceExhaustedError: OOM when allocating tensor with shape[6272,512] and type float on
  /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node MatMul_3 (defined at /util/deprecation.py:324) ]]
  Hint: If you want to see a list of allocated tensors when OOM happens, add
  report_tensor_allocations_upon_oom to RunOptions for current allocation info.
  [Op:__inference_distributed_function_1290]
Function call stack:
distributed_function
```
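The ResourceExhaustedError means TensorFlow could not find enough free GPU memory to allocate the weights of the dense layer (shape [6272, 512]). A quick way to check what this R session actually has available is something like the sketch below (it assumes the tensorflow R package is installed and that `nvidia-smi` is on the server's PATH):

```r
library(tensorflow)

# Which GPUs this TensorFlow session can see
tf$config$experimental$list_physical_devices("GPU")

# Current memory usage on the card(s); requires the NVIDIA driver utilities
system("nvidia-smi --query-gpu=memory.used,memory.total --format=csv")
```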
Just adding a note that this issue relates to running the code on RStudio Server (with GPU support).
I can reproduce this problem. Investigating.
I think I'm running out of GPU memory, which wasn't a problem before. I do have two sessions running, but I'm not sure if that's relevant.
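By default TensorFlow pre-allocates essentially all of the GPU's memory in the first session that touches it, so a second RStudio Server session on the same card can hit OOM right away. One possible workaround is to enable memory growth; this is only a sketch (it assumes TF 2.x and the tensorflow R package, and it must run before any tensors are placed on the GPU):

```r
library(tensorflow)

gpus <- tf$config$experimental$list_physical_devices("GPU")
if (length(gpus) > 0) {
  # Allocate GPU memory on demand instead of reserving the whole card up front
  tf$config$experimental$set_memory_growth(gpus[[1]], TRUE)
}
```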
Dropping the batch size to 5 has got it moving again.
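For anyone trying the same workaround: the batch size is set on the generators, not in `fit_generator()`. A rough sketch of where it goes (the path and `image_data_generator()` settings here are placeholders, not necessarily the notebook's exact values):

```r
library(keras)

# A smaller batch_size means fewer images on the GPU at once,
# which lowers the peak memory needed per training step
train_generator <- flow_images_from_directory(
  "data/cats_vs_dogs/train",                      # hypothetical path
  generator   = image_data_generator(rescale = 1/255),
  target_size = c(150, 150),
  batch_size  = 5,
  class_mode  = "binary"
)
```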
I've tried dropping the batch size to 5, but I'm still getting errors. The code now progresses through all 30 epochs, whereas it was stopping at the first epoch with the larger batch size:
```
> history <-
+   model %>%
+   fit_generator(
+     train_generator,
+     steps_per_epoch = 100,
+     epochs = 30,
+     validation_data = validation_generator,
+     validation_steps = 50,
+     callbacks = callback_early_stopping(patience = 5)
+   )
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to
  ['...']
WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to
  ['...']
2020-01-29 00:01:20.648842: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-29 00:01:20.827311: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-29 00:01:21.513376: W tensorflow/core/common_runtime/bfc_allocator.cc:243] Allocator (GPU_0_bfc) ran out of memory trying to allocate 98.12MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
... snip ...
Found 2000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.
Train for 100 steps, validate for 50 steps
Epoch 1/30
100/100 [==============================] - 7s 73ms/step - loss: 0.6969 - accuracy: 0.5080 - val_loss: 0.6818 - val_accuracy: 0.5480
2020-01-29 00:01:27.273011: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
Epoch 2/30
100/100 [==============================] - 3s 32ms/step - loss: 0.6927 - accuracy: 0.5180 - val_loss: 0.6750 - val_accuracy: 0.5480
2020-01-29 00:01:30.518277: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
```