Can't run train_tensorflow.py on Google Colab GPU - only works on CPU
Bug description
Using "doctr/references/recognition/train_tensorflow.py" on Google Colab creates an error when I use GPU acceleration. If I only use the CPU everything works just fine.
Code snippet to reproduce the bug
Open Google Colab: https://colab.research.google.com
Add the code to the colab document
!git clone https://github.com/mindee/doctr.git
!pip install -e doctr/.
!pip install tf2onnx
# Contains data/train and data/val folders, each with a file "labels.json" and folder "images"
!curl -LO https://www.myserver.com/100k_files.zip
!unzip -qq 100k_files.zip
!python /content/doctr/references/recognition/train_tensorflow.py crnn_vgg16_bn --min-chars 5 --max-chars 5 --train_path data/train --val_path data/val --epochs 100
Change settings (menu bar):
Runtime -> Change runtime type:
- Python 3
- Hardware accelerator: CPU
--> Code runs without an issue
Change settings (menu bar):
"Runtime" -> "Change runtime type":
- Python 3
- Hardware accelerator: T4 GPU
--> Creates the error below (see traceback)
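A quick, optional sanity check (not part of the original report) that the T4 runtime is actually visible to TensorFlow, just to rule out a runtime-selection issue:
# Sanity check: confirm the Colab GPU runtime is exposed to TensorFlow before training.
import tensorflow as tf
print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))  # expect one PhysicalDevice entry for GPU:0 on the T4 runtime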
Error traceback
Traceback (most recent call last):
File "/content/doctr/references/recognition/train_tensorflow.py", line 448, in <module>
main(args)
File "/content/doctr/references/recognition/train_tensorflow.py", line 346, in main
fit_one_epoch(model, train_loader, batch_transforms, optimizer, args.amp)
File "/content/doctr/references/recognition/train_tensorflow.py", line 91, in fit_one_epoch
for images, targets in pbar:
File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/content/doctr/doctr/datasets/loader.py", line 95, in __next__
samples = list(multithread_exec(self.dataset.__getitem__, indices, threads=self.num_workers))
File "/content/doctr/doctr/utils/multithreading.py", line 49, in multithread_exec
results = map(lambda x: x, tp.map(func, seq)) # noqa: C417
File "/usr/lib/python3.10/multiprocessing/pool.py", line 367, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
raise self._value
File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/content/doctr/doctr/datasets/datasets/base.py", line 56, in __getitem__
img = self.img_transforms(img)
File "/content/doctr/doctr/transforms/modules/tensorflow.py", line 57, in __call__
x = t(x)
File "/content/doctr/doctr/transforms/modules/tensorflow.py", line 111, in __call__
img = tf.image.resize(img, self.wanted_size, self.method, self.preserve_aspect_ratio, self.antialias)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py", line 5883, in raise_from_not_ok_status
raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__StridedSlice_device_/job:localhost/replica:0/task:0/device:GPU:0}} Index out of range using input dim 1; input has only 1 dims [Op:StridedSlice] name: strided_slice/
Sometimes I also get:
Traceback (most recent call last):
File "/content/doctr/references/recognition/train_tensorflow.py", line 448, in <module>
main(args)
File "/content/doctr/references/recognition/train_tensorflow.py", line 346, in main
fit_one_epoch(model, train_loader, batch_transforms, optimizer, args.amp)
File "/content/doctr/references/recognition/train_tensorflow.py", line 91, in fit_one_epoch
for images, targets in pbar:
File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/content/doctr/doctr/datasets/loader.py", line 95, in __next__
samples = list(multithread_exec(self.dataset.__getitem__, indices, threads=self.num_workers))
File "/content/doctr/doctr/utils/multithreading.py", line 49, in multithread_exec
results = map(lambda x: x, tp.map(func, seq)) # noqa: C417
File "/usr/lib/python3.10/multiprocessing/pool.py", line 367, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
raise self._value
File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/content/doctr/doctr/datasets/datasets/base.py", line 56, in __getitem__
img = self.img_transforms(img)
File "/content/doctr/doctr/transforms/modules/tensorflow.py", line 57, in __call__
x = t(x)
File "/content/doctr/doctr/transforms/modules/base.py", line 216, in __call__
return self.transform(img) if target is None else self.transform(img, target) # type: ignore[call-arg]
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/content/doctr/doctr/transforms/modules/tensorflow.py", line 401, in __call__
_gaussian_filter(
File "/content/doctr/doctr/transforms/functional/tensorflow.py", line 225, in _gaussian_filter
[(width - 1) // 2, width - 1 - (width - 1) // 2],
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__FloorDiv_device_/job:localhost/replica:0/task:0/device:GPU:0}} Integer division by zero [Op:FloorDiv] name:
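For what it's worth, the first error looks like the generic message TensorFlow raises when a rank-1 tensor is sliced along a dimension it does not have, which would mean the image tensor reaching tf.image.resize in the GPU run is not the expected H x W x C array. A minimal sketch that reproduces the same error class, purely for illustration and not doctr's actual code path:
# Illustration only: mimics the error class, not doctr's internals.
import tensorflow as tf
x = tf.constant([1.0, 2.0, 3.0])  # rank-1 tensor
x[:, 0]  # InvalidArgumentError: Index out of range using input dim 1; input has only 1 dims [Op:StridedSlice]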
Environment
DocTR version: 0.9.0a0
TensorFlow version: 2.15.0
PyTorch version: 2.2.1+cu121 (torchvision 0.17.1+cu121)
OpenCV version: 4.8.0
OS: Ubuntu 22.04.3 LTS
Python version: 3.10.12
Is CUDA available (TensorFlow): Yes
Is CUDA available (PyTorch): Yes
CUDA runtime version: 12.2.140
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
Hi @fosple 👋 That's already a known issue, we are on it :) CC @odulcy-mindee
As a workaround, you can disable multiprocessing --> https://mindee.github.io/doctr/using_doctr/running_on_aws.html
This should fix the issue
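In a Colab cell, that boils down to setting an environment variable before launching the training script (sketch only; see the linked page for the exact variable name):
# Workaround sketch: disable multiprocessing in doctr's data loading.
# Assumption: the variable name is the one from the linked running_on_aws page; verify it there.
import os
os.environ["DOCTR_MULTIPROCESSING_DISABLE"] = "TRUE"  # inherited by subsequent !python ... cells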
Hi @fosple :wave: did it solve your problem? :)
@felixdittrich92 Thanks for the super fast reply :) In the end I used the PyTorch version, as it worked out of the box for me. But I can try in the next few days whether your suggestion solves this specific problem.
@fosple Great, so I think we can close this :)
Feel free to reopen if anything doesn't work :+1: