Running OCR with a tensor produces different (and wrong) results compared to using a numpy array
Bug description
First off, thanks for the great software. To provide a little context: we're trying to build an end-to-end system consisting of a (sort of) denoiser followed by doctr, and to fine-tune only the denoiser so that it produces more readable results, as measured by the loss of doctr's outputs.
Thus, we would like to use tensors as inputs to the doctr modules. However, when using tensors the results are very different from (and wrong compared to) those obtained with numpy arrays. The code below shows a simple example, which has been tested on Google Colab.
Code snippet to reproduce the bug
!pip3 install -U pip
# to avoid the "RuntimeError: Given input size: (128x1x16). Calculated output size: (128x0x8). Output size is too small" bug on Colab
# see https://github.com/mindee/doctr/discussions/1884
!pip3 uninstall -y tensorflow
# now yes, install doctr
!pip3 install "python-doctr[torch,viz]"
import cv2
import numpy as np
import torch
from doctr.models import ocr_predictor
# Function to generate a text image
def generate_text_image(text="HELLO", size=(128, 128)):
img = np.ones(size, dtype=np.uint8) * 255 # White background
font = cv2.FONT_HERSHEY_SIMPLEX
font_scale = 1
thickness = 2
text_size = cv2.getTextSize(text, font, font_scale, thickness)[0]
text_x = (size[1] - text_size[0]) // 2
text_y = (size[0] + text_size[1]) // 2
cv2.putText(img, text, (text_x, text_y), font, font_scale, (0,), thickness)
return img
# Generate image
original_image = generate_text_image(text="HELLO how are you", size=(600, 600))
# Convert to tensor
input_tensor = torch.tensor(original_image).unsqueeze(0).unsqueeze(0).repeat(1, 3, 1, 1)  # Shape: [1, 3, H, W]
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load OCR model
ocr_model = ocr_predictor(pretrained=True).to(device)
# apply ocr_model to the tensor (wrong: detects only a few "-" characters)
print(ocr_model(input_tensor))
# apply ocr_model to the numpy array (works: detects all the words)
threech_original_image_batch = np.expand_dims(np.stack((original_image,) * 3, axis=-1), axis=0)  # Shape: [1, H, W, 3]
print(ocr_model(threech_original_image_batch))
Error traceback
When run on the tensor we get the following (wrong) output:
Document(
(pages): [Page(
dimensions=torch.Size([600, 600])
(blocks): [Block(
(lines): [
Line(
(words): [Word(value='-', confidence=1.0)]
),
Line(
(words): [
Word(value='-', confidence=1.0),
Word(value='-', confidence=1.0),
Word(value='-', confidence=1.0),
]
),
]
(artefacts): []
)]
)]
)
and when run on the numpy array we get the correct output:
Document(
(pages): [Page(
dimensions=(600, 600)
(blocks): [Block(
(lines): [
Line(
(words): [Word(value='HELLO', confidence=1.0)]
),
Line(
(words): [
Word(value='how', confidence=0.58),
Word(value='are', confidence=0.97),
Word(value='you', confidence=0.88),
]
),
]
(artefacts): []
)]
)]
)
Environment
Collecting environment information...
DocTR version: v0.11.0
TensorFlow version: N/A
PyTorch version: 2.6.0+cu124 (torchvision 0.21.0+cu124)
OpenCV version: 4.11.0
OS: Ubuntu 22.04.4 LTS
Python version: 3.11.11
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): No
CUDA runtime version: 12.5.82
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.2.1
Deep Learning backend
is_tf_available: False
is_torch_available: True
Hi @git-artes 👋,
Thanks for reporting this and for providing the snippet to reproduce it! 👍
Actually, I don't think you would benefit much from tensor inputs here: at least after the text detection model we have to switch to NumPy anyway, due to functionality that isn't yet fully compatible with running in Torch. This starts as early as the detection post-processing step, which relies heavily on OpenCV.
That said, I agree that the type hint is misleading - and we should fix this.
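In the meantime, the workaround is simply to convert back to numpy before calling the predictor. A minimal sketch, continuing from the reproduction snippet above (to_predictor_input is just an illustrative helper, and the scaling branch assumes float pixel values in [0, 1]):
import torch
def to_predictor_input(t):
    # convert a [1, 3, H, W] tensor into the [1, H, W, 3] uint8 numpy
    # batch the predictor expects
    t = t.detach().cpu()
    if t.dtype != torch.uint8:
        t = (t.clamp(0, 1) * 255).to(torch.uint8)  # assumes floats in [0, 1]
    return t.permute(0, 2, 3, 1).numpy()
print(ocr_model(to_predictor_input(input_tensor)))  # matches the numpy result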
Sounds pretty cool! If the denoiser is something you’d be able to share, we have a place for such modules:
https://mindee.github.io/doctr/using_doctr/using_contrib_modules.html https://github.com/mindee/doctr/tree/main/doctr/contrib
Thanks. Regarding our objective, we've switched to adapting https://github.com/mindee/doctr/blob/main/references/recognition/train_pytorch.py to our purposes, by prepending the denoiser and adapting the training loss, roughly as sketched below. Do you think that's the best way to go?
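Roughly, the loop we have in mind looks like this (just a sketch: Denoiser and dataloader are placeholders for our own module and data, and we assume the crops are preprocessed as in the reference script):
import torch
from doctr.models import crnn_vgg16_bn
# frozen recognition model; only the denoiser gets trained
reco_model = crnn_vgg16_bn(pretrained=True).eval()
for p in reco_model.parameters():
    p.requires_grad_(False)
denoiser = Denoiser()  # placeholder for our own module
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
for images, targets in dataloader:  # noisy crops + target strings
    denoised = denoiser(images)  # everything stays a tensor, so gradients flow
    # as in references/recognition/train_pytorch.py, doctr recognition models
    # return a dict with a "loss" entry when called with targets
    loss = reco_model(denoised, targets)["loss"]
    optimizer.zero_grad()
    loss.backward()  # updates only the denoiser
    optimizer.step()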
BTW, the "denoiser" is actually deep-tempest, our system for eavesdropping on HDMI. You can check it out at https://github.com/emidan19/deep-tempest/. I don't know if it fits the contrib modules, since it's somewhat niche, but if we get this combination working we'll let you know.
Hi @git-artes 👋,
Yeah, I see - I worked on something similar (doc_scanner) years ago, but with scanned docs as input :)
So is it actually the recognition accuracy that falls short, and not the text detection?
Hi,
In fact we have not verified which of the sub-modules is the performance bottleneck, although it would be interesting to check this for further insights. However, our objective is to recover images that are "easier to read" for the operator/user, so our idea is to add the recognition loss to the total loss and fine-tune only our denoiser module. The OCR part is left as-is, since the actual reading will be performed by the human operator.
BTW, sorry for the delay in answering; I thought I'd pressed "comment" but I hadn't...
best
#1967
We decided to go with numpy as the consistent input for each predictor instance - the typings are fixed. For direct model calls the input is still torch tensors, unchanged:
predictors - HWC (numpy)
models - CHW (torch)
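In other words (a minimal sketch; the blank test page and input size are only for illustration):
import numpy as np
import torch
from doctr.models import ocr_predictor, db_resnet50
# predictor API: HWC uint8 numpy, one array per page
predictor = ocr_predictor(pretrained=True)
page = np.full((600, 600, 3), 255, dtype=np.uint8)
result = predictor([page])
# direct model call: batched CHW float torch tensors
det_model = db_resnet50(pretrained=True).eval()
batch = torch.rand(1, 3, 1024, 1024)  # [N, C, H, W]
with torch.no_grad():
    out = det_model(batch, return_preds=True)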
I encountered the same issue when trying to pass tensors as inputs with the PyTorch backend.
I am trying to use tensors to improve performance, assuming tensors on a GPU are faster than numpy on the CPU. Is there any other way to take advantage of the GPU for the input? Thanks!
Hi @yinleon 👋
We decided to keep numpy as the fixed input for the predictor instances - otherwise we would need to switch under the hood between numpy/torch and CPU/GPU, because not all operations can be translated to torch (for example, the OpenCV operations).
For performance improvements you can try to compile the models:
https://mindee.github.io/doctr/using_doctr/using_model_export.html#compiling-your-models-pytorch-only
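For example, a minimal sketch along the lines of the linked docs (the model choices here are arbitrary):
import torch
from doctr.models import ocr_predictor, fast_base, crnn_vgg16_bn
# compile the detection and recognition models, then wrap them in a
# predictor; the first inference call triggers compilation
det_model = torch.compile(fast_base(pretrained=True).eval())
reco_model = torch.compile(crnn_vgg16_bn(pretrained=True).eval())
predictor = ocr_predictor(det_model, reco_model)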
Or take a look at OnnxTR:
https://github.com/felixdittrich92/OnnxTR - which provides much more options to improve depending on your hardware
Hope this helps :)