
Partial detection in images that are wider than they are tall.

Open ashishupadhyaa opened this issue 1 year ago • 9 comments

Bug description

I have tried the detection model on images that are wider than they are tall, but in all of them the model fails to detect the words on the right side of the image. For reference:

[attached screenshot: the detected boxes miss the words on the right side of a wide image]

This issue persists in most images that are wider than they are tall.

Code snippet to reproduce the bug

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
result = model(doc)

The bounding boxes in the result variable cover only part of the image.
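For reference, the geometries can be inspected directly to confirm that no detected word extends toward the right edge (a minimal sketch iterating the Document structure of the result above):

# Print each detected word with its relative geometry; with this bug,
# x_max stays well below 1.0 even for words near the right margin.
for page in result.pages:
    for block in page.blocks:
        for line in block.lines:
            for word in line.words:
                (x_min, y_min), (x_max, y_max) = word.geometry
                print(f"{word.value}: x_max={x_max:.2f}")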

Error traceback

None

Environment

python - 3.8.10
torch - 2.0.0+cu117
doctr - 0.6.0

Deep Learning backend

is_tf_available: False
is_torch_available: True

ashishupadhyaa avatar Feb 26 '24 15:02 ashishupadhyaa

Hi @ashishupadhyaa 👋, please update doctr to 0.7.0 or higher (the 0.8.0 release will be out this week) :)

felixdittrich92 avatar Feb 26 '24 17:02 felixdittrich92

> Hi @ashishupadhyaa 👋, please update doctr to 0.7.0 or higher (the 0.8.0 release will be out this week) :)

Thanks for the reply, @felixdittrich92. I checked with version 0.7.0 and the issue is still there.

ashishupadhyaa avatar Feb 27 '24 10:02 ashishupadhyaa

@ashishupadhyaa I tried your attached image with the latest dev version (0.8.0a) and it works. Could you attach the original image, maybe?

felixdittrich92 avatar Feb 27 '24 10:02 felixdittrich92

@felixdittrich92 Sorry, I won't be able to post the original image, but I can wait since the new release is coming out this week. :blush:

ashishupadhyaa avatar Feb 27 '24 11:02 ashishupadhyaa

Released :)

felixdittrich92 avatar Feb 28 '24 14:02 felixdittrich92

Hi @felixdittrich92

I have separated the detection and recognition models because I have to do some post-processing on the cropped images. I have a fine-tuned db_resnet model for detecting words, and it still gives me partial detections in version 0.8.0, while the pre-trained models provided in the repo detect everything correctly. Can you tell me whether this is because of the training, or because I am missing some post-processing when converting the detection model's output to proper bounding boxes?

I am attaching the reference code below:

import math

import cv2
import numpy as np
import torch

from doctr.models import (
    crnn_vgg16_bn, db_resnet50, recognition_predictor, detection_predictor
)

device_type = 'cuda'
det_model_path = 'db_resnet50_20231005-072017.pt'
det_model = db_resnet50(pretrained=False)
det_params = torch.load(
    det_model_path, map_location=torch.device(device_type)
)
det_model.load_state_dict(det_params)
# Wrap the fine-tuned model in a predictor; preserve_aspect_ratio pads the
# page to a square, which remove_padding below undoes.
detector = detection_predictor(
    det_model, pretrained=True, preserve_aspect_ratio=True
)

def geometry_to_bbox(geometry, page_dim):
    # Scale relative (0-1) coordinates to absolute pixel coordinates.
    len_x = page_dim[1]
    len_y = page_dim[0]
    (x_min, y_min) = geometry[0]
    (x_max, y_max) = geometry[1]
    x_min = math.floor(x_min * len_x)
    x_max = math.ceil(x_max * len_x)
    y_min = math.floor(y_min * len_y)
    y_max = math.ceil(y_max * len_y)
    return [x_min, y_min, x_max, y_max]

def get_coordinates(output, page_dim):
    text_coordinates = []
    for obj in output:
        converted_coordinates = geometry_to_bbox(
            [[obj[0], obj[1]], [obj[2], obj[3]]], page_dim
        )
        text_coordinates.append(converted_coordinates)
    return text_coordinates

def remove_padding(pages, loc_preds):
    # Undo the symmetric padding added by preserve_aspect_ratio: rescale
    # coordinates around the page centre (0.5) along the padded axis.
    rectified_preds = []
    loc_preds = [loc_preds[0]['words']]
    for page, loc_pred in zip(pages, loc_preds):
        h, w = page.shape[0], page.shape[1]
        if h > w:  # portrait page: padding was added left and right
            loc_pred[:, [0, 2]] = np.clip(
                (loc_pred[:, [0, 2]] - 0.5) * h / w + 0.5, 0, 1
            )
        elif w > h:  # landscape page: padding was added top and bottom
            loc_pred[:, [1, 3]] = np.clip(
                (loc_pred[:, [1, 3]] - 0.5) * w / h + 0.5, 0, 1
            )
        rectified_preds.append(loc_pred)
    return rectified_preds

image = cv2.imread('image.jpg')
det_res = detector([image])[0]
# Remove the extra padding from the detected normalised boxes
det_res = remove_padding([image], [det_res])
# Small empirical offsets to tighten the boxes
for res in det_res:
    res[:, 2] = res[:, 2] - 0.005
    res[:, 1] = res[:, 1] + 0.001
boxes = get_coordinates(det_res[0], image.shape[:2])
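As a quick sanity check of the unpadding math (using the remove_padding above): on a landscape page the y-axis was padded, so y values get stretched away from the centre 0.5 by a factor of w / h.

# Hypothetical check on a 500x1000 landscape page: a box spanning
# y = 0.45..0.55 in the padded square maps back to y = 0.40..0.60.
page = np.zeros((500, 1000, 3), dtype=np.uint8)
preds = {'words': np.array([[0.10, 0.45, 0.30, 0.55, 0.9]])}  # x0, y0, x1, y1, conf
out = remove_padding([page], [preds])[0]
print(out)  # x values unchanged, y values rescaled around 0.5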

ashishupadhyaa avatar Feb 29 '24 08:02 ashishupadhyaa

Any update on this issue?

ashishupadhyaa avatar Mar 21 '24 14:03 ashishupadhyaa

Hi @ashishupadhyaa :wave:,

Excuse the late response :sweat_smile: The code itself looks correct, but you should pass symmetric_pad=True to the detection_predictor instance; it is True by default in the top-level ocr_predictor.
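For example (a minimal sketch, det_model being your fine-tuned db_resnet50 instance):

detector = detection_predictor(
    det_model,
    pretrained=True,
    assume_straight_pages=True,
    preserve_aspect_ratio=True,
    symmetric_pad=True,  # pad both sides of the short axis equally
)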

Best regards, Felix

felixdittrich92 avatar Mar 22 '24 14:03 felixdittrich92

Hey @ashishupadhyaa, any updates? Has it solved your problem? :)

felixdittrich92 avatar Apr 16 '24 05:04 felixdittrich92

Hi @felixdittrich92, sorry for the late reply.

It is working with the new version. Thanks for the solution.

ashishupadhyaa avatar May 09 '24 06:05 ashishupadhyaa

Hi @felixdittrich92, sorry to bother you again, but the issue is still there. I am attaching the code as well as the file. Can you please help me out with this issue?

import os
import math
import torch
import logging

import numpy as np

from PIL import Image
from huggingface_hub import hf_hub_download
from doctr.datasets.vocabs import VOCABS
from doctr.models import (crnn_vgg16_bn, db_resnet50, recognition_predictor, detection_predictor)

logger = logging.getLogger(__name__)


class OCR:

    def __init__(self):
        device_type = 'cuda' 
        VOCABS["french"] = (
            '0123456789abcdefghijklmnopqrstuvwxyz'
            'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
            '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}'
            '~°£€¥¢฿·ª®×üı₹àâéèêëîïôùûüçÀ Â'
        )
        reco_model_path = hf_hub_download(
            repo_id="repo_id",
            filename="crnn_vgg16.pt",
            use_auth_token='token'
        )
        reco_model = crnn_vgg16_bn(pretrained=False, vocab=VOCABS["french"])
        reco_params = torch.load(
            reco_model_path, map_location=torch.device(device_type)
        )
        reco_model.load_state_dict(reco_params)
        det_model_path = hf_hub_download(
            repo_id="repo_id",
            filename="db_resnet50.pt",
            use_auth_token='token'
        )
        det_model = db_resnet50(pretrained=False)
        det_params = torch.load(
            det_model_path, map_location=torch.device(device_type)
        )
        det_model.load_state_dict(det_params)
        self.recogniser = recognition_predictor(
            reco_model,
            pretrained=True,
            pretrained_backbone=True,
            batch_size=128
        )
        self.detector = detection_predictor(
            det_model,
            pretrained=True,
            pretrained_backbone=True,
            batch_size=2,
            assume_straight_pages=True,
            preserve_aspect_ratio=True,
            symmetric_pad=True,
        )

    def geometry_to_bbox(self, geometry, page_dim):
        len_x = page_dim[1]
        len_y = page_dim[0]
        (x_min, y_min) = geometry[0]
        (x_max, y_max) = geometry[1]
        x_min = math.floor(x_min * len_x)
        x_max = math.ceil(x_max * len_x)
        y_min = math.floor(y_min * len_y)
        y_max = math.ceil(y_max * len_y)
        return [x_min, y_min, x_max, y_max]

    def get_coordinates(self, output, page_dim):
        text_coordinates = []
        for obj in output:
            converted_coordinates = self.geometry_to_bbox(
                [[obj[0], obj[1]], [obj[2], obj[3]]], page_dim
            )
            text_coordinates.append(converted_coordinates)
        return text_coordinates

    def remove_padding(self, pages, loc_preds):
        # Undo the symmetric padding added by preserve_aspect_ratio: rescale
        # coordinates around the page centre (0.5) along the padded axis.
        rectified_preds = []
        for page, loc_pred in zip(pages, loc_preds):
            h, w = page.shape[0], page.shape[1]
            if h > w:  # portrait page: padding was added left and right
                loc_pred[:, [0, 2]] = np.clip(
                    (loc_pred[:, [0, 2]] - 0.5) * h / w + 0.5, 0, 1
                )
            elif w > h:  # landscape page: padding was added top and bottom
                loc_pred[:, [1, 3]] = np.clip(
                    (loc_pred[:, [1, 3]] - 0.5) * w / h + 0.5, 0, 1
                )
            rectified_preds.append(loc_pred)
        return rectified_preds

    def full_readtext(
        self,
        image
    ):
        def sort_bbox(bbox, img_width):
            # Snap boxes into horizontal bands of ~20 px, then order them
            # left-to-right within each band to approximate reading order.
            tolerance_factor = 20
            return ((bbox[1] // tolerance_factor)
                    * tolerance_factor) * img_width + bbox[0]

        det_res = self.detector([image])[0]['words']
        det_res = self.remove_padding([image], [det_res])
        # Small empirical offsets to tighten the boxes
        for res in det_res:
            res[:, 2] = res[:, 2] - 0.005
            res[:, 1] = res[:, 1] + 0.001
        boxes = self.get_coordinates(det_res[0], image.shape[:2])
        boxes.sort(key=lambda x: sort_bbox(x, image.shape[1]))
        rec_batch = [image[box[1]:box[3], box[0]:box[2]] for box in boxes]
        rec_res = self.recogniser(rec_batch)
        words = [res[0] for res in rec_res]
        return words, boxes

and the link to the image file: https://drive.google.com/file/d/173x_6iOIyTqSzwWqF1WyoIKL3F67RCeg/view?usp=sharing
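For reference, the class is driven roughly like this (a hypothetical sketch; sample.jpg stands in for the attached file):

import numpy as np
from PIL import Image

ocr = OCR()
image = np.array(Image.open('sample.jpg').convert('RGB'))  # HxWxC uint8 array
words, boxes = ocr.full_readtext(image)
for word, box in zip(words, boxes):
    print(word, box)  # recognised text with its [x_min, y_min, x_max, y_max] pixel box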

ashishupadhyaa avatar May 12 '24 13:05 ashishupadhyaa

Note - I tried it with the pre-trained model and it works fine, but the fine-tuned model only partially detects the words. So I think the problem is with the fine-tuned model. Can you guide me on how to train the model properly, or can we do something about this without fine-tuning the model?

ashishupadhyaa avatar May 13 '24 05:05 ashishupadhyaa