Partial detection in images that are wider than they are tall
Bug description
I have tried the detection model on images that are wider than they are tall, and in all of them the model fails to detect the words on the right side of the image. For reference:
This issue persists in most images whose width exceeds their height.
Code snippet to reproduce the bug
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
result = model(doc)
The bounding boxes in the result variable cover only part of the image; the words on the right side are not detected.
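For example, walking the result object shows that words on the right side never appear (a minimal check using the standard document structure):

# Print every detected word with its relative geometry ((xmin, ymin), (xmax, ymax)).
# On wide pages, no word with a large xmin ever shows up in the output.
for page in result.pages:
    for block in page.blocks:
        for line in block.lines:
            for word in line.words:
                print(word.value, word.geometry)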
Error traceback
None
Environment
python: 3.8.10
torch: 2.0.0+cu117
doctr: 0.6.0
Deep Learning backend
is_tf_available: False
is_torch_available: True
Hi @ashishupadhyaa 👋, please update doctr to 0.7.0 or higher (the 0.8.0 release will be out this week) :)
Thanks for the reply @felixdittrich92. I have checked with version 0.7.0 and the issue is still there.
@ashishupadhyaa I tried your attached image with the latest dev version (0.8.0a) and it works. Could you maybe attach the original image?
@felixdittrich92 Sorry, I won't be able to post the original image, but I can wait for the new release since it is due this week anyway. :blush:
Released :)
Hi @felixdittrich92
I have separated the detection and recognition models because I have to do some post-processing on the cropped images. I have a fine-tuned db_resnet model for detecting words, and it gives me partial detections in the 0.8.0 version as well, while the pre-trained models provided in the repo detect everything correctly. Can you tell me whether this is because of the training, or because I am missing some post-processing when converting the detection model's output to proper bounding boxes?
I am attaching the reference code below:
import math

import cv2
import numpy as np
import torch

from doctr.models import (
    crnn_vgg16_bn, db_resnet50, recognition_predictor, detection_predictor
)

device_type = 'cuda'

# Load the fine-tuned detection weights
det_model_path = 'db_resnet50_20231005-072017.pt'
det_model = db_resnet50(pretrained=False)
det_params = torch.load(
    det_model_path, map_location=torch.device(device_type)
)
det_model.load_state_dict(det_params)
# Wrap the fine-tuned model in a detection predictor
detector = detection_predictor(det_model, pretrained=True)


def geometry_to_bbox(geometry, page_dim):
    # Convert relative (0-1) coordinates to absolute pixel coordinates
    len_x = page_dim[1]
    len_y = page_dim[0]
    (x_min, y_min) = geometry[0]
    (x_max, y_max) = geometry[1]
    x_min = math.floor(x_min * len_x)
    x_max = math.ceil(x_max * len_x)
    y_min = math.floor(y_min * len_y)
    y_max = math.ceil(y_max * len_y)
    return [x_min, y_min, x_max, y_max]


def get_coordinates(output, page_dim):
    text_coordinates = []
    for obj in output:
        converted_coordinates = geometry_to_bbox(
            [[obj[0], obj[1]], [obj[2], obj[3]]], page_dim
        )
        text_coordinates.append(converted_coordinates)
    return text_coordinates


def remove_padding(pages, loc_preds):
    # Undo the symmetric padding applied when preserve_aspect_ratio=True:
    # rescale the padded axis around the centre (0.5) and clip to [0, 1]
    rectified_preds = []
    loc_preds = [loc_preds[0]['words']]  # detector output is a dict keyed by 'words'
    for page, loc_pred in zip(pages, loc_preds):
        h, w = page.shape[0], page.shape[1]
        if h > w:  # padding was added left/right -> rescale x
            loc_pred[:, [0, 2]] = np.clip(
                (loc_pred[:, [0, 2]] - 0.5) * h / w + 0.5, 0, 1
            )
        elif w > h:  # padding was added top/bottom -> rescale y
            loc_pred[:, [1, 3]] = np.clip(
                (loc_pred[:, [1, 3]] - 0.5) * w / h + 0.5, 0, 1
            )
        rectified_preds.append(loc_pred)
    return rectified_preds


image = cv2.imread('image.jpg')
det_res = detector([image])[0]
det_res = remove_padding([image], [det_res])
# Shrink the detected normalised boxes slightly before cropping
for res in det_res:
    res[:, 2] = res[:, 2] - 0.005
    res[:, 1] = res[:, 1] + 0.001
boxes = get_coordinates(det_res[0], image.shape[:2])
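For intuition, the unpadding math can be sanity-checked on a dummy wide page (values hypothetical, reusing remove_padding from above):

# 1000x2000 (h x w) page: the predictor pads it top/bottom into a square, so
# detected y-coordinates land in roughly [0.25, 0.75] and must be stretched
# back around the centre: y' = (y - 0.5) * w / h + 0.5
dummy_page = np.zeros((1000, 2000, 3), dtype=np.uint8)
dummy_pred = {'words': np.array([[0.1, 0.30, 0.2, 0.40, 0.9]])}
print(remove_padding([dummy_page], [dummy_pred]))
# -> the y-range [0.30, 0.40] is rescaled to [0.10, 0.30]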
Any update on this issue?
Hi @ashishupadhyaa :wave:,
Excuse the late response :sweat_smile:
The code itself looks correct, but you should pass symmetric_pad=True to the detection_predictor instance; in the top-level ocr_predictor it is already True by default.
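For example (a minimal sketch, reusing the det_model you already load):

from doctr.models import detection_predictor

detector = detection_predictor(
    det_model,                    # the fine-tuned db_resnet50
    pretrained=True,
    assume_straight_pages=True,
    preserve_aspect_ratio=True,
    symmetric_pad=True,           # pad evenly on both sides, as ocr_predictor does by default
)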
Best regards, Felix
Hey @ashishupadhyaa, any updates? Has it solved your problem? :)
Hi @felixdittrich92, sorry for the late reply.
It is working with the new version. Thanks for the solution.
Hi @felixdittrich92, sorry to bother you again, but the issue is still there. I am attaching the code as well as the file. Can you please help me out with this issue?
import os
import math
import logging

import numpy as np
import torch
from PIL import Image
from huggingface_hub import hf_hub_download

from doctr.datasets.vocabs import VOCABS
from doctr.models import (
    crnn_vgg16_bn, db_resnet50, recognition_predictor, detection_predictor
)

logger = logging.getLogger(__name__)


class OCR:
    def __init__(self):
        device_type = 'cuda'
        # Override the vocab to match the fine-tuned recognition model
        VOCABS["french"] = (
            '0123456789abcdefghijklmnopqrstuvwxyz'
            'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
            '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}'
            '~°£€¥¢฿·ª®×üı₹àâéèêëîïôùûüçÀ Â'
        )
        # Fetch and load the fine-tuned recognition weights
        reco_model_path = hf_hub_download(
            repo_id="repo_id",
            filename="crnn_vgg16.pt",
            use_auth_token='token'
        )
        reco_model = crnn_vgg16_bn(pretrained=False, vocab=VOCABS["french"])
        reco_params = torch.load(
            reco_model_path, map_location=torch.device(device_type)
        )
        reco_model.load_state_dict(reco_params)
        # Fetch and load the fine-tuned detection weights
        det_model_path = hf_hub_download(
            repo_id="repo_id",
            filename="db_resnet50.pt",
            use_auth_token='token'
        )
        det_model = db_resnet50(pretrained=False)
        det_params = torch.load(
            det_model_path, map_location=torch.device(device_type)
        )
        det_model.load_state_dict(det_params)
        self.recogniser = recognition_predictor(
            reco_model,
            pretrained=True,
            pretrained_backbone=True,
            batch_size=128
        )
        self.detector = detection_predictor(
            det_model,
            pretrained=True,
            pretrained_backbone=True,
            batch_size=2,
            assume_straight_pages=True,
            preserve_aspect_ratio=True,
            symmetric_pad=True,
        )

    def geometry_to_bbox(self, geometry, page_dim):
        # Convert relative (0-1) coordinates to absolute pixel coordinates
        len_x = page_dim[1]
        len_y = page_dim[0]
        (x_min, y_min) = geometry[0]
        (x_max, y_max) = geometry[1]
        x_min = math.floor(x_min * len_x)
        x_max = math.ceil(x_max * len_x)
        y_min = math.floor(y_min * len_y)
        y_max = math.ceil(y_max * len_y)
        return [x_min, y_min, x_max, y_max]

    def get_coordinates(self, output, page_dim):
        text_coordinates = []
        for obj in output:
            converted_coordinates = self.geometry_to_bbox(
                [[obj[0], obj[1]], [obj[2], obj[3]]], page_dim
            )
            text_coordinates.append(converted_coordinates)
        return text_coordinates

    def remove_padding(self, pages, loc_preds):
        # Undo the symmetric padding applied when preserve_aspect_ratio=True
        rectified_preds = []
        for page, loc_pred in zip(pages, loc_preds):
            h, w = page.shape[0], page.shape[1]
            if h > w:  # padding was added left/right -> rescale x
                loc_pred[:, [0, 2]] = np.clip(
                    (loc_pred[:, [0, 2]] - 0.5) * h / w + 0.5, 0, 1
                )
            elif w > h:  # padding was added top/bottom -> rescale y
                loc_pred[:, [1, 3]] = np.clip(
                    (loc_pred[:, [1, 3]] - 0.5) * w / h + 0.5, 0, 1
                )
            rectified_preds.append(loc_pred)
        return rectified_preds

    def full_readtext(self, image):
        def sort_bbox(bbox, img_width):
            # Sort boxes roughly top-to-bottom (in 20px bands), then left-to-right
            tolerance_factor = 20
            return ((bbox[1] // tolerance_factor)
                    * tolerance_factor) * img_width + bbox[0]

        det_res = self.detector([image])[0]['words']
        det_res = self.remove_padding([image], [det_res])
        # Shrink the detected normalised boxes slightly before cropping
        for res in det_res:
            res[:, 2] = res[:, 2] - 0.005
            res[:, 1] = res[:, 1] + 0.001
        boxes = self.get_coordinates(det_res[0], image.shape[:2])
        boxes.sort(key=lambda x: sort_bbox(x, image.shape[1]))
        rec_batch = [image[box[1]:box[3], box[0]:box[2]] for box in boxes]
        rec_res = self.recogniser(rec_batch)
        words = [res[0] for res in rec_res]  # each entry is (text, confidence)
        return words, boxes
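For reference, this is how I call the class (image path hypothetical):

import cv2

ocr = OCR()
image = cv2.imread('image.jpg')  # hypothetical path; any HxWx3 numpy page works
words, boxes = ocr.full_readtext(image)
for word, box in zip(words, boxes):
    print(word, box)  # recognised text and its [x_min, y_min, x_max, y_max] pixel box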
And here is the link to the image file: https://drive.google.com/file/d/173x_6iOIyTqSzwWqF1WyoIKL3F67RCeg/view?usp=sharing
Note: I tried it with the pre-trained model and it works fine, but with the fine-tuned model it only partially detects the words. So I think the problem lies with the fine-tuned model. Can you guide me on how to train the model properly, or can we do something about this without fine-tuning the model?
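One way to isolate whether the weights or the predictor wrapping are at fault is to run the pre-trained and fine-tuned weights through identically configured predictors (a hypothetical debugging sketch; the checkpoint path is a placeholder):

import cv2
import torch
from doctr.models import db_resnet50, detection_predictor

image = cv2.imread('image.jpg')  # hypothetical path

# Fine-tuned weights, loaded as in __init__ above (path hypothetical)
det_model = db_resnet50(pretrained=False)
det_model.load_state_dict(torch.load('db_resnet50.pt', map_location='cpu'))

common = dict(assume_straight_pages=True, preserve_aspect_ratio=True, symmetric_pad=True)
pretrained = detection_predictor('db_resnet50', pretrained=True, **common)
finetuned = detection_predictor(det_model, pretrained=True, **common)

# If only the fine-tuned run misses the right-hand words, the gap is in the
# weights (e.g. training data aspect ratios), not in the post-processing.
print(len(pretrained([image])[0]['words']), len(finetuned([image])[0]['words']))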