
Is this the correct way to generate tokens for a new example?

Open lionely opened this issue 1 year ago • 6 comments

I am trying to generate tokens according to Inference.MD.

Method 1: Taking inspiration from this issue, I used CRAFT to get bounding boxes for the text. For each bounding box, I crop that region of the image, run text recognition on it, and fill a dictionary with the bounding box and the extracted text.

Now, with this list of dictionaries, I pass the image and the token list to TATR. But I end up with empty cells.

Method 2: I thought the text bounding boxes might not match TATR's output, so I tried running TATR twice: first to get bounding boxes, crop the image, and run text recognition to build the token dictionaries; second, to pass the token list with the image back to TATR and hopefully generate a CSV at the end. But again my output was empty.

Is my logic sound?

Code for method 2 (similar to method 1):

import torch

# Run TATR over the full image to detect box regions
encoding = feature_extractor(image, return_tensors="pt")
with torch.no_grad():
    outputs = tatr_model(**encoding)
target_sizes = [image.size[::-1]]  # (height, width)
results = feature_extractor.post_process_object_detection(
    outputs, threshold=0.6, target_sizes=target_sizes
)[0]

# Crop each detected box and run TrOCR on it to build the token list
tokens = []
for box in results['boxes']:
    bbox = [int(v) for v in box.numpy()]  # PIL's crop wants a 4-tuple/list of ints
    crop_image = resized_image.crop(box=bbox)
    pixel_values = processor(crop_image, return_tensors="pt").pixel_values
    generated_ids = trocr_model.generate(pixel_values)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    tokens.append({'bbox': bbox, 'text': generated_text})
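One thing worth checking: the boxes returned by `post_process_object_detection` are scaled to whatever size you pass in `target_sizes` (here the original `image`), so cropping a differently sized `resized_image` with them would grab the wrong regions and could explain the empty cells. A minimal sketch of rescaling a box between two image sizes (the helper name `scale_bbox` is mine, not from TATR):

```python
def scale_bbox(bbox, src_size, dst_size):
    """Rescale an (x0, y0, x1, y1) box from src (w, h) to dst (w, h)."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)

# e.g. a box detected on a 1000x800 image, mapped onto a 500x400 resize
print(scale_bbox((100, 80, 300, 240), (1000, 800), (500, 400)))
# → (50.0, 40.0, 150.0, 120.0)
```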

P.S.

Thank you for this very useful library, really looking forward to the progress of this research.

lionely avatar Jun 26 '23 05:06 lionely

@lionely I think you should either use a PDF package to extract the bounding boxes for text-based PDFs, or use an OCR engine for image-based PDFs. EasyOCR is a great OCR engine that gets you the bounding boxes with the text really easily.

thiagodma avatar Jun 28 '23 14:06 thiagodma

@thiagodma Thank you very much for your response. According to this issue reply, we need additional information beyond the text and bounding boxes to make a CSV. Have you been successful with just the text and bounding boxes? Thank you again.

lionely avatar Jun 29 '23 16:06 lionely

I'm using TATR to get HTML, but I think you need the same additional info for CSV. Here's my code to get it working:

import easyocr

reader = easyocr.Reader(['en'])

# img is a numpy array for your RGB image
ocr_result = reader.readtext(img, width_ths=.03)

tokens = []
for i, res in enumerate(ocr_result):
    # res[0] is the 4-corner polygon, res[1] the recognized text;
    # corners 0 and 2 are the top-left and bottom-right points
    tokens.append({
        "bbox": list(map(int, [res[0][0][0], res[0][0][1], res[0][2][0], res[0][2][1]])),
        "text": res[1],
        "flags": 0,
        "span_num": i,
        "line_num": 0,
        "block_num": 0
    })

thiagodma avatar Jun 29 '23 16:06 thiagodma

@thiagodma Can you please share the script to extract the bounding boxes for text-based PDFs ?

Nikhilsonawane07 avatar Sep 20 '23 04:09 Nikhilsonawane07

@Nikhilsonawane07 Here's a simplified version of my code.

Note: the 'dpi' parameter is the dpi you used to convert your PDF page to an image. You have to pass it here so that the bounding boxes are proportional to the size of the image (which depends on the dpi).

import fitz
from pathlib import Path

flags = fitz.TEXT_INHIBIT_SPACES & ~fitz.TEXT_PRESERVE_IMAGES
dpi = 100

pdf_path = Path("<< path to your pdf >>")
pdf = fitz.open(stream=pdf_path.read_bytes(), filetype="pdf")
page = pdf[0]  # gets the first page of the PDF

words = page.get_text(option="words", flags=flags)

# converting 'words' to a list of dicts instead of a list of tuples
tokens = []
for word_meta in words:
    tokens.append({
        # times (dpi / 72) is to make sure the bounding boxes are in the same scale as the generated image
        "bbox": list(map(lambda x: int(x * (dpi / 72)), [word_meta[0], word_meta[1], word_meta[2], word_meta[3]])),
        "text": word_meta[4],
        "flags": 0,
        "block_num": word_meta[5],
        "line_num": word_meta[6],
        "span_num": word_meta[7]
    })
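For context on the `dpi / 72` factor above: PDF text coordinates come back in points (1/72 inch), so a box must be scaled by `dpi / 72` to land on the rendered image's pixel grid (in PyMuPDF, `page.get_pixmap(dpi=dpi)` renders the page at a chosen dpi). A standalone sketch of just that conversion (the helper name is mine):

```python
def points_to_pixels(bbox_pts, dpi):
    """Map a PDF-point (x0, y0, x1, y1) box to pixel coordinates at the given dpi."""
    return [int(v * dpi / 72) for v in bbox_pts]

# a word box on a 612x792 pt (US Letter) page, rendered at 100 dpi
print(points_to_pixels([72, 144, 216, 160], dpi=100))
# → [100, 200, 300, 222]
```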

thiagodma avatar Sep 20 '23 11:09 thiagodma

Hey @thiagodma, thanks! It's working.

Nikhilsonawane07 avatar Oct 06 '23 07:10 Nikhilsonawane07