Running model on PDFs and Generating tokens for words.json from PDF

Open Nikhilsonawane07 opened this issue 1 year ago • 4 comments

I am generating tokens for the table detection model from a PDF using the following script.

[screenshot: token-generation script]

I then convert the PDF to images. However, I get an error when running the inference pipeline.

[screenshot: error message]

Please help with this issue. Also, if anyone can share their script for running the model on PDFs, that would be great! Thanks in advance!

Nikhilsonawane07 avatar Sep 20 '23 04:09 Nikhilsonawane07

I ran into a similar problem and fixed it. In my case, it was because Rect() expects four parameters, Rect(x0, y0, x1, y1), where (x0, y0) is the top-left corner and (x1, y1) is the bottom-right corner in image coordinates (y increases downward).

I'm using easyocr. To represent a bounding box, it returns a list of four coordinate pairs, so I had to pick out the two relevant corners.
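As a rough sketch of that corner selection (not part of the original comment): the hypothetical helper `quad_to_rect` below takes the min and max over all four points, so it does not rely on the order of the corners and still yields a valid box when the text is slightly rotated. It assumes each point is an (x, y) pair in image pixel coordinates.

```python
from typing import List, Sequence


def quad_to_rect(quad: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a four-point box into [x0, y0, x1, y1].

    (x0, y0) is the top-left corner and (x1, y1) the bottom-right corner,
    with y increasing downward as in image coordinates.
    """
    xs = [float(point[0]) for point in quad]
    ys = [float(point[1]) for point in quad]
    return [min(xs), min(ys), max(xs), max(ys)]


# Made-up example box, listed as top-left, top-right, bottom-right, bottom-left
quad = [[10, 20], [110, 22], [112, 48], [12, 46]]
print(quad_to_rect(quad))
```

For an axis-aligned box this gives the same result as picking the first and third points directly.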

I also ran into an issue serializing int64s. Here's my OCR code:

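import json
import easyocr

reader = easyocr.Reader(['en'])
# Each result is (bbox, text, confidence); bbox is a list of four [x, y] corners
result = reader.readtext('path/to/image.jpg')
words = []

for bbox_raw, text, _confidence in result:
    # Keep the first and third corners: top-left and bottom-right
    bbox = [bbox_raw[0][0], bbox_raw[0][1], bbox_raw[2][0], bbox_raw[2][1]]
    words.append({"text": text, "bbox": bbox})

with open("path/to/image_words.json", "w") as file:
    # default=int converts values json cannot encode (e.g. NumPy int64) to int
    json.dump(words, file, default=int)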
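One caveat with `default=int`: it is applied to every value that `json` cannot encode, so a non-native float type would be truncated to an integer. A slightly more careful fallback might look like the sketch below. The name `to_builtin` is hypothetical, and the sketch assumes the coordinates are numeric types registered with Python's `numbers` ABCs (which, to my knowledge, NumPy scalars are). `Fraction` is used only as a stand-in so the example runs without NumPy.

```python
import json
import numbers
from fractions import Fraction


def to_builtin(value):
    """Fallback for json.dump(s); only called for values json cannot encode."""
    if isinstance(value, numbers.Integral):
        return int(value)
    if isinstance(value, numbers.Real):
        return float(value)  # keeps the fractional part, unlike default=int
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# Fraction is not JSON-serializable, so it is routed through to_builtin
token = {"text": "Total", "bbox": [Fraction(21, 2), 20, 112, 48]}
print(json.dumps(token, default=to_builtin))
```

It can be passed as `default=to_builtin` in place of `default=int`.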

I ran as follows:

python inference.py --image_dir ../path/to/img/ --words_dir ../path/to/words/ --out_dir ../results --structure_config_path ./structure_config.json --structure_model_path ./TATR-v1.1-Pub-msft.pth --mode extract --csv --detection_config_path ./detection_config.json --detection_model_path ./pubtables1m_detection_detr_r18.pth --visualize

aostiles avatar Oct 01 '23 07:10 aostiles

I would advise using pdftools, which is available in R. The library can also be used from Python. pdftools is much more accurate when it comes to PDF manipulation.

linkstatic12 avatar Oct 06 '23 06:10 linkstatic12

Hey, thanks for the reply, but I am looking to read text directly from the PDF, not from images.

Nikhilsonawane07 avatar Oct 06 '23 07:10 Nikhilsonawane07

You can convert the PDF pages to images.

linkstatic12 avatar Oct 06 '23 09:10 linkstatic12
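For reading the words straight from the PDF rather than through OCR, one possible approach is sketched below using PyMuPDF (`fitz`). This is an assumption-laden sketch, not the script from the original post: the function names `words_to_tokens` and `export_pdf` and the output naming (`<stem>_<page>.jpg` next to `<stem>_<page>_words.json`, mirroring the `image.jpg` / `image_words.json` pair used earlier in the thread) are hypothetical. The key step is scaling the word boxes from PDF points (72 per inch) by the same zoom factor used to render the page image, so the tokens line up with the pixels. Pages with rotation or an unusual crop box may need extra handling.

```python
import json
from pathlib import Path
from typing import Iterable, List, Sequence


def words_to_tokens(words: Iterable[Sequence], zoom: float) -> List[dict]:
    """Turn PyMuPDF-style word tuples into {"text", "bbox"} tokens.

    Each tuple is assumed to start with (x0, y0, x1, y1, text, ...), with
    coordinates in PDF points and the origin at the top-left of the page.
    """
    tokens = []
    for word in words:
        x0, y0, x1, y1, text = word[:5]
        tokens.append({
            "text": text,
            "bbox": [x0 * zoom, y0 * zoom, x1 * zoom, y1 * zoom],
        })
    return tokens


def export_pdf(pdf_path: str, image_dir: str, words_dir: str, dpi: int = 144) -> None:
    """Write one image and one words JSON file per page of the PDF."""
    import fitz  # PyMuPDF; imported here so words_to_tokens works without it

    zoom = dpi / 72  # PDF points are 1/72 inch
    stem = Path(pdf_path).stem
    with fitz.open(pdf_path) as doc:
        for page_no, page in enumerate(doc):
            # Render the page at the chosen resolution
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            # JPEG output may need a recent PyMuPDF; switch to .png otherwise
            pix.save(str(Path(image_dir) / f"{stem}_{page_no}.jpg"))
            # Extract the words and scale them to the rendered image
            tokens = words_to_tokens(page.get_text("words"), zoom)
            with open(Path(words_dir) / f"{stem}_{page_no}_words.json", "w") as f:
                json.dump(tokens, f)


# Example call (paths are placeholders):
# export_pdf("path/to/file.pdf", "path/to/img", "path/to/words")
```

The resulting directories can then be passed as `--image_dir` and `--words_dir` in the command shown earlier in the thread.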