table-transformer Running model on PDFs and Generating tokens for words.json from PDF

I am generating tokens for the table detection model from a PDF using the following script.

I then convert the PDF to images. However, I get an error while running the inference pipeline.

Please help with this issue. Also, if anyone can share their script for running the model on PDFs, that would be great! Thanks in advance!

Sep 20 '23 04:09 Nikhilsonawane07

I ran into a similar problem and fixed it. In my case, it was because Rect() expects four params: Rect(x0, y0, x1, y1), where (x0, y0) is the top-left corner and (x1, y1) is the bottom-right corner in image coordinates (y grows downward).

I'm using easyocr. To represent a bounding box, it returns a list of four coordinate-pairs. So I had to grab the two relevant coordinates.

I also ran into an issue serializing int64s. Here's my OCR code:

import json
import easyocr

reader = easyocr.Reader(['en'])
result = reader.readtext('path/to/image.jpg')
words = []

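# Each result is (corner points, text, confidence); corners start at the top-left and run clockwise.
for bbox_raw, text, _confidence in result:
    # Keep only the top-left and bottom-right corners: [x0, y0, x1, y1].
    bbox = [bbox_raw[0][0], bbox_raw[0][1], bbox_raw[2][0], bbox_raw[2][1]]
    words.append({"text": text, "bbox": bbox})

with open("path/to/image_words.json", "w") as file:
    # default=int converts numpy int64 values, which json cannot serialize on its own.
    json.dump(words, file, default=int)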

I ran as follows:

python inference.py --image_dir ../path/to/img/ --words_dir ../path/to/words/ --out_dir ../results --structure_config_path ./structure_config.json --structure_model_path ./TATR-v1.1-Pub-msft.pth --mode extract --csv --detection_config_path ./detection_config.json --detection_model_path ./pubtables1m_detection_detr_r18.pth --visualize
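
If the OCR boxes are slightly rotated, corners 0 and 2 may not be the true extremes of the box. A minimal sketch of a more defensive conversion, taking the min and max over all four corners; the helper names `quad_to_bbox` and `ocr_result_to_words` are made up for illustration and are not part of easyocr or table-transformer:

```python
def quad_to_bbox(quad):
    """Collapse four (x, y) corner points into [x0, y0, x1, y1]."""
    # float() also turns numpy integer types into plain JSON-serializable numbers.
    xs = [float(x) for x, _ in quad]
    ys = [float(y) for _, y in quad]
    return [min(xs), min(ys), max(xs), max(ys)]


def ocr_result_to_words(result):
    """Turn (quad, text, confidence) tuples into dicts with "text" and "bbox"."""
    return [{"text": text, "bbox": quad_to_bbox(quad)} for quad, text, _conf in result]


# Hand-written sample in the shape of one OCR result, not real OCR output.
sample = [([[12, 4], [48, 6], [47, 22], [11, 20]], "Total", 0.98)]
words = ocr_result_to_words(sample)
```

The resulting list can be written with a plain `json.dump(words, file)`, since every coordinate is already a Python float.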

Oct 01 '23 07:10 aostiles

I would advise using pdftools, which is available in R. This library can also be used from Python. pdftools is much more accurate when it comes to PDF manipulation.

Oct 06 '23 06:10 linkstatic12

Hey, thanks for the reply, but I am looking to read the text directly from the PDF, not from images.
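
For PDFs that carry a text layer, the words can in principle be read without OCR. A minimal sketch, assuming PyMuPDF (`fitz`) is installed and that the word boxes must be scaled from PDF points (72 per inch) to the pixel size of the rendered page image. The function names, the `dpi` default, and the choice to emit only "text" and "bbox" keys are assumptions for illustration:

```python
import json


def pdf_words_to_tokens(pdf_words, zoom=1.0):
    """Convert PyMuPDF word tuples into token dicts, scaling points to pixels."""
    tokens = []
    # page.get_text("words") yields (x0, y0, x1, y1, word, block_no, line_no, word_no).
    for x0, y0, x1, y1, text, _block, _line, _word in pdf_words:
        tokens.append({"text": text, "bbox": [x0 * zoom, y0 * zoom, x1 * zoom, y1 * zoom]})
    return tokens


def write_pdf_tokens(pdf_path, page_index, out_path, dpi=150):
    """Read one page's words from the PDF text layer and save them as JSON."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependencies

    zoom = dpi / 72.0  # must match the zoom used when rendering the page image
    with fitz.open(pdf_path) as doc:
        words = doc[page_index].get_text("words")
    with open(out_path, "w") as f:
        json.dump(pdf_words_to_tokens(words, zoom), f)
```

A scanned PDF has no text layer, so `get_text("words")` returns an empty list for it and OCR is still needed in that case.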

Oct 06 '23 07:10 Nikhilsonawane07

You can convert the PDF pages to images.
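
One way to do the conversion, as a sketch: render each page with PyMuPDF (`fitz`) at a fixed DPI. The function names, the file naming scheme, and the `dpi` default are assumptions for illustration, and JPEG output from `Pixmap.save` is assumed to be supported by the installed PyMuPDF version (older releases may only write formats such as PNG):

```python
import os


def page_image_name(pdf_path, page_index):
    """Build a hypothetical per-page image file name, e.g. report_page3.jpg."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return f"{stem}_page{page_index}.jpg"


def render_pdf_pages(pdf_path, image_dir, dpi=150):
    """Render every page of the PDF to an image file and return the paths."""
    import fitz  # PyMuPDF; imported lazily so the naming helper works without it

    zoom = dpi / 72.0  # PDF coordinates are in points, 72 per inch
    os.makedirs(image_dir, exist_ok=True)
    paths = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            pixmap = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            out_path = os.path.join(image_dir, page_image_name(pdf_path, index))
            pixmap.save(out_path)
            paths.append(out_path)
    return paths
```

Any word boxes taken from the PDF itself would need to be multiplied by the same zoom factor so that they line up with the rendered pixels.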

Oct 06 '23 09:10 linkstatic12
