table-transformer
Running the model on PDFs and generating tokens for words.json from a PDF
I am generating tokens for the Table detection model from a PDF using the following script, and then converting the PDF to images. However, I am getting an error while running the inference pipeline.
Please help with the issue. Also, if anyone can share their script to run the model on PDFs, that would be great! Thanks in advance!
I ran into a similar problem and fixed it. In my case, it was because Rect() expects four params: Rect(x0, y0, x1, y1), where (x0, y0) is the bottom-left and (x1, y1) is the top-right.
I'm using easyocr. To represent a bounding box, it returns a list of four coordinate pairs, so I had to grab the two relevant corners.
I also ran into an issue serializing int64s. Here's my OCR code:
import json

import easyocr

reader = easyocr.Reader(['en'])
result = reader.readtext('path/to/image.jpg')

words = []
for word in result:
    # easyocr returns (bbox, text, confidence); bbox is four corner points
    # ordered top-left, top-right, bottom-right, bottom-left.
    bbox_raw = word[0]
    # Keep the two opposite corners: [xmin, ymin, xmax, ymax].
    bbox = [bbox_raw[0][0], bbox_raw[0][1], bbox_raw[2][0], bbox_raw[2][1]]
    text = word[1]
    words.append({"text": text, "bbox": bbox})

with open("path/to/image_words.json", "w") as file:
    # default=int converts numpy int64 coordinates into plain ints for JSON.
    json.dump(words, file, default=int)
I ran as follows:
python inference.py --image_dir ../path/to/img/ --words_dir ../path/to/words/ --out_dir ../results --structure_config_path ./structure_config.json --structure_model_path ./TATR-v1.1-Pub-msft.pth --mode extract --csv --detection_config_path ./detection_config.json --detection_model_path ./pubtables1m_detection_detr_r18.pth --visualize
I would advise using pdftools, which is available in R; the library can also be called from Python. pdftools is much more accurate when it comes to PDF manipulation.
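If it helps, here is a minimal sketch of calling pdftools from Python through rpy2 (this assumes R, the pdftools package, and rpy2 are installed; the file path is just a placeholder):
from rpy2.robjects.packages import importr

# Load the R pdftools package (R and pdftools must be installed on the system).
pdftools = importr('pdftools')

# pdf_text() returns one text string per page.
pages = pdftools.pdf_text('path/to/document.pdf')
for i, page_text in enumerate(pages):
    print(f"--- page {i} ---")
    print(page_text)

# pdf_data() returns per-word boxes (x, y, width, height, text) for each page,
# which is closer to what words.json needs, but the R data frames need an
# extra conversion step (e.g. rpy2's pandas2ri) before they can be used here.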
Hey, thanks for the reply, but I am looking to read text from the PDF only, not from images.
You can convert the PDF pages to images.
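For example, a rough sketch with the pdf2image package (my own choice, not something from this repo; it needs poppler installed, and the paths and DPI are placeholders):
from pdf2image import convert_from_path

# Render every page of the PDF to a PIL image (requires poppler).
pages = convert_from_path('path/to/document.pdf', dpi=200)

for i, page in enumerate(pages):
    # One image per page; these can be fed to inference.py together with the
    # matching *_words.json files. If the word boxes come from the PDF's point
    # coordinates (1/72 inch), scale them by dpi / 72 to match these images.
    page.save(f'path/to/img/page_{i}.jpg', 'JPEG')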