Brandon Smock comments

Results: 33 comments of Brandon Smock

Couldn't download PubTables-1M dataset

Hey everyone, I am looking into this and reaching out to the team that manages the hosting for the dataset. If there is no resolution soon, I'll find another...

Couldn't download PubTables-1M dataset

Looks like the issue is not getting resolved, so I just uploaded the dataset here: https://huggingface.co/datasets/bsmock/pubtables-1m Try this new link and let me know if you have any issues. Cheers,...
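For anyone scripting the download from the new link, here is a minimal sketch using the `huggingface_hub` package. Only the repository id comes from the comment above; the helper names `download_kwargs` and `download`, the default target folder, and the optional file patterns are illustrative assumptions. The comment does not list the repository's file names, so no specific pattern is suggested.

```python
from pathlib import Path

# Repository id taken from the link in the comment above.
REPO_ID = "bsmock/pubtables-1m"


def download_kwargs(local_dir, patterns=None):
    """Build keyword arguments for huggingface_hub.snapshot_download.

    `local_dir` and `patterns` are hypothetical parameters for this sketch.
    """
    kwargs = {
        "repo_id": REPO_ID,
        "repo_type": "dataset",  # the link points at a dataset repo, not a model
        "local_dir": str(Path(local_dir)),
    }
    if patterns:
        # Optional glob filter to fetch only part of the repository.
        # File names in the repo are not given in the comment, so any
        # pattern passed here is the caller's own guess.
        kwargs["allow_patterns"] = list(patterns)
    return kwargs


def download(local_dir="pubtables-1m", patterns=None):
    """Download the dataset repository and return the local path."""
    # Imported lazily so download_kwargs works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(local_dir, patterns))
```

Calling `download()` fetches the whole repository into `./pubtables-1m`. The dataset is large, so passing `patterns` to restrict the download may be worthwhile once you know which archives you need.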

Couldn't download PubTables-1M dataset

Was anyone able to get the dataset from the new link?

Table annotations within full page images

Hi, In theory the structure annotations on top of the full-page table detection images should be recoverable from the PDF-Annotations data. However, something to note is that for PubTables-1M, a...

Stuck trying to run main.py. All help gratefully accepted...

Hi, First let me say I understand the ask for better documentation for a broader audience. This repository has been intended mostly for other ML researchers, to allow others to...

Questions regarding the pubmed datasets.

Right now this is just a copy of the original dataset. But soon we will update the test and val splits to version 1.1. This version is what is used...

Annotation Tool

For some context, the format and naming of the fields for the words JSON files originate with the text extraction in PyMuPDF, which for each word gives block_num, line_num, and...

Annotation Tool

Check the newly-created scripts/ folder for code that creates the words JSON files from PDF for datasets where PDFs are available, such as PubTables-1M, FinTabNet, and SciTSR.
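As a rough illustration of how word-level JSON can be produced from a PDF with PyMuPDF, here is a minimal sketch. It is not the code in the scripts/ folder. PyMuPDF's `page.get_text("words")` returns tuples of the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The output keys `bbox`, `text`, and `word_num` are assumptions made for this example, as are the function names. Only `block_num` and `line_num` are named in the comments above, and the real scripts may also rescale coordinates to image space.

```python
import json


def word_to_record(word):
    """Convert one PyMuPDF word tuple into a JSON-friendly dict."""
    x0, y0, x1, y1, text, block_no, line_no, word_no = word
    return {
        "bbox": [x0, y0, x1, y1],   # assumed key name; PDF-space coordinates
        "text": text,               # assumed key name
        "block_num": block_no,
        "line_num": line_no,
        "word_num": word_no,        # assumed key name
    }


def extract_words(pdf_path, page_index=0):
    """Return word records for a single page of a PDF."""
    # Imported lazily so word_to_record can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        return [word_to_record(w) for w in page.get_text("words")]


def save_words(records, json_path):
    """Write the word records to a JSON file."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f)
```

For example, `save_words(extract_words("paper.pdf"), "paper_0_words.json")` would write the words of the first page. Both file names here are placeholders.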

Fine-tuning dataset size

@IrfanSk-AI PubTables-1M contains 500k samples (page images) for detection. 500k is enough to train from scratch if your data is high-quality. Could there be a data quality issue? @bely66 1k...

Input for TSR model?

This issue is a duplicate of #21 (and possibly others), but because the Colab notebook using the models on HuggingFace is new, it's worth re-addressing. In summary: - The TSR...