Brandon Smock

33 comments by Brandon Smock

Hey everyone, I am looking into this and reaching out to the team that manages the hosting for the dataset. If there is no resolution soon, I'll find another...

Looks like the issue is not getting resolved, so I just uploaded the dataset to Hugging Face (https://huggingface.co/datasets/bsmock/pubtables-1m). Try this new link and let me know if you have any issues. Cheers,...

Was anyone able to get the dataset from the new link?

Hi, In theory the structure annotations on top of the full-page table detection images should be recoverable from the PDF-Annotations data. However, something to note is that for PubTables-1M, a...

Hi, First let me say I understand the ask for better documentation for a broader audience. This repository has been intended mostly for other ML researchers, to allow others to...

Right now this is just a copy of the original dataset. But soon we will update the test and val splits to version 1.1. This version is what is used...

For some context, the format and naming of the fields for the words JSON files originate with the text extraction in PyMuPDF, which for each word gives block_num, line_num, and...

Check the newly-created scripts/ folder for code that creates the words JSON files from PDF for datasets where PDFs are available, such as PubTables-1M, FinTabNet, and SciTSR.
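To give a rough idea of what such a conversion involves, here is a minimal sketch, not the code in the scripts/ folder. It assumes PyMuPDF's documented `page.get_text("words")` output, where each word is a tuple `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The function names and the `bbox` and `text` keys are hypothetical choices for illustration; only `block_num` and `line_num` are named in the comments above, so the remaining tuple element is left out rather than guessed at.

```python
import json


def words_from_tuples(word_tuples):
    """Convert PyMuPDF-style word tuples into JSON-ready dicts.

    Each input tuple is (x0, y0, x1, y1, word, block_no, line_no, word_no).
    The output key names "bbox" and "text" are assumptions for this sketch.
    """
    words = []
    for x0, y0, x1, y1, text, block_no, line_no, _word_no in word_tuples:
        words.append({
            "bbox": [x0, y0, x1, y1],   # word bounding box in page coordinates
            "text": text,
            "block_num": block_no,      # text block the word belongs to
            "line_num": line_no,        # line within that block
        })
    return words


def extract_page_words(pdf_path, page_index=0):
    """Read one page of a PDF and return its words as dicts.

    Requires PyMuPDF; it is imported lazily so the converter above
    can be used without it.
    """
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return words_from_tuples(doc[page_index].get_text("words"))


def save_words(words, out_path):
    """Write the word dicts to a JSON file."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(words, f)
```

For example, `save_words(extract_page_words("paper.pdf"), "paper_0_words.json")` would write the words of the first page, where both file names are placeholders.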

@IrfanSk-AI PubTables-1M contains 500k samples (page images) for detection. 500k is enough to train from scratch if your data is high-quality. Could there be a data quality issue? @bely66 1k...

This issue is a duplicate of #21 (and possibly others), but because the Colab notebook using the models on HuggingFace is new, it's worth re-addressing. In summary: - The TSR...