
pdf files not included in the dataset

Open Apurv3377 opened this issue 3 years ago • 6 comments

I have been working on DocBank_samples for a month now. Today I downloaded the main dataset from OneDrive and could not find any PDF files! Would it be possible to provide the PDF files too?

I appreciate the help!

Apurv3377 avatar Jun 06 '21 09:06 Apurv3377

I also wonder why no PDFs are provided. Providing only images + annotations restricts what kinds of models can be built. What if I want a model that takes PDFs as input rather than images?

iiLaurens avatar Jun 12 '21 09:06 iiLaurens

Actually, I would also be fine if only the non-colored PDFs were provided. :) But yes, annotated ones would be more helpful.

Apurv3377 avatar Jun 14 '21 14:06 Apurv3377

In fact, the PDFs are derived from arXiv papers from 2014-2018, but we currently have no plans to provide the corresponding PDFs. Sorry about this.

liminghao1630 avatar Jun 15 '21 06:06 liminghao1630

Hello @liminghao1630, thank you for your project. About the PDFs, are you planning to publish at least the code for preprocessing them? I mean, if I take a quick look at an original paper corresponding to your images, I see some mismatches between them, e.g. figures on different pages (and therefore different bounding boxes and annotations in general).

andreagemelli avatar Nov 03 '21 13:11 andreagemelli

Does anyone know if the names of the PDFs are available? In that case I guess one could build a pipeline to download them.

mattiasstahre avatar Jan 28 '22 11:01 mattiasstahre

@liminghao1630 It's unfortunate that you do not publish the original files as it prohibits pdf-based models from using this dataset.

@mattiasstahre I think that from looking at the names of the .txt and .jpg files in the dataset one can identify the arXiv URL:

1.tar_1401.0098.gz_TachyonPotentialsV12-PRD-enviado_8.txt => https://arxiv.org/abs/1401.0098 (Title: Tachyon potentials from a supersymmetric FRW model)

1.tar_1501.00050.gz_Godoy-Diana_etal_2014_Enzo_Levi_Workshop_4_ori.jpg => https://arxiv.org/abs/1501.00050 (Title: Four-winged flapping flyer in forward flight)


The page number is also available from the filename (0 indexed, so 4 is the 5th page).
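
As a rough sketch of that mapping (not code from the DocBank authors), something like the following could extract the arXiv ID and page index from a filename. The regular expression is inferred only from the two example names above, so other files in the dataset may not match it; `parse_docbank_name` and `PageRef` are hypothetical names introduced here for illustration.

```python
import re
from typing import NamedTuple

# Pattern inferred from the two example filenames in this thread only:
#   <n>.tar_<arXiv ID>.gz_<source file stem>_<page index>[_ori].<txt|jpg>
# This is an assumption; files with a different layout will be rejected.
FILENAME_RE = re.compile(
    r"^\d+\.tar_(?P<arxiv_id>\d{4}\.\d{4,5})\.gz_"
    r"(?P<stem>.+)_(?P<page>\d+)(?:_ori)?\.(?:txt|jpg)$"
)


class PageRef(NamedTuple):
    arxiv_id: str
    abs_url: str
    page_index: int   # 0-indexed, as stored in the filename
    page_number: int  # 1-indexed, as a PDF viewer would show it


def parse_docbank_name(filename: str) -> PageRef:
    """Map a DocBank .txt/.jpg filename to its arXiv abstract URL and page."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized filename layout: {filename!r}")
    arxiv_id = match.group("arxiv_id")
    page_index = int(match.group("page"))
    return PageRef(
        arxiv_id=arxiv_id,
        abs_url=f"https://arxiv.org/abs/{arxiv_id}",
        page_index=page_index,
        page_number=page_index + 1,
    )


if __name__ == "__main__":
    for name in (
        "1.tar_1401.0098.gz_TachyonPotentialsV12-PRD-enviado_8.txt",
        "1.tar_1501.00050.gz_Godoy-Diana_etal_2014_Enzo_Levi_Workshop_4_ori.jpg",
    ):
        print(parse_docbank_name(name))
```

Raising on an unrecognized name, rather than guessing, makes it easier to spot files that do not follow the assumed layout before downloading anything.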

I might build a crawler for this if I get time.

jfreyberg avatar Apr 20 '23 13:04 jfreyberg