pdfalto

PDF to XML ALTO file converter

Results 85 pdfalto issues

For the example `471433v1` (from bioRxiv 10k training dataset), there is an image that is extracted with a blank overlay, with the same coordinates as the true image figure. In...

Hi, I found a bug: a `heap-buffer-overflow` in 499ed89. gcc --version: ``` ➜ 1 git:(master) ✗ gcc --version gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0 Copyright (C) 2019 Free Software Foundation, Inc. This is free...

Had some issues with text in Franklin Gothic that was bold but was not being marked as such. The font name was "franklingothic-heavy". I resolved this by checking for "heavy" as...
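The font-name check described in this report could look roughly like the sketch below. It is not pdfalto's actual code: the helper name `looks_bold` and the marker list are hypothetical, and only the "heavy" marker comes from the report itself ("bold" is an assumed baseline).

```python
# Hypothetical helper for illustration only; not taken from pdfalto.
# "heavy" is the marker mentioned in the issue; "bold" is an assumed baseline.
BOLD_MARKERS = ("bold", "heavy")


def looks_bold(font_name: str) -> bool:
    """Return True if the font name contains a weight marker that suggests bold."""
    lowered = font_name.lower()
    return any(marker in lowered for marker in BOLD_MARKERS)


print(looks_bold("franklingothic-heavy"))  # True
print(looks_bold("FranklinGothic-Book"))   # False
```

A substring test like this is deliberately loose, so it may also match unrelated names that happen to contain one of the markers.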

By default, pdfalto extracts both embedded bitmaps and vector graphics. The option -noImage avoids extracting both graphics types. However, we might still want the vector graphics extracted and not the...

enhancement

Hi, I don't know if this issue should be reported here or in the GROBID repo. I discovered it while experimenting with GROBID. When [parsing this PDF](https://dl.acm.org/doi/pdf/10.1145/3411764.3445436) (and a few...

We're currently testing pdfalto. Specifically, we're converting a lot of PDFs to HTML via the XML output of pdfalto (as we were not quite satisfied with the result of any...

As we discussed in the past, it seems that results from pdfalto are not consistent on macOS. The problem no longer occurs on Linux, but it's still happening on macOS....

I want to generate annotated image files to train OCR. ``` wget https://ia800902.us.archive.org/14/items/arxiv-0704.0646/0704.0646.pdf pdfalto 0704.0646.pdf 0704.0646.xml ``` The generated ALTO file shows that the page WIDTH and HEIGHT are 612 and 792....
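For context, 612 by 792 matches a US Letter page measured in PDF points (72 per inch). Assuming the ALTO values are in points, which the report does not confirm, mapping them onto a page image rendered at a given DPI might be sketched as follows. The function names and the 300 DPI figure are illustrative assumptions.

```python
# Assumption: ALTO coordinates are in PDF points (72 per inch).
POINTS_PER_INCH = 72.0


def points_to_pixels(value_pt: float, dpi: float) -> int:
    """Convert a length in points to pixels at the given rendering resolution."""
    return round(value_pt * dpi / POINTS_PER_INCH)


def scale_box(hpos: float, vpos: float, width: float, height: float, dpi: float):
    """Scale a box given as (HPOS, VPOS, WIDTH, HEIGHT) in points to pixels."""
    return tuple(points_to_pixels(v, dpi) for v in (hpos, vpos, width, height))


# Page size from the report, rendered at an assumed 300 DPI.
print(points_to_pixels(612, 300), points_to_pixels(792, 300))  # 2550 3300
```

The same scale factor would apply to every box on the page, provided the image was rendered from the full page without cropping.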

enhancement

The input is a large PDF (2772 pages). The output seems OK, but while trying to extract section headers, I noticed that sometimes String elements were missing. However, if I...

The following is an SVG example produced by pdfalto 0.3. ``` ...... ``` The SVG M or L path command needs both x and y coordinates, but only...
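One way to spot this kind of problem in generated output is to count the numeric arguments of each M and L command, since SVG path data expects them in x/y pairs. The sketch below is a simplified check rather than a full SVG path parser: it ignores exponent notation and all other commands, and the function name `incomplete_commands` is hypothetical.

```python
import re

# A moveto/lineto command letter followed by its raw argument text.
COMMAND_RE = re.compile(r"([MmLl])([^A-Za-z]*)")
# Plain decimal numbers only; exponent notation is not handled in this sketch.
NUMBER_RE = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)")


def incomplete_commands(path_data: str):
    """Return the M/L commands whose number of arguments is zero or odd."""
    bad = []
    for command, args in COMMAND_RE.findall(path_data):
        count = len(NUMBER_RE.findall(args))
        if count == 0 or count % 2 != 0:
            bad.append((command, args.strip()))
    return bad


print(incomplete_commands("M 10 20 L 30 40"))  # []
print(incomplete_commands("M 10 L 30 40"))     # [('M', '10')]
```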

bug
implemented