pdfalto
PDF to XML ALTO file converter
The text is the following: And when the grobid-quantities (in this case) extracts the `5` of the `n=5`, in reality it calculates the bounding boxes including `n=-` as well. When...
This should allow the use of a plain `xpdf` mirror without needing to touch the `CMakeLists.txt` file. https://cmake.org/cmake/help/latest/command/add_subdirectory.html#command:add_subdirectory

```cmake
add_subdirectory(source_dir [binary_dir] [EXCLUDE_FROM_ALL])
```

- Use automatic xpdf mirror repo...
Currently the following styles seem to be supported by pdfalto:

- italic
- bold
- subscript
- superscript

Some other styles we would be interested in:

- underline
- sc...
I used **Clang 6.0 and AddressSanitizer** to build **[pdfalto](https://github.com/kermitt2/pdfalto)**. This [file](https://github.com/grandnew/software-vulnerabilities/blob/master/pdfalto/infinite_loop) causes an infinite loop when running this command:

```shell
./pdfalto infinite_loop 1.xml
```
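When reproducing a hang like the one reported above, it can help to bound the run so the shell gets control back. The following is a minimal sketch, assuming GNU coreutils `timeout` is available; the helper name `run_bounded` and the 30-second limit are illustrative and not part of pdfalto.

```shell
# Hypothetical helper (not part of pdfalto): run a command under a time limit
# and report whether it was cut off.
# Assumes GNU coreutils `timeout`, which exits with status 124 when the limit is hit.
run_bounded() {
    limit="$1"
    shift
    status=0
    # Discard the command's own output; only the exit status matters here.
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s"
    else
        echo "exited with status $status"
    fi
}

# Usage with the command from the report (the limit is an arbitrary choice):
# run_bounded 30 ./pdfalto infinite_loop 1.xml
```

A timeout only suggests a possible infinite loop; a slow but terminating run on a large input would be reported the same way, so the limit should be chosen generously.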
Previously, we used pdfalto to generate an ALTO XML file from the PDF, and then https://github.com/filak/hOCR-to-ALTO to convert the ALTO XML to an hOCR file. With the newest release of pdfalto...
The option -cutPages does not seem to be working. Without that option, all text lines are extracted:

```
pdfalto -blocks input.pdf
```

but with the option enabled: ``` pdfalto -blocks -cutPages...
Hi, I found a crash in pdfalto (the latest commit `8296a3d` on master). PoC: https://github.com/strongcourage/PoCs/blob/master/pdfalto_8296a3d/PoC_segv_TextPage::createPath Command: pdfalto $PoC /dev/null ASAN says: ~~~ ==17560==ERROR: AddressSanitizer: SEGV on unknown address 0x02007f614fef (pc 0x00000073e2e2...
Hi, I found a UAF bug in pdfalto (the latest commit `8296a3d` on master). PoC: https://github.com/strongcourage/PoCs/blob/master/pdfalto_8296a3d/PoC_uaf_TextPage::createPath Command: pdfalto $PoC /dev/null ASAN says: ~~~ ==12326==ERROR: AddressSanitizer: heap-use-after-free on address 0x602000036418 at pc...
Compiling with gnu 5.4.0, got this warning: [ 15%] Building CXX object xpdf-4.00/xpdf/CMakeFiles/xpdf.dir/GlobalParams.cc.o In file included from /home/roger/higgins-crowley/pdfalto/pdfalto/xpdf-4.00/xpdf/GlobalParams.cc:64:0: /home/roger/higgins-crowley/pdfalto/pdfalto/xpdf-4.00/xpdf/UnicodeToUnicodeFontRules.h:29:1: warning: ISO C++ forbids converting a string constant to ‘char*’ [-Wwrite-strings]...
When I use pdfalto to parse Chinese PDF articles, some articles cannot be parsed, and the error "Syntax Error (305883): No font in show" is reported. Command: ./pdfalto pdf_file...