pdftojson
Using XPDF, pdftojson extracts text from PDF files as JSON, including word bounding boxes.
Compile
./configure
make
On macOS, you might need to specify the libpng and libfreetype locations, e.g.
./configure --with-libpng-library=/usr/local/Cellar/libpng/1.6.16/lib/ --with-libpng-includes=/usr/local/Cellar/libpng/1.6.16/include/ --with-freetype2-library=/usr/local/lib/ --with-freetype2-includes=/usr/local/include/freetype2/
After building, you will find pdftojson inside the directory xpdf/pdftojson.
Usage
pdftojson <input.pdf> <output.json>
File format
The JSON produced looks like:
[
  {
    "pages":14,
    "number":1,
    "width":612,
    "height":792,
    "text":[
      [115,162,41,14,0,"What "],
      ...
    ]
  },
  {
    "pages":14,
    "number":2,
    "width":612,
    "height":792,
    "text":[
      [115,162,41,14,0,"Here "],
      ...
    ]
  },
  ...
];
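For each page, every entry of the text array has the form [top,left,width,height,0,text]: the position and size of a word's bounding box, a fifth field that is 0 in the sample above, and the word's text.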
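As an illustration of reading this format, here is a minimal Python sketch. It is not part of pdftojson; the helper names (parse_pdftojson, iter_words, page_text) are hypothetical, and the second word in the sample data is made up for the example. It assumes the field order described above, and it strips a trailing semicolon in case the output file ends with one as the sample does.

```python
import json


def parse_pdftojson(raw):
    """Parse pdftojson output into a list of page dicts.

    Assumption: the output may end with ';' as in the sample shown above,
    which is not valid JSON, so it is stripped before parsing.
    """
    return json.loads(raw.strip().rstrip(";"))


def iter_words(pages):
    """Yield (page_number, top, left, width, height, text) for each word."""
    for page in pages:
        # Each entry is [top, left, width, height, 0, text]; the fifth
        # field is ignored here because its meaning is not documented.
        for top, left, width, height, _unused, text in page["text"]:
            yield page["number"], top, left, width, height, text


def page_text(page):
    """Concatenate the words of one page.

    Assumption: words carry their own trailing space, as "What " does in
    the sample, so no separator is added.
    """
    return "".join(entry[5] for entry in page["text"])


# Sample data: the first word comes from the format description above,
# the second word is invented for illustration.
SAMPLE = """
[
  {"pages": 14, "number": 1, "width": 612, "height": 792,
   "text": [[115, 162, 41, 14, 0, "What "], [115, 203, 12, 14, 0, "is "]]}
];
"""

if __name__ == "__main__":
    for word in iter_words(parse_pdftojson(SAMPLE)):
        print(word)
```

To read a real file, pass its contents to the parser, e.g. parse_pdftojson(open("output.json").read()), where output.json is whatever name was given as the second argument to pdftojson.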