
Request for examples in the documentation

Open johannspies opened this issue 6 years ago • 2 comments

I am struggling to figure out how to use this library to read a PDF as text for natural language processing, as an alternative to

using Taro
Taro.init()
meta, txtdata = Taro.extract(files[1]);

as shown in https://github.com/aviks/nlp-workshop/blob/master/NLP-in-julia.ipynb

Or can this library not be used instead of Taro (which I cannot compile on Julia 1.0.2)?

johannspies avatar Nov 21 '18 14:11 johannspies

PDFIO is a somewhat lower-level API than Taro in this respect. It handles each page of a PDF separately, so you may need a few extra lines of code. The code you are looking for is the following:

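using PDFIO

function getPDFText(src, out)
    # Handle used for all subsequent operations on the document.
    doc = pdDocOpen(src)
    # Document metadata; kept and returned to the caller.
    docinfo = pdDocGetInfo(doc)
    open(out, "w") do io
        npage = pdDocGetPageCount(doc)
        for i = 1:npage
            page = pdDocGetPage(doc, i)
            # Extract the page's text and write it to the output file.
            pdPageExtractText(io, page)
        end
    end
    # The doc handle must not be used after this call.
    pdDocClose(doc)
    return docinfo
end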
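
For NLP work the text is often wanted in memory rather than in a file, as with `txtdata` in the Taro call from the question. A minimal sketch of that variant, assuming `pdPageExtractText` accepts any `IO` (so an `IOBuffer` can stand in for the file handle); `extract_text` is a hypothetical helper name, not part of PDFIO:

```julia
using PDFIO

# Hypothetical helper (not part of PDFIO): return the text of all pages as one String.
# Assumes pdPageExtractText writes to any IO, including an in-memory IOBuffer.
function extract_text(src)
    doc = pdDocOpen(src)
    try
        buf = IOBuffer()
        for i in 1:pdDocGetPageCount(doc)
            pdPageExtractText(buf, pdDocGetPage(doc, i))
            println(buf)  # newline between pages
        end
        return String(take!(buf))
    finally
        # Close the document even if extraction throws.
        pdDocClose(doc)
    end
end
```

With this in place, `txtdata = extract_text(files[1])` would play the role of the Taro call, and `pdDocGetInfo` can still be called separately if the metadata is needed.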

If you still run into any issues with the code, please let us know so that we can try to address them.

The library is deliberately kept flexible so that PDF objects can be queried in detail. A summary-level API or a set of samples would also help people get quick tasks done. We will keep that in mind and add a few examples to the documentation.

sambitdash avatar Nov 22 '18 07:11 sambitdash

Thanks! That helps.

johannspies avatar Nov 22 '18 08:11 johannspies