Unable to extract text from PDF/A (with FlateDecode)

Open: celsowm opened this issue 5 years ago • 1 comment

Hi! I tried this PDF with the following code:


require 'rubygems'
require 'pdf/reader'

filename = "pdfa.pdf"

PDF::Reader.open(filename) do |reader|
  reader.pages.each do |page|
    puts page.text
  end
end

But the result was only whitespace, something like:

                
    
                                                                                                                 
                                                                                  
                
  

                                                                                                                                  
                      

Is there any way to extract text from it?

celsowm, May 21 '20 18:05

I get the same results when trying to extract text using pdf-reader.

I also tried extracting text with pdftotext (which uses libpoppler) and Firefox (which uses pdf.js). Neither of them worked.

I haven't checked the PDF contents in detail, but if both poppler and pdf.js have trouble with it, then I suspect it's a broken file.
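
For anyone who wants to look at the contents more closely, here is a minimal sketch of a per-page check with pdf-reader. The helper names `blank_extraction?` and `diagnose` are made up for illustration and are not part of the library; the sketch only assumes the `pdf-reader` gem is installed and relies on `page.text`, `page.fonts`, and `page.raw_content`. It does not fix the extraction, it just reports what the library sees.

```ruby
# Hypothetical helper: true when the extracted text has no visible characters.
def blank_extraction?(text)
  text.to_s !~ /[[:graph:]]/
end

# Hypothetical helper: print a short per-page summary for a PDF file.
def diagnose(filename)
  require 'pdf/reader' # loaded here so the helper above works without the gem

  PDF::Reader.open(filename) do |reader|
    puts "PDF version: #{reader.pdf_version}, pages: #{reader.page_count}"
    reader.pages.each do |page|
      status = blank_extraction?(page.text) ? "blank" : "has text"
      puts "page #{page.number}: #{status}, " \
           "#{page.fonts.size} font(s), " \
           "#{page.raw_content.bytesize} bytes of content stream"
    end
  end
end

# Usage (the filename is the one from the report above):
# diagnose("pdfa.pdf")
```

If a page reports fonts and a non-empty content stream but its text is still blank, that would point at the file itself (for example, how its fonts are encoded) rather than at the calling code, which fits the results from the other tools.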

yob, May 24 '20 07:05