How does pdfquery determine the index?

Open SalmonTT opened this issue 6 years ago • 0 comments

Amazon_CF.pdf
Amazon.txt

Hi jcushman!

I am a freshman from Hong Kong and am currently trying to find a way to read tables from a PDF and work with the data.

I tried the following code with the attached PDF and obtained the results stored in the .txt file, which I have also attached:

    import pdfquery

    # Parse the PDF and dump the resulting element tree as XML
    pdf = pdfquery.PDFQuery('Amazon_CF.pdf')
    pdf.load()
    pdf.tree.write('test.xml', pretty_print=True)

My questions are:

  1. How is the index determined? It appears that the index order does not follow line-by-line order.
  2. Are there any methods to re-arrange the index? Preferably in line-by-line, left-to-right order.
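To make the second question concrete, here is a minimal sketch of the ordering I have in mind. It does not use pdfquery itself and is not a built-in pdfquery feature. It assumes each text element has already been reduced to a plain dict with `x0`, `y1` and `text`, taken from the `x0`/`y1` bounding-box attributes that appear on elements in the XML output. The function name `reading_order`, the `line_tol` parameter and the sample values are hypothetical.

```python
from typing import Dict, List


def reading_order(boxes: List[Dict], line_tol: float = 2.0) -> List[Dict]:
    """Return boxes sorted top-to-bottom, then left-to-right.

    Assumes PDF-style coordinates: the origin is at the bottom-left of the
    page, so a larger y1 (top edge) means the box sits higher on the page.
    """
    # Highest boxes first.
    ordered = sorted(boxes, key=lambda b: -b["y1"])

    # Group boxes whose top edges lie within line_tol of the first box
    # in the current line; otherwise start a new line.
    lines: List[List[Dict]] = []
    for box in ordered:
        if lines and abs(lines[-1][0]["y1"] - box["y1"]) <= line_tol:
            lines[-1].append(box)
        else:
            lines.append([box])

    # Within each line, order by the left edge.
    return [b for line in lines for b in sorted(line, key=lambda b: b["x0"])]


# Hypothetical sample data, not taken from the attached PDF.
sample = [
    {"text": "r1c2", "x0": 300.0, "y1": 700.5},
    {"text": "r2c1", "x0": 50.0, "y1": 650.0},
    {"text": "r1c1", "x0": 50.0, "y1": 700.0},
    {"text": "r2c2", "x0": 300.0, "y1": 649.2},
]
print([b["text"] for b in reading_order(sample)])
# -> ['r1c1', 'r1c2', 'r2c1', 'r2c2']
```

The tolerance is there because cells on the same visual row rarely share exactly the same `y1`; a suitable value likely depends on the font size in the document.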

Hopefully my explanation is clear enough. Any help would be greatly appreciated!

Cheers, Simon

SalmonTT, Jun 13 '18 07:06