pdfquery
How does pdfquery determine the index?
Attached: Amazon.txt
Hi jcushman!
I am a freshman from Hong Kong, and I am currently trying to find a way to read tables from a PDF and work with the data.
I tried the following code with the attached PDF and obtained the results stored in the .txt file, which I have also attached:
    pdf = pdfquery.PDFQuery('Amazon_CF.pdf')
    pdf.load()
    pdf.tree.write('test.xml', pretty_print=True)
My questions are:
- How is the index determined? The index order does not appear to follow the line-by-line order of the page.
- Are there any methods to re-arrange the index? Preferably line by line, and left to right within each line.
Hopefully my explanation is clear enough. Any help would be greatly appreciated!
Cheers, Simon
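
For illustration, here is one way the requested line-by-line, left-to-right order could be computed once each text element's bounding box is known. This is only a sketch and not something confirmed by the pdfquery author. It assumes each element has been reduced to a dict with hypothetical keys `x0`, `y0`, and `text`; the function name `reading_order`, the `y_tolerance` parameter, and the sample values are all made up for the example. It relies on the fact that PDF coordinates start at the bottom-left corner, so a larger `y0` means higher on the page.

```python
def reading_order(items, y_tolerance=2.0):
    """Group items into lines (top to bottom) and sort each line left to right.

    `items` is an iterable of dicts with hypothetical keys "x0", "y0", "text".
    Items whose y0 is within `y_tolerance` of the first item of the current
    line are treated as belonging to that line.
    """
    # Larger y0 is higher on the page, so sort by descending y0 first.
    ordered = sorted(items, key=lambda it: -it["y0"])

    lines = []
    for it in ordered:
        if lines and abs(lines[-1][0]["y0"] - it["y0"]) <= y_tolerance:
            lines[-1].append(it)   # same visual line as the previous item
        else:
            lines.append([it])     # start a new line

    # Within each line, order by the left edge.
    return [sorted(line, key=lambda it: it["x0"]) for line in lines]


# Made-up sample data standing in for elements extracted from a PDF.
# With pdfquery, the coordinates would presumably come from the x0/y0
# attributes visible in the XML dump (an assumption, not verified here).
cells = [
    {"x0": 300.0, "y0": 700.0, "text": "Amount"},
    {"x0": 50.0, "y0": 650.5, "text": "Cash"},
    {"x0": 50.0, "y0": 700.8, "text": "Item"},
    {"x0": 300.0, "y0": 650.0, "text": "100"},
]

rows = [[it["text"] for it in line] for line in reading_order(cells)]
print(rows)
```

The tolerance is there because text on the same visual line often has slightly different baselines, so an exact match on `y0` would split rows apart. A suitable value depends on the document's font size and line spacing.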