pdfplumber
Extracting a table with vertical text gives unreadable results
Describe the bug
Extracting a table whose header cells contain vertical text returns unreadable strings or text in reversed order.
Have you tried repairing the PDF?
Yes. The problem is still there
Code to reproduce the problem
import pdfplumber
pdf = pdfplumber.open(r"tests\pdf_samples\camelot\agstat.pdf", repair=True)
p0 = pdf.pages[0]
# im = p0.to_image()
# im.debug_tablefinder()
# im.show()
table = p0.extract_table()
for line in table:
    print(line)
PDF file
Expected behavior
The vertical text in the red box should be extracted correctly.
Actual behavior
It returned unreadable text for the first row:
['Sl.\nNo.', 'District', 'noitalupoP\n31-2102\n)shkal\ndetcejorP\nnI(\nrof', '%88\not )shkal\ntludA\ntnelaviuqE\nnI(', ')yad/tluda/smg004\nnoitpmusnoC\n)sennot\ntnemeriuqer\nhkaL\nlatoT\nnI(\n@(', 'tnemeriuqeR ,sdees )egatsaw )sennot\ngnidulcnI(\nhkaL\n&\nsdeef\nlatoT\nnI(', 'Production (Rice)\n(In Lakh tonnes)', None, None, 'Surplus/Defi cit\n(In Lakh\ntonnes)', None]
It also returned reversed text for the second row:
[None, None, None, None, None, None, 'firahK', 'ibaR', 'latoT', 'eciR', 'yddaP']
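For illustration only, here is a rough post-processing sketch that reverses the characters of each line in a cell. This is not a pdfplumber feature and not a fix for the underlying problem. `reverse_cell` and `reverse_row` are hypothetical helper names, and the sketch assumes you already know which rows are affected.

```python
def reverse_cell(cell):
    """Reverse the characters of each line in a cell; leave None untouched."""
    if cell is None:
        return None
    return "\n".join(line[::-1] for line in cell.split("\n"))


def reverse_row(row):
    """Apply reverse_cell to every cell of a table row (hypothetical helper)."""
    return [reverse_cell(cell) for cell in row]


# Second row exactly as returned by extract_table() above.
second_row = [None, None, None, None, None, None,
              'firahK', 'ibaR', 'latoT', 'eciR', 'yddaP']
print(reverse_row(second_row))
# → [None, None, None, None, None, None, 'Kharif', 'Rabi', 'Total', 'Rice', 'Paddy']
```

This only recovers the second row. Applied to the first row, it makes the individual words readable, but the order of the lines within each cell stays scrambled, so it cannot replace correct extraction.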
Screenshots
The table outline is still detected correctly
Environment
- pdfplumber version: 0.10.1
- Python version: [e.g., 3.10]
- OS: Windows 10