pdfplumber: Extracting a table with vertical text gives unreadable results

Open Dragon2fly opened this issue 11 months ago • 9 comments

Describe the bug

When a table has vertical header text, table extraction returns unreadable strings: the characters of each word come back in reversed order.

Have you tried repairing the PDF?

Yes. The problem is still there.

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open(r"tests\pdf_samples\camelot\agstat.pdf", repair=True)
p0 = pdf.pages[0]
# im = p0.to_image()
# im.debug_tablefinder()
# im.show()
table = p0.extract_table()
for line in table:
    print(line)

PDF file

agstat.pdf

Expected behavior

The vertical text in the red box should be extracted correctly.

Actual behavior

It returned unreadable text for the first row:

['Sl.\nNo.', 'District', 'noitalupoP\n31-2102\n)shkal\ndetcejorP\nnI(\nrof', '%88\not )shkal\ntludA\ntnelaviuqE\nnI(', ')yad/tluda/smg004\nnoitpmusnoC\n)sennot\ntnemeriuqer\nhkaL\nlatoT\nnI(\n@(', 'tnemeriuqeR ,sdees )egatsaw )sennot\ngnidulcnI(\nhkaL\n&\nsdeef\nlatoT\nnI(', 'Production (Rice)\n(In Lakh tonnes)', None, None, 'Surplus/Defi cit\n(In Lakh\ntonnes)', None]

And it returned reversed text for the second row:

[None, None, None, None, None, None, 'firahK', 'ibaR', 'latoT', 'eciR', 'yddaP']
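As a stopgap, the reversed cells can be post-processed by reversing the characters of each line. This is only a sketch of a workaround, not a fix in pdfplumber itself: `reverse_lines` is a hypothetical helper, and it assumes you already know which cells hold vertical text. It makes the words readable, but it does not restore the order of the lines inside the multi-line cells of the first row.

```python
def reverse_lines(cell):
    """Reverse the characters of each line in a cell; leave None untouched.

    Hypothetical helper, not part of pdfplumber.
    """
    if cell is None:
        return None
    return "\n".join(line[::-1] for line in cell.split("\n"))


# Second row exactly as returned by extract_table() above.
second_row = [None, None, None, None, None, None,
              'firahK', 'ibaR', 'latoT', 'eciR', 'yddaP']

fixed = [reverse_lines(cell) for cell in second_row]
print(fixed)
# [None, None, None, None, None, None, 'Kharif', 'Rabi', 'Total', 'Rice', 'Paddy']
```

Applied to the first row, 'noitalupoP\n31-2102' becomes 'Population\n2012-13', so the words are readable but the lines still have to be reordered by hand. Cells that were already extracted correctly (such as 'District') must be skipped, otherwise they get reversed too.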

Screenshots

The table outline is still detected correctly.

Environment

pdfplumber version: 0.10.1
Python version: [e.g., 3.10]
OS: Windows 10

Jul 22 '23 02:07 Dragon2fly
