Extracting a table with vertical text gives unreadable results

Open Dragon2fly opened this issue 11 months ago • 9 comments

Describe the bug

Extracting a table whose header cells contain vertical text returns strings that are unreadable or in reversed character order.

Have you tried repairing the PDF?

Yes. The problem is still there after opening the file with repair=True.

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open(r"tests\pdf_samples\camelot\agstat.pdf", repair=True)
p0 = pdf.pages[0]
# im = p0.to_image()
# im.debug_tablefinder()
# im.show()
table = p0.extract_table()
for line in table:
    print(line)

PDF file

agstat.pdf

Expected behavior

The vertical text in the red box should be extracted correctly.

image

Actual behavior

It returned unreadable text for the first row:

['Sl.\nNo.', 'District', 'noitalupoP\n31-2102\n)shkal\ndetcejorP\nnI(\nrof', '%88\not )shkal\ntludA\ntnelaviuqE\nnI(', ')yad/tluda/smg004\nnoitpmusnoC\n)sennot\ntnemeriuqer\nhkaL\nlatoT\nnI(\n@(', 'tnemeriuqeR ,sdees )egatsaw )sennot\ngnidulcnI(\nhkaL\n&\nsdeef\nlatoT\nnI(', 'Production (Rice)\n(In Lakh tonnes)', None, None, 'Surplus/Defi cit\n(In Lakh\ntonnes)', None]

And it returned reversed text for the second row:

[None, None, None, None, None, None, 'firahK', 'ibaR', 'latoT', 'eciR', 'yddaP']
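For the second row, the damage looks like a plain character reversal, so it can be undone after extraction as a stopgap. The following is a minimal sketch under that assumption; `unreverse_cells` is a hypothetical helper written for this report, not part of pdfplumber. It would not fully repair the first row, where the words within each cell also come back in the wrong order.

```python
from typing import List, Optional

Row = List[Optional[str]]


def unreverse_cells(row: Row) -> Row:
    """Reverse the characters of every non-empty cell; None cells are kept.

    Hypothetical helper: assumes each cell is the expected text reversed
    character by character, as in the second row reported above.
    """
    return [cell[::-1] if cell is not None else None for cell in row]


# Second row exactly as returned by extract_table() in this report.
second_row = [None, None, None, None, None, None,
              'firahK', 'ibaR', 'latoT', 'eciR', 'yddaP']

fixed = unreverse_cells(second_row)
print(fixed)
```

This should only be applied to rows known to hold vertical text, since it would also reverse cells that were extracted correctly, such as 'District' in the first row.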

Screenshots

The table outline is still detected correctly, so only the text inside the cells is affected.

image

Environment

  • pdfplumber version: 0.10.1
  • Python version: [e.g., 3.10]
  • OS: Windows 10

Dragon2fly, Jul 22 '23 02:07