pdfminer.six Out-of-order coordinates

I simply called the pdfminer method to parse a PDF, but the coordinates in the parsed result differ from what I expected: sometimes they are too high, sometimes too low. pdf.js handles the same file correctly.

region_1_0

I've made some modifications that fix this:

    Tm_mul_CTM = matrix
    Th = scaling
    Tfs = fontsize
    _render_matrix = (Tfs * Th, 0,  # 0
                      0, Tfs,       # 0
                      0, rise       # 1
                      )
    Trm = mult_matrix(_render_matrix, Tm_mul_CTM)
    (a, b, c, d, e, f) = Trm
    w, h = x1 - x0, y1 - y0
    (x0, y0) = (e, f)
    (x1, y1) = (x0 + w, y0 + h)
    y0, y1 = y0 + descent, y1 + descent
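For readers who want to try the matrix arithmetic in the snippet above outside of pdfminer, here is a minimal, self-contained sketch. The local `mult_matrix` helper is written to follow the usual six-element affine convention (apply the first matrix, then the second); `text_rendering_matrix` and `glyph_origin` are hypothetical names introduced only for this illustration, and `scaling` is assumed to be a fraction (1.0 means 100%). It is not the library's implementation.

```python
from typing import Tuple

# Six-element affine matrix (a, b, c, d, e, f), as used in PDF content streams.
Matrix = Tuple[float, float, float, float, float, float]


def mult_matrix(m1: Matrix, m0: Matrix) -> Matrix:
    """Compose two affine matrices: the result applies m1 first, then m0."""
    a1, b1, c1, d1, e1, f1 = m1
    a0, b0, c0, d0, e0, f0 = m0
    return (
        a0 * a1 + c0 * b1,
        b0 * a1 + d0 * b1,
        a0 * c1 + c0 * d1,
        b0 * c1 + d0 * d1,
        a0 * e1 + c0 * f1 + e0,
        b0 * e1 + d0 * f1 + f0,
    )


def text_rendering_matrix(
    fontsize: float, scaling: float, rise: float, tm_mul_ctm: Matrix
) -> Matrix:
    """Hypothetical helper: build Trm from font size, scaling, rise and Tm x CTM."""
    render = (fontsize * scaling, 0.0, 0.0, fontsize, 0.0, rise)
    return mult_matrix(render, tm_mul_ctm)


def glyph_origin(
    fontsize: float, scaling: float, rise: float, tm_mul_ctm: Matrix
) -> Tuple[float, float]:
    """Hypothetical helper: the (e, f) translation part of Trm."""
    *_, e, f = text_rendering_matrix(fontsize, scaling, rise, tm_mul_ctm)
    return (e, f)


if __name__ == "__main__":
    # Assumed example values: 10pt font, 50% horizontal scaling, rise of 3,
    # and a combined Tm x CTM that scales by 2 and translates to (10, 20).
    print(glyph_origin(10.0, 0.5, 3.0, (2.0, 0.0, 0.0, 2.0, 10.0, 20.0)))
```

Printing the origin for a few characters of the problem PDF, next to the coordinates pdfminer reports, is one way to see where the two start to diverge.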

Feb 23 '24 05:02 zhangtingyun

I don't know whether my change is correct. Please let me know, or could you fix this bug? Thanks!

Feb 23 '24 05:02 zhangtingyun

page_1.pdf (this is the PDF)

Feb 23 '24 05:02 zhangtingyun

a227a8e0-d20d-11ee-9186-1f397a94c388

Feb 23 '24 05:02 zhangtingyun

Hi, I'm facing the same issue while using pdfplumber, which is built on pdfminer.six. I'm using pdfminer.six 20221105 and pdfplumber 0.10.4. I also tried repairing the PDF with Ghostscript, as follows:

gswin64c -o repaired.pdf -sDEVICE=pdfwrite input.pdf

The output file is repaired.pdf. Reference: https://github.com/jsvine/pdfplumber/issues/425

Text extracted from repaired.pdf is still out of order.
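In case it helps anyone reproduce the repair step from a script, the Ghostscript command above could be wrapped roughly like this. The function names are hypothetical, and the choice of `gs` as the executable name on non-Windows systems is an assumption about a typical install; the flags are the same ones used in the command above.

```python
import subprocess
import sys
from typing import List, Optional


def ghostscript_command(
    src: str, dst: str, executable: Optional[str] = None
) -> List[str]:
    """Build the argument list for rewriting src to dst with the pdfwrite device."""
    if executable is None:
        # Assumption: gswin64c on Windows, gs elsewhere.
        executable = "gswin64c" if sys.platform.startswith("win") else "gs"
    return [executable, "-o", dst, "-sDEVICE=pdfwrite", src]


def repair_pdf(src: str, dst: str, executable: Optional[str] = None) -> None:
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(ghostscript_command(src, dst, executable), check=True)


if __name__ == "__main__":
    # Only prints the command; call repair_pdf(...) to actually run it.
    print(ghostscript_command("input.pdf", "repaired.pdf"))
```

As noted above, rewriting the file this way did not change the extraction order in my case, so this is only useful for ruling out a damaged PDF.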

https://github.com/pdfminer/pdfminer.six/blob/ebf7bcdb983f36d0ff5b40e4f23b52525cb28f18/pdfminer/layout.py#L375

I then tried removing the "* fontsize" factor from the descent calculation at the line linked above, and the result comes out correct.
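To make the size of that change concrete, here is a toy calculation with made-up numbers. It is not pdfminer code: `descent_offset` and `bbox_y` are hypothetical names, and treating the font descent as a fraction of the font size is an assumption for illustration. It only shows how far the vertical extent of a character box moves depending on whether the descent is multiplied by the font size.

```python
from typing import Tuple


def descent_offset(
    descent_ratio: float, fontsize: float, scale_by_fontsize: bool
) -> float:
    """Vertical offset applied to the box, with or without the fontsize factor."""
    return descent_ratio * fontsize if scale_by_fontsize else descent_ratio


def bbox_y(
    baseline_y: float,
    height: float,
    descent_ratio: float,
    fontsize: float,
    scale_by_fontsize: bool,
) -> Tuple[float, float]:
    """Return (y0, y1) of a character box shifted by the descent offset."""
    d = descent_offset(descent_ratio, fontsize, scale_by_fontsize)
    return (baseline_y + d, baseline_y + d + height)


if __name__ == "__main__":
    # Made-up values: baseline at y=700, box height 8, descent ratio -0.25, 8pt font.
    print(bbox_y(700.0, 8.0, -0.25, 8.0, scale_by_fontsize=True))
    print(bbox_y(700.0, 8.0, -0.25, 8.0, scale_by_fontsize=False))
```

The gap between the two results grows with the font size, which may be why the effect looks irregular on pages that mix sizes. Whether dropping the factor is right in general is exactly what I am unsure about.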

Is this a bug? Thanks.

Mar 19 '24 09:03 Han860207

https://github.com/pdfminer/pdfminer.six/issues/948#issuecomment-2006396235

Is it possible to follow the above method?

Jul 22 '24 06:07 hl-gl

I tried the above method, but it did not solve the problem.

Jul 22 '24 07:07 hl-gl
