Out-of-order coordinates

Open zhangtingyun opened this issue 1 year ago • 6 comments

I simply called the pdfminer method to parse a PDF, but the coordinates in the parsed result differ from what I expected: sometimes they are too high and sometimes too low. PDF.js handles this PDF correctly.

[image: region_1_0]

I've made some modifications that fix this:

    Tm_mul_CTM = matrix
    Th = scaling
    Tfs = fontsize
    _render_matrix = (Tfs * Th, 0,  # 0
                      0, Tfs,  # 0
                      0, rise  # 1
                      )
    Trm = mult_matrix(_render_matrix, Tm_mul_CTM)
    (a, b, c, d, e, f) = Trm
    w, h = x1 - x0, y1 - y0
    (x0, y0) = (e, f)
    (x1, y1) = (x0 + w, y0 + h)
    y0, y1 = y0 + descent, y1 + descent

[image]
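For readers who want to try the matrix step in isolation, here is a minimal, self-contained sketch of it. It assumes the usual PDF convention of six-element affine matrices `(a, b, c, d, e, f)` applied to row vectors. The names `mult_matrix_sketch` and `text_rendering_matrix` are hypothetical helpers written for this illustration, not pdfminer.six functions, and the sketch does not show whether this change is the right fix for the PDF in this issue.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]


def mult_matrix_sketch(m1: Matrix, m0: Matrix) -> Matrix:
    """Compose two affine matrices: apply m1 first, then m0 (hypothetical helper)."""
    a1, b1, c1, d1, e1, f1 = m1
    a0, b0, c0, d0, e0, f0 = m0
    return (
        a0 * a1 + c0 * b1,
        b0 * a1 + d0 * b1,
        a0 * c1 + c0 * d1,
        b0 * c1 + d0 * d1,
        a0 * e1 + c0 * f1 + e0,
        b0 * e1 + d0 * f1 + f0,
    )


def text_rendering_matrix(
    fontsize: float, scaling: float, rise: float, tm_mul_ctm: Matrix
) -> Matrix:
    """Build Trm from font size (Tfs), horizontal scaling (Th), and rise.

    Assumption: `scaling` is already a fraction (1.0 means 100%), and
    `tm_mul_ctm` is the text matrix already multiplied by the CTM.
    """
    render = (fontsize * scaling, 0.0, 0.0, fontsize, 0.0, rise)
    return mult_matrix_sketch(render, tm_mul_ctm)


# Illustrative values only: 12 pt text placed at (100, 700), no rotation.
trm = text_rendering_matrix(12.0, 1.0, 0.0, (1.0, 0.0, 0.0, 1.0, 100.0, 700.0))
print(trm)  # (12.0, 0.0, 0.0, 12.0, 100.0, 700.0)
```

The last two elements of the result (`e`, `f`) give the glyph origin in page space, which is the point the snippet above uses as `(x0, y0)`.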

zhangtingyun, Feb 23 '24 05:02

I don't know whether my change is correct. Please let me know, or could you fix this bug? Thanks!

zhangtingyun, Feb 23 '24 05:02

page_1.pdf: this is the PDF.

zhangtingyun, Feb 23 '24 05:02

[image: a227a8e0-d20d-11ee-9186-1f397a94c388]

zhangtingyun, Feb 23 '24 05:02

Hi, I'm facing the same issue while using pdfplumber, which is built on pdfminer.six. I'm using pdfminer.six 20221105 and pdfplumber 0.10.4. I also tried repairing the PDF with Ghostscript, as follows:

gswin64c -o repaired.pdf -sDEVICE=pdfwrite input.pdf 

The output file is repaired.pdf. Reference: https://github.com/jsvine/pdfplumber/issues/425

Text extracted from repaired.pdf is still out of order.

[image]

https://github.com/pdfminer/pdfminer.six/blob/ebf7bcdb983f36d0ff5b40e4f23b52525cb28f18/pdfminer/layout.py#L375

I also tried removing "* fontsize" from the descent calculation at the line linked above.

[image]

With that change the result is correct.

[image]
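To make the effect of that change concrete, here is a small sketch that computes a horizontal character's bounding box both ways: with the font descent multiplied by the font size, and with the descent used as-is. It is only loosely modelled on the kind of bounding-box logic discussed here. The function names, the `scale_descent` flag, and all numeric values are hypothetical and chosen for illustration; the sketch does not establish which variant is correct for the PDF in this thread.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]
Point = Tuple[float, float]


def apply_matrix_pt_sketch(m: Matrix, v: Point) -> Point:
    """Apply an affine matrix (a, b, c, d, e, f) to a point (hypothetical helper)."""
    a, b, c, d, e, f = m
    x, y = v
    return (a * x + c * y + e, b * x + d * y + f)


def horizontal_char_bbox(
    matrix: Matrix,
    adv: float,
    fontsize: float,
    rise: float,
    font_descent: float,
    scale_descent: bool = True,
) -> Tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) for one horizontal glyph.

    scale_descent=True  multiplies the descent by the font size.
    scale_descent=False uses the descent unchanged (the variant tried above).
    """
    descent = font_descent * fontsize if scale_descent else font_descent
    lower_left = (0.0, descent + rise)
    upper_right = (adv, descent + rise + fontsize)
    x0, y0 = apply_matrix_pt_sketch(matrix, lower_left)
    x1, y1 = apply_matrix_pt_sketch(matrix, upper_right)
    # Normalise so that (x0, y0) is always the lower-left corner.
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# Illustrative values only: an unscaled matrix with translation (100, 700).
m = (1.0, 0.0, 0.0, 1.0, 100.0, 700.0)
print(horizontal_char_bbox(m, 6.0, 12.0, 0.0, -0.25, scale_descent=True))
# (100.0, 697.0, 106.0, 709.0)
print(horizontal_char_bbox(m, 6.0, 12.0, 0.0, -0.25, scale_descent=False))
# (100.0, 699.75, 106.0, 711.75)
```

The two variants differ only in a vertical shift of the box, so comparing them on a few characters from an affected page is a quick way to see whether the descent scaling accounts for the offset.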

Is this a bug? Thanks.

Han860207, Mar 19 '24 09:03

https://github.com/pdfminer/pdfminer.six/issues/948#issuecomment-2006396235

Is it possible to follow the above method?

hl-gl, Jul 22 '24 06:07

I followed the above method, but the problem is not solved.

hl-gl, Jul 22 '24 07:07