pdfminer.six
Out-of-order coordinates
I called pdfminer's parsing method on a PDF, but the coordinates in the parsed result are not what I expected: some are too high and some are too low. pdf.js handles the same file correctly.
I made the following modification, which fixes the problem for me:
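# Combined text matrix and CTM, as passed in by the caller
Tm_mul_CTM = matrix
Th = scaling
Tfs = fontsize
# Text-state part of the text rendering matrix; the comment on each row
# gives the implicit third column of the 3x3 matrix
_render_matrix = (
    Tfs * Th, 0,  # 0
    0, Tfs,  # 0
    0, rise,  # 1
)
Trm = mult_matrix(_render_matrix, Tm_mul_CTM)
(a, b, c, d, e, f) = Trm
# Keep the size of the previously computed box, but anchor it at the Trm origin
w, h = x1 - x0, y1 - y0
(x0, y0) = (e, f)
(x1, y1) = (x0 + w, y0 + h)
# Shift the box vertically by the font descent
y0, y1 = y0 + descent, y1 + descent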
I don't know whether my change is correct. Please let me know, or could you fix this bug? Thanks!
page_1.pdf (this is the PDF that shows the problem)
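For readers following along: the modification above appears to build the text rendering matrix from the PDF specification, Trm = [Tfs*Th 0 0; 0 Tfs 0; 0 Trise 1] x Tm x CTM. Below is a minimal, standalone sketch of that product. It makes several assumptions: the local `mult_matrix` helper is written here for illustration and is not imported from pdfminer, `text_rendering_matrix` is a hypothetical name, matrices are 6-tuples `(a, b, c, d, e, f)`, and `scaling` is a fraction (1.0 means 100%).

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]


def mult_matrix(m1: Matrix, m0: Matrix) -> Matrix:
    """Compose two affine matrices: apply m1 first, then m0."""
    a1, b1, c1, d1, e1, f1 = m1
    a0, b0, c0, d0, e0, f0 = m0
    return (
        a1 * a0 + b1 * c0,
        a1 * b0 + b1 * d0,
        c1 * a0 + d1 * c0,
        c1 * b0 + d1 * d0,
        e1 * a0 + f1 * c0 + e0,
        e1 * b0 + f1 * d0 + f0,
    )


def text_rendering_matrix(
    tm_mul_ctm: Matrix, fontsize: float, scaling: float, rise: float
) -> Matrix:
    """Hypothetical helper: Trm = text-state matrix x (Tm x CTM)."""
    # Text-state part: horizontal scale, vertical scale, and rise as a y offset
    render = (fontsize * scaling, 0.0, 0.0, fontsize, 0.0, rise)
    return mult_matrix(render, tm_mul_ctm)


# Example: unscaled text placed at (100, 200), 12 pt font, rise of 2
trm = text_rendering_matrix((1.0, 0.0, 0.0, 1.0, 100.0, 200.0), 12.0, 1.0, 2.0)
print(trm)  # (12.0, 0.0, 0.0, 12.0, 100.0, 202.0)
```

The last two components of the result are the glyph origin in device space, which is what the modification uses as the new `(x0, y0)` before applying the descent.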
Hi, I'm facing the same issue with pdfplumber, which is built on pdfminer.six.
In my setup, the pdfminer.six version is 20221105 and the pdfplumber version is 0.10.4.
I also tried repairing the PDF with Ghostscript, as follows:
gswin64c -o repaired.pdf -sDEVICE=pdfwrite input.pdf
The output file is repaired.pdf. Reference: https://github.com/jsvine/pdfplumber/issues/425
Text extracted from repaired.pdf is still out of order.
https://github.com/pdfminer/pdfminer.six/blob/ebf7bcdb983f36d0ff5b40e4f23b52525cb28f18/pdfminer/layout.py#L375
I then tried removing the "* fontsize" factor from the descent calculation at the line linked above, and the result came out correct.
Is this a bug? Thanks.
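To make the effect of that change concrete, here is a small, simplified model of a horizontal glyph bounding box. It is a sketch under assumptions, not pdfminer's actual code: `char_bbox`, its `scale_descent` flag, and the local `apply_matrix_pt` helper are hypothetical names written for this illustration, and `descent` is assumed to be expressed per unit of font size (negative below the baseline).

```python
def apply_matrix_pt(m, pt):
    """Apply a 6-tuple affine matrix (a, b, c, d, e, f) to a point."""
    a, b, c, d, e, f = m
    x, y = pt
    return (a * x + c * y + e, b * x + d * y + f)


def char_bbox(matrix, adv, fontsize, rise, descent, scale_descent=True):
    """Hypothetical model of a horizontal glyph's bounding box.

    With scale_descent=True the descent is multiplied by the font size;
    with scale_descent=False it is used as-is, as in the change described above.
    """
    d = descent * fontsize if scale_descent else descent
    x0, y0 = apply_matrix_pt(matrix, (0.0, d + rise))
    x1, y1 = apply_matrix_pt(matrix, (adv, d + rise + fontsize))
    # Normalize so that (x0, y0) is always the lower-left corner
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# Identity transform placed at (100, 200), 12 pt font, 6-unit advance
m = (1.0, 0.0, 0.0, 1.0, 100.0, 200.0)
print(char_bbox(m, 6.0, 12.0, 0.0, -0.25))
# (100.0, 197.0, 106.0, 209.0)
print(char_bbox(m, 6.0, 12.0, 0.0, -0.25, scale_descent=False))
# (100.0, 199.75, 106.0, 211.75)
```

In this model, dropping the factor shifts the box vertically by `descent * (fontsize - 1)`, so the difference grows with the font size. Whether the scaled or unscaled form is right for a given PDF depends on where the font size is carried (in the font size operand or in the text matrix), which this sketch does not settle.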
https://github.com/pdfminer/pdfminer.six/issues/948#issuecomment-2006396235
Is it possible to follow the above method?
I tried the method above, but it did not solve the problem.