pdftitle
improve space detection and remove pdfminer high level code
Text in a PDF file might not contain a space character. Instead, the space may be indicated by an additional horizontal offset between the glyphs before and after it, that is, between the last character of one word and the first character of the next. pdfminer has high-level code that detects this, i.e. it inserts a space if the gap between characters is greater than a certain threshold (possibly specified in the font file). It would be better to do this manually and also to insert a space when the vertical position changes (a title spanning more than one line). Once this is done, I think the get_title_from_io method can be simplified by removing the TextConverter and PDFPageInterpreter related parts.
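A minimal sketch of the manual approach described above, assuming each glyph is already available with its bounding positions. The `Glyph` class, the `join_glyphs` function, and the `gap_ratio` and `line_tol` thresholds are hypothetical names and values chosen for illustration; they are not part of pdftitle or pdfminer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical glyph record: a character plus its position on the page."""
    char: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.3,
                line_tol: float = 1.0) -> str:
    """Join glyphs, inserting a space on a wide horizontal gap or a line change.

    gap_ratio and line_tol are illustrative thresholds (assumptions), not
    values taken from pdfminer or from a font file.
    """
    parts: List[str] = []
    prev = None
    for g in glyphs:
        if prev is not None:
            # A change in vertical position means the title continues on a new line.
            new_line = abs(g.y - prev.y) > line_tol
            # A gap wider than a fraction of the previous glyph's width means a space.
            wide_gap = (g.x0 - prev.x1) > gap_ratio * (prev.x1 - prev.x0)
            # Avoid doubling a space that is already present in the text.
            if (new_line or wide_gap) and parts[-1] != " " and g.char != " ":
                parts.append(" ")
        parts.append(g.char)
        prev = g
    return "".join(parts)


glyphs = [
    Glyph("a", 0, 5, 100),
    Glyph("b", 5, 10, 100),   # touches "a": same word
    Glyph("c", 13, 18, 100),  # gap of 3 units: new word
    Glyph("d", 0, 5, 88),     # lower baseline: new line
]
print(join_glyphs(glyphs))  # prints "ab c d"
```

Scaling the threshold by the previous glyph's width is only one possible heuristic; a real implementation might instead use the font's space width when it is available.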
Is it currently expected that a headline spanning multiple lines is output in the wrong format (in my case, words on different lines are joined without spaces)?
Figured out what's happening in the current version. I've got PDF titles split over multiple lines, but the lines themselves contain spaces, so the check on line 564 (if " " not in title) doesn't evaluate to True. When I force it, it works in my case. Maybe add an argument to force space correction (or just always correct spaces)?
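To illustrate the guard described above, here is a small sketch under stated assumptions: `pick_title`, its `rebuild` callback, and the `force` parameter are hypothetical names, and the actual code around line 564 may be structured differently.

```python
from typing import Callable


def pick_title(raw_title: str, rebuild: Callable[[], str],
               force: bool = False) -> str:
    """Return a space-corrected title when needed.

    rebuild is a hypothetical callback that reconstructs the title from glyph
    positions. Without force, it only runs when the raw title has no space at
    all, which is why a multi-line title that already contains spaces within
    each line is left uncorrected.
    """
    if force or " " not in raw_title:
        return rebuild()
    return raw_title


raw = "Deep learning forPDF titles"  # lines joined without a space

def fixed() -> str:
    return "Deep learning for PDF titles"

print(pick_title(raw, fixed))              # guard skipped: raw title returned
print(pick_title(raw, fixed, force=True))  # correction forced
```

Always correcting spaces, the other option raised above, would amount to dropping the guard and calling the reconstruction unconditionally.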
Yes, it makes sense to add an argument. If possible, can you share the PDF so it can be used to validate this improvement?
I've sent the PDF files via e-mail for validation. With space correction forced, it now works on some articles, but not yet on all.