pdfalto

Soft hyphens omitted

Open hallsten opened this issue 1 year ago • 3 comments

First, thanks for a great tool!

I have a problem with soft hyphens being omitted from the output (screenshot attached).

Resulting in:

<TextLine HEIGHT="8.3970" HPOS="88.4409" ID="p1_t24" VPOS="235.408" WIDTH="40.9320">
	<String CONTENT="2332022" HEIGHT="8.3970" HPOS="88.4409" ID="p1_w51" STYLEREFS="font4" VPOS="235.408" WIDTH="40.9320"/>
</TextLine>

Is this intentional? Or would it be possible to replace soft hyphens with regular hyphens?

Aug 10 '22 00:08 hallsten

Hi @hallsten !

Thank you for the issue.

Yes, the soft hyphens should not be omitted, but normally they are not visible except at the end of a line. I might be wrong, but your example bitmap looks like it includes a normal hyphen?

Would it be possible to share an error case to work on the problem?

I think the goal would be for the @CONTENT attribute to hold a string value that includes the soft hyphens, although these soft hyphens would not be visible in a text editor.

Replacing the soft hyphens with regular hyphens would really change the string (a soft hyphen is just an indication of where a line may be broken), so if we really can't manage soft hyphens, I suppose it's better to remove them entirely.
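To illustrate the point about visibility, here is a small Python sketch. It assumes the source string is `23`, `3`, `2022` joined by U+00AD, as in the example reported in this thread; the helper name `invisible_characters` is made up for the illustration. It shows that the soft hyphen is a format character that a text editor usually does not render, while still being part of the string.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # U+00AD, Unicode general category Cf (format)

# Assumed source string: day, month and year joined by soft hyphens.
with_soft = f"23{SOFT_HYPHEN}3{SOFT_HYPHEN}2022"

# What is left when the soft hyphens are dropped entirely.
stripped = with_soft.replace(SOFT_HYPHEN, "")


def invisible_characters(text: str) -> list[str]:
    """List the format (Cf) characters in text by code point and name."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch)}"
        for ch in text
        if unicodedata.category(ch) == "Cf"
    ]


print(len(with_soft), len(stripped))
print(invisible_characters(with_soft))
```

Both strings tend to look the same on screen, which is why the difference is easy to miss when inspecting the XML by eye.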

Aug 10 '22 05:08 kermitt2

Thanks for your reply! Here is an example of the problem: soft-hyphens.pdf

pdf2json (https://www.npmjs.com/package/pdf2json), for example, parses this string as 23%C2%AD3%C2%AD2022, and pdftotext replaces the character with <0xad>. It would be great to have some kind of delimiter instead of removing the character: I'm trying to standardize a date, and that is impossible without one.
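As a rough sketch of why the delimiter matters, the percent-encoded string above can be decoded and split on U+00AD in Python. The function name `parse_soft_hyphen_date` and the day-month-year order are assumptions made for this illustration; the thread does not state the date format.

```python
from datetime import date
from urllib.parse import unquote

SOFT_HYPHEN = "\u00ad"


def parse_soft_hyphen_date(raw: str) -> date:
    """Decode a percent-encoded string and read it as day, month, year.

    The day-month-year order is an assumption for this example.
    """
    decoded = unquote(raw)  # "%C2%AD" is the UTF-8 encoding of U+00AD
    day, month, year = (int(part) for part in decoded.split(SOFT_HYPHEN))
    return date(year, month, day)


parsed = parse_soft_hyphen_date("23%C2%AD3%C2%AD2022")
print(parsed.isoformat())
```

With the soft hyphens removed, the same input collapses to `2332022`, and the boundaries between day, month and year can no longer be recovered reliably.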

Aug 10 '22 13:08 hallsten

Thank you for this great tool.

I am looking for a solution to this as well. In my use case it would be better to have all soft hyphens replaced by real (hard) hyphens at the exact same position. If that is an option too, a parameter that does just that would make some people very happy.
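Until such a parameter exists, one possible workaround is a post-processing step over the ALTO output. The sketch below is not a pdfalto feature: the function `harden_soft_hyphens` is hypothetical, and it only helps if the `CONTENT` attributes still contain U+00AD, which is the behaviour described as the goal earlier in this thread rather than what is currently reported.

```python
import xml.etree.ElementTree as ET

SOFT_HYPHEN = "\u00ad"


def harden_soft_hyphens(alto_xml: str, replacement: str = "-") -> str:
    """Replace soft hyphens in every String/@CONTENT with a visible character."""
    root = ET.fromstring(alto_xml)
    for element in root.iter():
        # Match String elements with or without an XML namespace prefix.
        if element.tag == "String" or element.tag.endswith("}String"):
            content = element.get("CONTENT")
            if content is not None:
                element.set("CONTENT", content.replace(SOFT_HYPHEN, replacement))
    return ET.tostring(root, encoding="unicode")


# Hypothetical input: the line from this issue, with the soft hyphens preserved.
sample = (
    '<TextLine ID="p1_t24">'
    '<String CONTENT="23\u00ad3\u00ad2022" ID="p1_w51"/>'
    "</TextLine>"
)
hardened = harden_soft_hyphens(sample)
print(hardened)
```

Keeping the replacement character as a parameter leaves room for callers who prefer another delimiter or want the soft hyphens dropped (`replacement=""`).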

Feb 01 '23 15:02 Seehafengepard