
fix: bind punctuation to following word when needed

Open cblanken opened this issue 5 months ago • 4 comments

This PR fixes issue #328.

The existing punctuation spacing implementation uses a spaceAfter attribute to force a space (right padding) after punctuation listed in the tokens_with_no_space_before config array. For many punctuation marks this reduces readability, because a mark such as an opening quotation mark appears bound to the previous word when it should be bound to the next one.

For example:

`Tommy said, "Hello there".` would be incorrectly rendered as `Tommy said," Hello there".`

This fix propagates the is_punct and whitespace_ attributes from the spaCy tokenization, along with a boolean (space_before) that marks tokens preceded by whitespace. Together they are used to arrange the spacing around punctuation more accurately.
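To illustrate the idea, here is a minimal sketch, not the code from this PR. It assumes each token carries `text`, `whitespace_` (the trailing whitespace, as on a spaCy token) and `is_punct`. The `Token` dataclass and the helpers `add_space_before`, `binds_to_next` and `render` are hypothetical names made up for the example, and the token list is written by hand so the sketch runs without spaCy installed.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for the attributes taken from a spaCy token."""
    text: str
    whitespace_: str  # trailing whitespace after the token ("" or " ")
    is_punct: bool


def add_space_before(tokens):
    """Annotate each token with space_before, derived from the previous
    token's trailing whitespace."""
    annotated = []
    prev_whitespace = ""
    for tok in tokens:
        annotated.append({
            "text": tok.text,
            "is_punct": tok.is_punct,
            "space_before": prev_whitespace != "",
            "space_after": tok.whitespace_ != "",
        })
        prev_whitespace = tok.whitespace_
    return annotated


def binds_to_next(item):
    """Assumed rule: punctuation with whitespace before it but none after it
    (for example an opening quote) belongs to the following word."""
    return item["is_punct"] and item["space_before"] and not item["space_after"]


def render(annotated):
    """Rebuild the text, placing a space only where space_before is set."""
    parts = []
    for item in annotated:
        if item["space_before"]:
            parts.append(" ")
        parts.append(item["text"])
    return "".join(parts)


# Hand-written tokens for the example sentence from this description.
tokens = [
    Token("Tommy", " ", False),
    Token("said", "", False),
    Token(",", " ", True),
    Token('"', "", True),
    Token("Hello", " ", False),
    Token("there", "", False),
    Token('"', "", True),
    Token(".", "", True),
]

annotated = add_space_before(tokens)
rendered = render(annotated)
bound_to_next = [i for i, item in enumerate(annotated) if binds_to_next(item)]
```

Under these assumptions the opening quote (index 3) is the only punctuation token bound to the following word, while the comma, closing quote and full stop stay attached to the word before them.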

@simjanos-dev I made some progress on this, but I've only tested it with German so far. It seems to work well, but I was wondering if you had a particular procedure or some sample texts you used to test? I want to make sure my changes didn't break rendering for the other languages.

New rendering

image

Previous rendering

image

cblanken, Sep 23 '24 23:09