LinguaCafe
fix: bind punctuation to following word when needed
This PR fixes issue #328.
The existing punctuation spacing implementation uses a `spaceAfter` attribute to force a space (right padding) after punctuation from the `tokens_with_no_space_before` config array. However, for many punctuation marks this reduces readability, since punctuation (like quotation marks) may appear to be bound to the previous word when it should be bound to the next word.
For example, `Tommy said, "Hello there".` would be incorrectly rendered as `Tommy said," Hello there".`
This fix propagates the `is_punct` and `whitespace_` attributes from the spaCy tokenization, plus a boolean (`space_before`) that marks tokens preceded by whitespace. Together they are used to arrange the space surrounding punctuation more accurately.
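To illustrate the idea, here is a minimal, self-contained sketch in Python. It is not the code from this PR: the `Token` dataclass, `make_tokens`, `add_space_before`, and both render functions are hypothetical names, and the "old rule" is a simplified stand-in for the `spaceAfter` behavior described above. Only the field names `is_punct`, `whitespace_`, and `space_before` come from the description, and `whitespace_` is assumed to hold the token's trailing whitespace, as it does in spaCy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    text: str
    is_punct: bool              # mirrors spaCy's Token.is_punct
    whitespace_: str            # mirrors spaCy's Token.whitespace_ (trailing whitespace)
    space_before: bool = False  # derived below: was this token preceded by whitespace?


def make_tokens(raw: List[Tuple[str, bool, str]]) -> List[Token]:
    """Build hypothetical tokens from (text, is_punct, whitespace_) tuples."""
    return [Token(text, is_punct, ws) for text, is_punct, ws in raw]


def add_space_before(tokens: List[Token]) -> List[Token]:
    """Set space_before from the previous token's trailing whitespace."""
    for i, tok in enumerate(tokens):
        tok.space_before = i > 0 and tokens[i - 1].whitespace_ != ""
    return tokens


def render_space_after_punct(tokens: List[Token]) -> str:
    """Simplified old rule: punctuation hugs the previous word and is followed by a space."""
    out = ""
    for tok in tokens:
        if tok.is_punct:
            out = out.rstrip(" ")  # no space before punctuation
        out += tok.text + " "      # always pad on the right
    return out.rstrip(" ")


def render_with_space_before(tokens: List[Token]) -> str:
    """Sketch of the new rule: emit a space only where the source text had one."""
    return "".join((" " if tok.space_before else "") + tok.text for tok in tokens)


# Tokens for: Tommy said, "Hello there".
tokens = add_space_before(make_tokens([
    ("Tommy", False, " "),
    ("said", False, ""),
    (",", True, " "),
    ('"', True, ""),
    ("Hello", False, " "),
    ("there", False, ""),
    ('"', True, ""),
    (".", True, ""),
]))

old_rendering = render_space_after_punct(tokens)   # Tommy said," Hello there".
new_rendering = render_with_space_before(tokens)   # Tommy said, "Hello there".
```

Because `space_before` is derived from the original text rather than from a fixed list of punctuation marks, the opening quotation mark stays attached to `Hello` and the comma keeps its trailing space.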
@simjanos-dev I made some progress on this, but I've only tested it with German so far. It seems to work well, but I was wondering if you had a particular procedure or some sample texts you used to test? I want to make sure my changes didn't break rendering for the other languages.