Update the text-transform "capitalize" behaviour

Open VeteraNovis opened this issue 3 years ago • 3 comments

Transform the first typographic letter unit of each word to uppercase without modifying other characters

See issue Kozea#1612

VeteraNovis avatar Aug 07 '22 12:08 VeteraNovis

Hi!

Thanks a lot for your contribution, it’s nice and seems to fix the original bug.

The only concern we have is adding a new external dependency. We’re trying hard to avoid new external libraries, especially when they are written in C (like regex), because they can make installation painful for users.

It’s probably possible to get the same results using Pango. There’s no obvious solution in Pango’s API (i.e. we’ll have to write some code 😀), but we already have code that gets text boundaries in get_next_word_boundaries.

liZe avatar Aug 07 '22 15:08 liZe

Hey mate. That's a good point.

I'll see what I can do to do away with the external dependency, and update the PR.

I've been using WeasyPrint for a while now, so it feels good to be able to contribute.

VeteraNovis avatar Aug 07 '22 16:08 VeteraNovis

After doing more research, I don't think this is feasible with regular expressions unless we use the third-party regex module. The way the standard "re" module handles word boundaries is insufficient for Unicode text outside of standard Latin characters.
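
To illustrate one such case (a minimal sketch, and not necessarily the exact failure reported in the original issue): in "re", a combining mark does not count as a word character, so `\b` reports a boundary in the middle of a word whose first letter is written in decomposed form. The function name `naive_capitalize` is made up for this example.

```python
import re


def naive_capitalize(text):
    """Uppercase every word character that directly follows a \\b boundary."""
    return re.sub(r'\b\w', lambda match: match.group().upper(), text)


# "école" written as "e" + U+0301 (combining acute accent).
# The accent is not matched by \w, so "re" sees a boundary right after
# the "e" and another one right before the "c".
decomposed = 'e\u0301cole'
print(naive_capitalize(decomposed) == 'E\u0301Cole')  # True: "c" is wrongly uppercased
```

Plain ASCII input such as "hello world" comes out as expected, which is why the problem is easy to miss.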

For that reason, I have opted to use the unicodedata module, which is already imported, to determine the Unicode general category of each character. Words are delimited by characters from the Separator category, and the first typographic letter unit of each word (the first character in the Letter or Number category) is then easy to find.
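
Roughly, the approach described above can be sketched as follows. This is a simplified illustration rather than the exact code in the PR; the function name `capitalize_words` and the flag `word_start` are chosen for this example only.

```python
import unicodedata


def capitalize_words(text):
    """Uppercase the first letter or number of each word, leave the rest as-is."""
    result = []
    word_start = True  # True until the current word's first letter/number is seen
    for char in text:
        # General categories look like 'Lu', 'Nd', 'Zs', 'Mn'; the first
        # character is the major class (Letter, Number, Separator, Mark...).
        major_class = unicodedata.category(char)[0]
        if major_class == 'Z':
            # Any Separator (Zs, Zl, Zp) ends the current word.
            word_start = True
        elif word_start and major_class in 'LN':
            # First Letter or Number of the word: uppercase it. Digits are
            # unchanged by upper() but still count as the word's first unit.
            char = char.upper()
            word_start = False
        result.append(char)
    return ''.join(result)


print(capitalize_words('hello wORLD'))  # Hello WORLD
print(capitalize_words('e\u0301cole primaire') == 'E\u0301cole Primaire')  # True
```

One assumption worth noting: only Separator characters split words here, so control characters such as tabs and newlines (category Cc) do not start a new word in this sketch.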

This removes the dependency on the external module and requires no additional imports. Hopefully this solution is a better fit!

I did like the simplicity of the previous solution, but I agree with, and appreciate, your mindset behind it.

VeteraNovis avatar Aug 10 '22 14:08 VeteraNovis

Thanks a lot!

We’ll merge your PR, clean a couple of things and add tests soon.

liZe avatar Aug 15 '22 09:08 liZe