modular icon indicating copy to clipboard operation
modular copied to clipboard

[stdlib] Add full unicode support for character casing functions

Open mzaks opened this issue 1 year ago • 5 comments

The code I used to generate the lookup tables can be found here https://gist.github.com/mzaks/bbadaeebcf81a5200021af041568b26b

mzaks avatar Sep 18 '24 15:09 mzaks

!sync

JoeLoser avatar Sep 19 '24 13:09 JoeLoser

Hi @mzaks this is great! . Just a bit of a nitpick: I think _unicode.mojo and _unicode_lookups.mojo should be in the utils package

Very good point. Just moved them to utils.

mzaks avatar Sep 19 '24 14:09 mzaks

!sync

JoeLoser avatar Sep 19 '24 14:09 JoeLoser

I tried a performance improvement based on static B+Tree described here: https://en.algorithmica.org/hpc/data-structures/s-tree/

It looks promising, but I think it will be better as a separate PR. @JoeLoser what do you think?

mzaks avatar Sep 21 '24 16:09 mzaks

https://en.algorithmica.org/hpc/data-structures/s-tree/

what an awesome book, I have new reading material for a couple of months...

I tried a performance improvement based on static B+Tree

I think in this case it might or it might not be better. The lists are static and known at compile time yet the binary search is generic. I think you could just hardcode values and ranges here, though I'm not sure how that much branching would affect performance. But there are some quite big ranges is some places e.g.:

alias has_uppercase_mapping = List[UInt32, hint_trivial_type=True](
    ...
    0x007A,  # LATIN SMALL LETTER Z z
    0x00B5,  # MICRO SIGN µ
    0x00E0,  # LATIN SMALL LETTER A WITH GRAVE à
    ...
    0x0586,  # ARMENIAN SMALL LETTER FEH ֆ
    0x10D0,  # GEORGIAN LETTER AN ა
    ...
    0x2D2D,  # GEORGIAN SMALL LETTER AEN ⴭ
    0xA641,  # CYRILLIC SMALL LETTER ZEMLYA ꙁ
    ...
)

And if we are going to nitpick performance, I think ASCII letters should be prioritized and not use a generic algorithm for something that is realistically mostly going to be used for the first X amount of letters (we could use an ASCII optimized version and have another Unicode one, at the cost of checking the first UTF8 byte and the branch that follows)

martinvuyk avatar Sep 21 '24 20:09 martinvuyk

!sync

JoeLoser avatar Oct 16 '24 13:10 JoeLoser

FYI I'm chasing after the issues internally that this PR is hitting, I'll have an update soon.

JoeLoser avatar Nov 18 '24 17:11 JoeLoser

FYI I'm chasing after the issues internally that this PR is hitting, I'll have an update soon.

All merged now @mzaks! Thanks for the contribution 🚀

JoeLoser avatar Nov 18 '24 18:11 JoeLoser

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions, click here to learn more.

modularbot avatar Nov 18 '24 18:11 modularbot

I tried a performance improvement based on static B+Tree described here: https://en.algorithmica.org/hpc/data-structures/s-tree/

It looks promising, but I think it will be better as a separate PR. @JoeLoser what do you think?

Agree re separate PR if you think there's perf optimizations to be had here. Always like keeping it simpler as we're getting started.

JoeLoser avatar Nov 19 '24 13:11 JoeLoser

I tried a performance improvement based on static B+Tree described here: https://en.algorithmica.org/hpc/data-structures/s-tree/ It looks promising, but I think it will be better as a separate PR. @JoeLoser what do you think?

Agree re separate PR if you think there's perf optimizations to be had here. Always like keeping it simpler as we're getting started.

Yes I was also thinking along the same line, this PR is quite simple and provides a correct implementation of casing. I will create follow up PRs to introduce performance relevant changes.

mzaks avatar Nov 19 '24 16:11 mzaks

Landed in f18cd74f2986aa52274564ac55afe589a7a3c6f6! Thank you for your contribution 🎉

modularbot avatar Nov 20 '24 22:11 modularbot