Investigate table compression
We can save about 37 KB of table data, possibly improving performance as well.
See
- https://x.com/the_moisrex/status/1880430729584419113?s=61&t=v2gDAuOzz1C3ICrzUYCSlQ
- https://t.me/s/webpp?before=588
Also, Unicode 16.0 has been released; it brings some updates to the normalization algorithms and some new code points that most likely change the tables.
I also remember that your tables contained NFKC and NFKD composition and decomposition code points that are not being used, but I might be thinking of another repo!
These are mostly in your slow-path algorithms, so I'm aware they're not a priority.
The saving is about 110 KiB:
- IDNA Mapping tables: yours 84.47 KiB -> mine 45.03 KiB
- Decomposition tables: yours 73.43 KiB -> mine 41 KiB
- Composition tables: yours 45.23 KiB -> mine 14 KiB
- CCC tables: yours 21 KiB -> mine 13 KiB

So, I have saved 224.13 - 113.03 ~= 111 KiB.
You can save ~7 KiB on the Bidi tables as well, and also move to an O(1) lookup instead of a binary search.
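To make the O(1) suggestion concrete, here is a minimal sketch of a two-stage (block-deduplicated) lookup table, which replaces the binary search with two array reads. Everything in it is hypothetical: `bidi_class`, `reference_class`, and `two_stage_table` are names made up for this example and are not ada-url's API, the classifier is a toy that only knows the Hebrew default range, and the tables are built at run time here, whereas a real implementation would presumably generate them offline.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, heavily reduced set of Bidi classes (illustration only).
enum class bidi_class : std::uint8_t { L, R };

// Toy stand-in for the real Unicode data: only the Hebrew block is R.
inline bidi_class reference_class(char32_t cp) {
    return (cp >= 0x0590 && cp <= 0x05FF) ? bidi_class::R : bidi_class::L;
}

struct two_stage_table {
    static constexpr std::uint32_t block_bits = 8;
    static constexpr std::uint32_t block_size = 1u << block_bits;
    static constexpr std::uint32_t max_cp = 0x10FFFF;

    // stage1: block number (cp >> 8) -> index of a unique block in stage2.
    std::vector<std::uint16_t> stage1;
    // stage2: unique 256-entry blocks; identical blocks are stored once.
    std::vector<std::array<bidi_class, block_size>> stage2;

    two_stage_table() {
        for (std::uint32_t base = 0; base <= max_cp; base += block_size) {
            std::array<bidi_class, block_size> block{};
            for (std::uint32_t i = 0; i < block_size; ++i) {
                block[i] = reference_class(base + i);
            }
            // Reuse an existing identical block if there is one.
            std::size_t idx = 0;
            while (idx < stage2.size() && !(stage2[idx] == block)) {
                ++idx;
            }
            if (idx == stage2.size()) {
                stage2.push_back(block);
            }
            stage1.push_back(static_cast<std::uint16_t>(idx));
        }
    }

    // Two array reads, independent of how many ranges the data contains.
    bidi_class lookup(char32_t cp) const {
        if (cp > max_cp) {
            return bidi_class::L;  // out of range: arbitrary fallback for the sketch
        }
        return stage2[stage1[cp >> block_bits]][cp & (block_size - 1)];
    }
};
```

With this toy data only two unique blocks survive deduplication, which is where the size saving comes from; real Bidi data will need more blocks, and the block size is a tuning knob.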
Also, I believe code points covered only by the @missing lines in DerivedBidiClass are returned with direction None, whereas these @missing values are what should be returned:
# All code points not explicitly listed for Bidi_Class
# have the value Left_To_Right (L).
# @missing: 0000..10FFFF; Left_To_Right
# 0590..05FF Hebrew
# @missing: 0590..05FF; Right_To_Left
# 0600..06FF Arabic
# 0700..074F Syriac
# 0750..077F Arabic_Supplement
# 0780..07BF Thaana
# @missing: 0600..07BF; Arabic_Letter
# 07C0..07FF NKo
# 0800..083F Samaritan
# 0840..085F Mandaic
# @missing: 07C0..085F; Right_To_Left
# 0860..086F Syriac_Supplement
# 0870..089F Arabic_Extended_B
# 08A0..08FF Arabic_Extended_A
# @missing: 0860..08FF; Arabic_Letter
# 20A0..20CF Currency_Symbols
# @missing: 20A0..20CF; European_Terminator
# FB00..FB4F Alphabetic_Presentation_Forms (partial)
# @missing: FB1D..FB4F; Right_To_Left
# FB50..FDFF Arabic_Presentation_Forms_A (partial)
# @missing: FB50..FDCF; Arabic_Letter
# FB50..FDFF Arabic_Presentation_Forms_A (partial)
# @missing: FDF0..FDFF; Arabic_Letter
# FE70..FEFF Arabic_Presentation_Forms_B
# @missing: FE70..FEFF; Arabic_Letter
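As a rough illustration of what honoring these defaults could look like, the sketch below maps a code point to its default class using only the @missing ranges quoted above. It is an assumption-laden example, not ada-url code: `bidi_default`, `default_range`, and `default_bidi_class` are hypothetical names, the real DerivedBidiClass.txt may contain further @missing lines that are not quoted here, and explicit assignments in the file would always take precedence over these defaults.

```cpp
#include <cstdint>

// Hypothetical enum covering only the classes named in the quoted excerpt.
enum class bidi_default : std::uint8_t { L, R, AL, ET };

struct default_range {
    char32_t first;
    char32_t last;
    bidi_default value;
};

// Only the @missing ranges quoted in this issue, in ascending order.
constexpr default_range default_ranges[] = {
    {0x0590, 0x05FF, bidi_default::R},   // Hebrew
    {0x0600, 0x07BF, bidi_default::AL},  // Arabic .. Thaana
    {0x07C0, 0x085F, bidi_default::R},   // NKo .. Mandaic
    {0x0860, 0x08FF, bidi_default::AL},  // Syriac_Supplement .. Arabic_Extended_A
    {0x20A0, 0x20CF, bidi_default::ET},  // Currency_Symbols
    {0xFB1D, 0xFB4F, bidi_default::R},   // Alphabetic_Presentation_Forms (partial)
    {0xFB50, 0xFDCF, bidi_default::AL},  // Arabic_Presentation_Forms_A (partial)
    {0xFDF0, 0xFDFF, bidi_default::AL},  // Arabic_Presentation_Forms_A (partial)
    {0xFE70, 0xFEFF, bidi_default::AL},  // Arabic_Presentation_Forms_B
};

// Default class for a code point with no explicit Bidi_Class entry.
// Falls back to L, matching "@missing: 0000..10FFFF; Left_To_Right".
constexpr bidi_default default_bidi_class(char32_t cp) {
    for (const auto& r : default_ranges) {
        if (cp >= r.first && cp <= r.last) {
            return r.value;
        }
    }
    return bidi_default::L;
}
```

A lookup that finds no explicit entry could then return `default_bidi_class(cp)` instead of a None value; with table generation done offline, the same defaults could simply be baked into the tables.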
I'm not even sure it affects anything at all, but if it does anywhere, it would be here:
https://github.com/ada-url/idna/blob/513c81448e0ea8954da46a577ae75476d4ae8a51/src/validity.cpp#L1273-L1290
Here are some randomly selected code points (in decimal) that return None while their actual value is something else (most likely L):
888 (U+0378)
889
912053
402079
795382
847067
369020
754907
374357
306022
782383
632383
307901
624072
106431
112636
817816
862976
489736
758840
109482
221356
1114111
For example, the Bidi_Class of U+0378 is Left_To_Right (L), while find_direction returns direction::None.
Again, this might be handled algorithmically; I'm not yet sure whether it has any impact on ada-url.