Investigate table compression

Open lemire opened this issue 11 months ago • 3 comments

We can save about 37KB of tables, while possibly improving the performance.

See

  • https://x.com/the_moisrex/status/1880430729584419113?s=61&t=v2gDAuOzz1C3ICrzUYCSlQ
  • https://t.me/s/webpp?before=588

lemire avatar Jan 18 '25 15:01 lemire

Also, Unicode 16.0 has been released; it has some updates to the normalization algorithms and some new code points that (most likely) change the tables.

Also, I recall that your tables include NFKC and NFKD composition and decomposition code points that are not being used, but I might be remembering another repo!

These mostly affect your slow-path algorithms, so I'm aware they're not a priority.

the-moisrex avatar Jan 18 '25 15:01 the-moisrex

The saving is about 110 KiB:

  • IDNA Mapping tables: yours 84.47 KiB -> mine 45.03 KiB
  • Decomposition tables: yours 73.43 KiB -> mine 41 KiB
  • Composition tables: yours 45.23 KiB -> mine 14 KiB
  • CCC tables: yours 21 KiB -> mine 13 KiB

So, in total, I have saved 224.13 KiB - 113.03 KiB ≈ 111 KiB.

the-moisrex avatar Mar 08 '25 12:03 the-moisrex

You can save ~7 KiB on the Bidi tables as well, and also move to an O(1) lookup instead of a binary search.
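One common way to get an O(1) lookup is a two-stage table: the code point space is cut into fixed-size blocks, identical blocks are stored once, and a lookup is two array reads. The sketch below is only an illustration of that general technique, not the table layout actually proposed in this thread (which is not shown here). All names (`bidi`, `bidi_range`, `two_stage_table`) and the block size of 256 are assumptions made for the example, and the table is built at run time for brevity, whereas a real implementation would likely generate it ahead of time.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names throughout; this is not ada-url's actual direction type.
enum class bidi : std::uint8_t { L, R, AL, ET };

struct bidi_range {
  std::uint32_t first, last;  // inclusive code point range
  bidi cls;
};

constexpr std::uint32_t block_bits = 8;  // assumed: 256 code points per block
constexpr std::uint32_t block_size = 1u << block_bits;
constexpr std::uint32_t max_cp = 0x10FFFF;

class two_stage_table {
 public:
  // Code points not covered by any range receive `fallback`.
  two_stage_table(const std::vector<bidi_range>& ranges, bidi fallback)
      : fallback_(fallback) {
    const std::uint32_t block_count = (max_cp >> block_bits) + 1;
    stage1_.reserve(block_count);
    for (std::uint32_t b = 0; b < block_count; ++b) {
      const std::uint32_t base = b << block_bits;
      block_t block;
      block.fill(fallback);
      for (const bidi_range& r : ranges) {
        // Clip the range to this block; the loop is empty if they do not overlap.
        const std::uint32_t lo = std::max(r.first, base);
        const std::uint32_t hi = std::min(r.last, base + block_size - 1);
        for (std::uint32_t cp = lo; cp <= hi; ++cp) block[cp - base] = r.cls;
      }
      stage1_.push_back(intern(block));
    }
  }

  // Two indexed loads, independent of how many ranges were supplied.
  bidi lookup(std::uint32_t cp) const {
    if (cp > max_cp) return fallback_;
    return stage2_[stage1_[cp >> block_bits]][cp & (block_size - 1)];
  }

  std::size_t distinct_blocks() const { return stage2_.size(); }

 private:
  using block_t = std::array<bidi, block_size>;

  // Store each distinct block once and return its index in stage 2.
  std::uint16_t intern(const block_t& block) {
    const auto it = std::find(stage2_.begin(), stage2_.end(), block);
    if (it != stage2_.end()) {
      return static_cast<std::uint16_t>(it - stage2_.begin());
    }
    assert(stage2_.size() < 0xFFFF);
    stage2_.push_back(block);
    return static_cast<std::uint16_t>(stage2_.size() - 1);
  }

  bidi fallback_;
  std::vector<std::uint16_t> stage1_;  // block number -> index into stage2_
  std::vector<block_t> stage2_;        // deduplicated blocks
};
```

Constructing `two_stage_table t(ranges, bidi::L)` and calling `t.lookup(cp)` replaces the binary search. Because most blocks are entirely one class, deduplication keeps stage 2 small, which is where the size reduction in this kind of layout comes from.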

Also, I believe that code points covered only by the @missing lines of DerivedBidiClass currently get None as their direction, whereas these are the default values that should be returned:


#  All code points not explicitly listed for Bidi_Class
#  have the value Left_To_Right (L).

# @missing: 0000..10FFFF; Left_To_Right

# 0590..05FF Hebrew
# @missing: 0590..05FF; Right_To_Left

# 0600..06FF Arabic
# 0700..074F Syriac
# 0750..077F Arabic_Supplement
# 0780..07BF Thaana
# @missing: 0600..07BF; Arabic_Letter

# 07C0..07FF NKo
# 0800..083F Samaritan
# 0840..085F Mandaic
# @missing: 07C0..085F; Right_To_Left

# 0860..086F Syriac_Supplement
# 0870..089F Arabic_Extended_B
# 08A0..08FF Arabic_Extended_A
# @missing: 0860..08FF; Arabic_Letter

# 20A0..20CF Currency_Symbols
# @missing: 20A0..20CF; European_Terminator

# FB00..FB4F Alphabetic_Presentation_Forms (partial)
# @missing: FB1D..FB4F; Right_To_Left

# FB50..FDFF Arabic_Presentation_Forms_A (partial)
# @missing: FB50..FDCF; Arabic_Letter

# FB50..FDFF Arabic_Presentation_Forms_A (partial)
# @missing: FDF0..FDFF; Arabic_Letter

# FE70..FEFF Arabic_Presentation_Forms_B
# @missing: FE70..FEFF; Arabic_Letter

I'm not sure whether this can affect anything at all, but if it does anywhere, it would be here:

https://github.com/ada-url/idna/blob/513c81448e0ea8954da46a577ae75476d4ae8a51/src/validity.cpp#L1273-L1290

These are some randomly selected code points (in decimal) that return None while their value is something else (most likely L):

 888 (U+0378)
 889
 912053
 402079
 795382
 847067
 369020
 754907
 374357
 306022
 782383
 632383
 307901
 624072
 106431
 112636
 817816
 862976
 489736
 758840
 109482
 221356
 1114111

For example, the Bidi_Class of U+0378 is Left_To_Right (L), while find_direction returns direction::None.
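A minimal sketch of such a fallback, assuming only the @missing lines quoted above: a code point with no explicit entry gets the default of the range it falls in, and Left_To_Right otherwise. The names `bidi_class`, `missing_range`, and `default_bidi_class` are hypothetical and not part of ada-url, and the full DerivedBidiClass file lists more defaults than the ones quoted in this comment, so the table here is incomplete.

```cpp
#include <cstdint>

// Hypothetical enum; only the classes named in the quoted @missing lines.
enum class bidi_class : std::uint8_t { L, R, AL, ET };

struct missing_range {
  std::uint32_t first, last;  // inclusive code point range
  bidi_class cls;
};

// Transcribed from the @missing lines quoted above (not the full file).
constexpr missing_range missing_ranges[] = {
    {0x0590, 0x05FF, bidi_class::R},   // Hebrew
    {0x0600, 0x07BF, bidi_class::AL},  // Arabic .. Thaana
    {0x07C0, 0x085F, bidi_class::R},   // NKo .. Mandaic
    {0x0860, 0x08FF, bidi_class::AL},  // Syriac_Supplement .. Arabic_Extended_A
    {0x20A0, 0x20CF, bidi_class::ET},  // Currency_Symbols
    {0xFB1D, 0xFB4F, bidi_class::R},   // Alphabetic_Presentation_Forms (partial)
    {0xFB50, 0xFDCF, bidi_class::AL},  // Arabic_Presentation_Forms_A (partial)
    {0xFDF0, 0xFDFF, bidi_class::AL},  // Arabic_Presentation_Forms_A (partial)
    {0xFE70, 0xFEFF, bidi_class::AL},  // Arabic_Presentation_Forms_B
};

// Default class for a code point that has no explicit Bidi_Class entry.
constexpr bidi_class default_bidi_class(std::uint32_t cp) {
  for (const missing_range& r : missing_ranges) {
    if (cp >= r.first && cp <= r.last) return r.cls;
  }
  return bidi_class::L;  // "@missing: 0000..10FFFF; Left_To_Right"
}

static_assert(default_bidi_class(0x0378) == bidi_class::L,
              "U+0378 falls back to Left_To_Right");
```

A lookup that currently ends in direction::None could consult a function like this instead of returning None directly.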

Again, this might already be handled algorithmically; I'm not yet sure whether it has an impact in ada-url or not.

the-moisrex avatar Mar 29 '25 04:03 the-moisrex