unicode-transforms
unicode-transforms copied to clipboard
WIP: Quick check implementation
Implement Quick Check algorithm.
This is very much WIP; it is intended to open the discussion on the implementation.
It relies on an ongoing PR of unicode-data.
- [ ] Add proper benchmarks (not/almost) normalized.
- [ ] ~~Release
unicode-datawith QuickCheck properties.~~ - [ ] ~~Delete
cabal.project;~~ changestack.yaml~~ands390x.yaml~~ to useunicode-datarelease.
Fixes #1
Latest benchmark
All
unicode-transforms-text
NFD/AllChars: OK (2.76s)
5.41 ms ± 126 μs, 7% faster than baseline
NFD/Deutsch: OK (0.59s)
2.27 ms ± 66 μs, 27% faster than baseline
NFD/Devanagari: OK (0.33s)
5.09 ms ± 184 μs
NFD/English: OK (13.85s)
1.67 ms ± 11 μs, 32% faster than baseline
NFD/Japanese: OK (0.42s)
6.49 ms ± 227 μs
NFD/Korean: OK (1.22s)
9.58 ms ± 99 μs
NFD/Vietnamese: OK (0.30s)
4.68 ms ± 179 μs, 17% faster than baseline
NFKD/AllChars: OK (1.11s)
8.68 ms ± 201 μs, 11% faster than baseline
NFKD/Deutsch: OK (0.63s)
2.42 ms ± 65 μs, 32% faster than baseline
NFKD/Devanagari: OK (0.35s)
5.37 ms ± 207 μs
NFKD/English: OK (4.06s)
2.00 ms ± 24 μs, 34% faster than baseline
NFKD/Japanese: OK (0.47s)
7.51 ms ± 238 μs, 8% slower than baseline
NFKD/Korean: OK (0.64s)
10.0 ms ± 292 μs
NFKD/Vietnamese: OK (0.63s)
4.95 ms ± 167 μs, 18% faster than baseline
NFC/AllChars: OK (1.24s)
9.80 ms ± 92 μs, 10% slower than baseline
NFC/Deutsch: OK (0.53s)
4.16 ms ± 99 μs, 14% slower than baseline
NFC/Devanagari: OK (0.82s)
6.39 ms ± 92 μs, 8% slower than baseline
NFC/English: OK (0.44s)
3.40 ms ± 121 μs, 18% slower than baseline
NFC/Japanese: OK (0.64s)
10.1 ms ± 186 μs
NFC/Korean: OK (0.31s)
4.89 ms ± 174 μs, 3% slower than baseline
NFC/Vietnamese: OK (0.39s)
12.5 ms ± 463 μs, 6% slower than baseline
NFKC/AllChars: OK (0.44s)
14.0 ms ± 368 μs
NFKC/Deutsch: OK (0.32s)
5.28 ms ± 210 μs, 31% slower than baseline
NFKC/Devanagari: OK (0.42s)
6.52 ms ± 170 μs, 13% slower than baseline
NFKC/English: OK (0.47s)
3.64 ms ± 103 μs, 41% slower than baseline
NFKC/Japanese: OK (0.36s)
11.7 ms ± 464 μs
NFKC/Korean: OK (1.40s)
5.53 ms ± 64 μs, 18% slower than baseline
NFKC/Vietnamese: OK (0.40s)
12.8 ms ± 447 μs, 6% slower than baseline
So some improvements for decomposition but composition is slower. NFKC is also failling time to time when compared to ICU.
Edited: use <summary> tag.
@harendra-kumar I managed to obtain nice improvements:
Benchmark results
All
unicode-transforms-text
NFD/AllChars: OK (1.31s)
5.13 ms ± 66 μs, 9.2 MB allocated, 815 B copied, 31 MB peak memory, 0.72x, 17% faster than baseline
NFD/Deutsch: OK (1.12s)
2.19 ms ± 24 μs, 6.3 MB allocated, 807 B copied, 31 MB peak memory, 0.62x, 33% faster than baseline
NFD/Devanagari: OK (1.21s)
4.70 ms ± 44 μs, 5.7 MB allocated, 799 B copied, 31 MB peak memory, 0.61x, 7% faster than baseline
NFD/English: OK (1.70s)
1.65 ms ± 17 μs, 5.7 MB allocated, 791 B copied, 31 MB peak memory, 0.54x, 41% faster than baseline
NFD/Japanese: OK (0.77s)
6.02 ms ± 84 μs, 9.3 MB allocated, 775 B copied, 31 MB peak memory, 0.70x, 2% faster than baseline
NFD/Korean: OK (1.25s)
9.71 ms ± 154 μs, 15 MB allocated, 1.6 KB copied, 32 MB peak memory, 0.32x
NFD/Vietnamese: OK (1.59s)
6.21 ms ± 44 μs, 9.3 MB allocated, 759 B copied, 32 MB peak memory, 1.12x, 11% faster than baseline
NFKD/AllChars: OK (1.14s)
8.97 ms ± 101 μs, 12 MB allocated, 1.5 KB copied, 32 MB peak memory, 0.94x, 9% faster than baseline
NFKD/Deutsch: OK (1.40s)
2.71 ms ± 50 μs, 6.3 MB allocated, 743 B copied, 32 MB peak memory, 0.77x, 23% faster than baseline
NFKD/Devanagari: OK (0.63s)
4.91 ms ± 96 μs, 5.7 MB allocated, 735 B copied, 32 MB peak memory, 0.63x, 9% faster than baseline
NFKD/English: OK (1.14s)
2.20 ms ± 22 μs, 5.7 MB allocated, 727 B copied, 32 MB peak memory, 0.72x, 27% faster than baseline
NFKD/Japanese: OK (0.85s)
6.64 ms ± 117 μs, 9.7 MB allocated, 729 B copied, 32 MB peak memory, 0.70x, 3% faster than baseline
NFKD/Korean: OK (1.27s)
10.0 ms ± 97 μs, 15 MB allocated, 1.5 KB copied, 32 MB peak memory, 0.33x
NFKD/Vietnamese: OK (1.68s)
6.57 ms ± 45 μs, 9.3 MB allocated, 703 B copied, 32 MB peak memory, 0.62x, 8% faster than baseline
NFC/AllChars: OK (1.14s)
4.51 ms ± 57 μs, 3.8 MB allocated, 347 B copied, 32 MB peak memory, 1.32x, 54% faster than baseline
NFC/Deutsch: OK (1.01s)
1.98 ms ± 38 μs, 1.9 MB allocated, 227 B copied, 32 MB peak memory, 1.30x, 47% faster than baseline
NFC/Devanagari: OK (1.48s)
5.77 ms ± 45 μs, 6.1 MB allocated, 679 B copied, 32 MB peak memory, 0.71x, 3% faster than baseline
NFC/English: OK (2.04s)
1.97 ms ± 18 μs, 1.9 MB allocated, 223 B copied, 32 MB peak memory, 1.28x, 23% faster than baseline
NFC/Japanese: OK (1.60s)
3.14 ms ± 24 μs, 1.9 MB allocated, 219 B copied, 32 MB peak memory, 1.25x, 68% faster than baseline
NFC/Korean: OK (1.12s)
4.37 ms ± 56 μs, 1.9 MB allocated, 215 B copied, 32 MB peak memory, 1.10x, 2% faster than baseline
NFC/Vietnamese: OK (1.56s)
12.3 ms ± 205 μs, 11 MB allocated, 1.4 KB copied, 32 MB peak memory, 0.86x, 25% faster than baseline
NFKC/AllChars: OK (1.21s)
9.58 ms ± 86 μs, 13 MB allocated, 1.2 KB copied, 32 MB peak memory, 1.05x, 36% faster than baseline
NFKC/Deutsch: OK (1.33s)
2.58 ms ± 27 μs, 5.7 MB allocated, 623 B copied, 32 MB peak memory, 0.92x, 40% faster than baseline
NFKC/Devanagari: OK (0.73s)
5.73 ms ± 101 μs, 6.1 MB allocated, 615 B copied, 32 MB peak memory, 0.71x, 10% faster than baseline
NFKC/English: OK (1.03s)
1.98 ms ± 29 μs, 1.9 MB allocated, 201 B copied, 32 MB peak memory, 1.32x, 15% faster than baseline
NFKC/Japanese: OK (4.59s)
4.58 ms ± 17 μs, 6.7 MB allocated, 599 B copied, 32 MB peak memory, 0.94x, 64% faster than baseline
NFKC/Korean: OK (1.36s)
5.33 ms ± 51 μs, 5.7 MB allocated, 591 B copied, 32 MB peak memory, 0.71x, 4% slower than baseline
NFKC/Vietnamese: OK (0.75s)
12.1 ms ± 234 μs, 11 MB allocated, 1.3 KB copied, 32 MB peak memory, 0.84x, 31% faster than baseline
Some remarks:
- I added lots of comments to understand and reorganized the code to understand better how it works.
- I used
{-# SCC #-}annotations for profiling; should I keep them? - Quick Check values are encoded in this package and not
unicode-data, because for NFC & NFKC they uses a different encoding thanNo, Maybe, Yes. - Allocated & copied memory quite reduced for Japanese & AllChars.
- I do not see other optimization at the moment.
- Do we want
isNormalizedfunction? If so the current encoding of QC for composition may not be usable. isNFC_QCandisNFKC_QCencode value in a byte, but really use 2 bits. Do we need to improve that? We could encode like this:[isNFC_QC_custom,isNFC_QC_original,isNFKC_QC_original,isNFKC_QC_custom]and select the corresponding pair of bits we need, but it implies overhead.- Need to add benchmark for variants of the files: already normalized.
@wismill these results look great! Considering that we were already comparable or faster than text-icu in most cases. Now we are even better. Also we are now comparable in the cases where we were slower earlier (NFC and NFKC for Japanese, Deutsch and AllChars).
I will take a better look when I get some time.
I updated the benchmark results to include allocated memory & diff with text-icu..
I am looking at it, will update soon.
I see the following benchmarks getting slower:
All
unicode-transforms-text
NFD/Korean: OK (0.20s)
13.5 ms ± 983 μs, 32% slower than baseline
NFKD/Korean: OK (0.21s)
13.8 ms ± 746 μs, 36% slower than baseline
NFC/Korean: OK (0.26s)
8.20 ms ± 348 μs, 30% slower than baseline
NFC/Deutsch: OK (0.18s)
5.63 ms ± 350 μs, 18% slower than baseline
NFC/English: OK (0.18s)
5.61 ms ± 453 μs, 23% slower than baseline
NFKC/Korean: OK (0.27s)
8.59 ms ± 374 μs, 25% slower than baseline
NFKC/Deutsch: OK (2.90s)
5.72 ms ± 66 μs, 6% slower than baseline
NFKC/English: OK (1.10s)
4.26 ms ± 234 μs, 11% slower than baseline
Mainly Korean has significant regression. We can try to review and see if there are any opportunities to correct these.
I added lots of comments to understand and reorganized the code to understand better how it works.
thanks
I used {-# SCC #-} annotations for profiling; should I keep them?
Your judgement, if you think they might be helpful keep them.
Quick Check values are encoded in this package and not unicode-data, because for NFC & NFKC they uses a different encoding than No, Maybe, Yes.
We can still have it in unicode-data. It is not necessary to use the same encoding as in the original db, but whaetver seems generally useful as per the algo should be ok.
Do we want isNormalized function? If so the current encoding of QC for composition may not be usable.
I do not know. Unless someone requests it.
We are faster than text-icu in all cases except the following:
unicode-transforms
NFC/Korean: OK (0.26s)
8.20 ms ± 348 μs, 30% slower than baseline
NFC/Deutsch: OK (0.18s)
5.63 ms ± 350 μs, 18% slower than baseline
NFC/English: OK (0.18s)
5.61 ms ± 453 μs, 23% slower than baseline
NFC/Japanese: OK (0.17s)
6.10 ms ± 383 μs, 61% faster than baseline
NFC/AllChars: OK (0.24s)
7.61 ms ± 340 μs, 44% faster than baseline
NFKC/Deutsch: OK (2.90s)
5.72 ms ± 66 μs, 6% slower than baseline
NFKC/English: OK (1.10s)
4.26 ms ± 234 μs, 11% slower than baseline
NFKC/AllChars: OK (0.20s)
13.5 ms ± 1.3 ms, 31% faster than baseline
text-icu
NFC/Korean: OK (0.15s)
5.31 ms ± 366 μs
NFC/Deutsch: OK (0.19s)
2.90 ms ± 197 μs
NFC/English: OK (0.37s)
2.87 ms ± 211 μs
NFC/Japanese: OK (0.13s)
4.02 ms ± 374 μs
NFC/AllChars: OK (0.14s)
4.47 ms ± 354 μs
NFKC/Deutsch: OK (0.29s)
4.56 ms ± 269 μs
NFKC/English: OK (0.17s)
2.92 ms ± 184 μs
NFKC/AllChars: OK (0.18s)
11.9 ms ± 987 μs
We are slower only in some NFC cases and some of those are the ones that regressed with this change. So maybe we can improve those.
I see the following benchmarks getting slower:
All unicode-transforms-text NFD/Korean: OK (0.20s) 13.5 ms ± 983 μs, 32% slower than baseline NFKD/Korean: OK (0.21s) 13.8 ms ± 746 μs, 36% slower than baseline NFC/Korean: OK (0.26s) 8.20 ms ± 348 μs, 30% slower than baseline NFC/Deutsch: OK (0.18s) 5.63 ms ± 350 μs, 18% slower than baseline NFC/English: OK (0.18s) 5.61 ms ± 453 μs, 23% slower than baseline NFKC/Korean: OK (0.27s) 8.59 ms ± 374 μs, 25% slower than baseline NFKC/Deutsch: OK (2.90s) 5.72 ms ± 66 μs, 6% slower than baseline NFKC/English: OK (1.10s) 4.26 ms ± 234 μs, 11% slower than baselineMainly Korean has significant regression. We can try to review and see if there are any opportunities to correct these.
I do not observe these results. Mine are all same or faster than baseline (ghc 9.2.4).
Results (NNNx is compared to ICU, NNN% is against baseline)
All
unicode-transforms-te
NFD/AllChars: OK (0.37s)
5.73 ms ± 207 μs, 0.76x, 7% less than baseline
NFD/Deutsch: OK (0.76s)
2.94 ms ± 85 μs, 0.80x, 17% less than baseline
NFD/Devanagari: OK (0.33s)
5.17 ms ± 189 μs, 0.66x, same as baseline
NFD/English: OK (0.33s)
2.49 ms ± 90 μs, 0.79x, 18% less than baseline
NFD/Japanese: OK (0.41s)
6.46 ms ± 172 μs, 0.69x, 7% more than baseline
NFD/Korean: OK (0.62s)
9.86 ms ± 183 μs, 0.32x, same as baseline
NFD/Vietnamese: OK (0.44s)
6.88 ms ± 216 μs, 1.21x, same as baseline
NFKD/AllChars: OK (0.54s)
8.51 ms ± 271 μs, 0.85x, 10% less than baseline
NFKD/Deutsch: OK (0.62s)
2.40 ms ± 81 μs, 0.67x, 26% less than baseline
NFKD/Devanagari: OK (0.59s)
4.66 ms ± 109 μs, 0.60x, 13% less than baseline
NFKD/English: OK (0.50s)
1.92 ms ± 45 μs, 0.61x, 30% less than baseline
NFKD/Japanese: OK (0.83s)
6.56 ms ± 107 μs, 0.64x, 6% less than baseline
NFKD/Korean: OK (0.61s)
9.66 ms ± 311 μs, 0.31x, same as baseline
NFKD/Vietnamese: OK (0.39s)
6.06 ms ± 169 μs, 0.55x, 13% less than baseline
NFC/AllChars: OK (0.45s)
3.53 ms ± 87 μs, 0.97x, 63% less than baseline
NFC/Deutsch: OK (0.38s)
1.43 ms ± 48 μs, 0.88x, 62% less than baseline
NFC/Devanagari: OK (0.67s)
5.23 ms ± 120 μs, 0.63x, 12% less than baseline
NFC/English: OK (0.38s)
1.42 ms ± 56 μs, 0.86x, 49% less than baseline
NFC/Japanese: OK (0.33s)
2.62 ms ± 90 μs, 0.98x, 75% less than baseline
NFC/Korean: OK (0.53s)
4.08 ms ± 89 μs, 1.00x, 14% less than baseline
NFC/Vietnamese: OK (0.73s)
11.8 ms ± 307 μs, 0.82x, 24% less than baseline
NFKC/AllChars: OK (0.56s)
9.12 ms ± 202 μs, 0.97x, 36% less than baseline
NFKC/Deutsch: OK (0.66s)
2.55 ms ± 56 μs, 0.87x, 37% less than baseline
NFKC/Devanagari: OK (0.72s)
5.55 ms ± 94 μs, 0.68x, 7% less than baseline
NFKC/English: OK (0.52s)
1.98 ms ± 45 μs, 1.23x, 30% less than baseline
NFKC/Japanese: OK (0.47s)
3.77 ms ± 111 μs, 0.75x, 67% less than baseline
NFKC/Korean: OK (0.30s)
4.83 ms ± 177 μs, 0.64x, 6% less than baseline
NFKC/Vietnamese: OK (0.74s)
12.0 ms ± 255 μs, 0.83x, 22% less than baseline
You may want to use --stdev 1 as the results can be fickle.
I do not observe these results. Mine are all same or faster than baseline (ghc 9.2.4).
It may depend on the CPU as well. I tested on Linux, AWS VM - which says "model name : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz" in "/proc/cpuinfo". Which CPU/Hardware are you testing on?
Sure thing. I tested on a laptop equipped with AMD Ryzen 5 2500U @ 2.0GHz, on Linux.
I tested on a different machine:
Results with GHC 9.4.1 on 4 × Intel® Core™ i5-3340M CPU @ 2.70GHz
All
unicode-transforms-text
NFD/AllChars: OK (2.83s)
11.0 ms ± 62 μs, 5% less than baseline
NFD/Deutsch: OK (0.87s)
6.81 ms ± 121 μs, 5% less than baseline
NFD/Devanagari: OK (1.56s)
12.2 ms ± 206 μs, 2% less than baseline
NFD/English: OK (1.60s)
6.23 ms ± 51 μs, 4% less than baseline
NFD/Japanese: OK (0.97s)
15.3 ms ± 267 μs, same as baseline
NFD/Korean: OK (1.19s)
18.8 ms ± 270 μs, 4% more than baseline
NFD/Vietnamese: OK (1.50s)
11.8 ms ± 93 μs, 4% less than baseline
NFKD/AllChars: OK (0.98s)
15.4 ms ± 289 μs, 3% less than baseline
NFKD/Deutsch: OK (1.91s)
7.67 ms ± 98 μs, 2% more than baseline
NFKD/Devanagari: OK (1.59s)
12.2 ms ± 112 μs, 2% less than baseline
NFKD/English: OK (0.77s)
5.99 ms ± 102 μs, 12% less than baseline
NFKD/Japanese: OK (0.99s)
15.7 ms ± 192 μs, 1% less than baseline
NFKD/Korean: OK (1.19s)
18.7 ms ± 363 μs, 3% more than baseline
NFKD/Vietnamese: OK (2.96s)
11.6 ms ± 175 μs, 4% less than baseline
NFC/AllChars: OK (5.88s)
11.5 ms ± 95 μs, 43% less than baseline
NFC/Deutsch: OK (2.12s)
8.24 ms ± 111 μs, 28% less than baseline
NFC/Devanagari: OK (1.93s)
15.2 ms ± 175 μs, 18% less than baseline
NFC/English: OK (2.09s)
8.16 ms ± 144 μs, 21% less than baseline
NFC/Japanese: OK (3.00s)
11.7 ms ± 123 μs, 50% less than baseline
NFC/Korean: OK (1.80s)
14.1 ms ± 195 μs, 2% more than baseline
NFC/Vietnamese: OK (0.61s)
19.5 ms ± 351 μs, 21% less than baseline
NFKC/AllChars: OK (0.99s)
15.6 ms ± 220 μs, 34% less than baseline
NFKC/Deutsch: OK (0.56s)
8.81 ms ± 169 μs, 22% less than baseline
NFKC/Devanagari: OK (0.97s)
15.2 ms ± 169 μs, 16% less than baseline
NFKC/English: OK (1.09s)
8.49 ms ± 96 μs, 16% less than baseline
NFKC/Japanese: OK (0.80s)
12.6 ms ± 212 μs, 45% less than baseline
NFKC/Korean: OK (0.91s)
14.6 ms ± 250 μs, 2% more than baseline
NFKC/Vietnamese: OK (1.24s)
19.5 ms ± 218 μs, 21% less than baseline
@harendra-kumar I observe some regressions but not as close as the one you got. Could you please re-run the benchmark carefully?
@wismill can you rebase this on master?
@harendra-kumar done
Sorry for a long delay. I benchmarked again, the benchmarks look good.
It seems tests are failing in the CIs. Can you take a look at those and make these pass?