Cache char ASCII equivalence to speed up transliteration
Trying to make the transliteration process faster without C compilation, I've come up with something good enough.
Following the existing section caching, I've set up another dict cache that stores per-character ASCII equivalences.
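The idea, in a minimal sketch (lookup_via_section() is a hypothetical stand-in for unidecode's real section-table lookup, and all names here are mine, not the library's):

    import unicodedata

    char_cache = {}   # char -> cached ASCII replacement string

    def lookup_via_section(codepoint):
        # Hypothetical stand-in for unidecode's section-table lookup,
        # just so the sketch runs: NFKD-decompose and keep the ASCII part.
        decomposed = unicodedata.normalize('NFKD', chr(codepoint))
        return ''.join(c for c in decomposed if ord(c) < 0x80)

    def unidecode_sketch(string):
        retval = []
        for char in string:
            try:
                # Cache hit: skip the section arithmetic and table
                # indexing entirely. ASCII chars go through the same
                # cache, which is why even the all-ASCII best case below
                # gets a little faster.
                repl = char_cache[char]
            except KeyError:
                # First occurrence: do the full lookup once and remember
                # the result for every later occurrence.
                repl = lookup_via_section(ord(char))
                char_cache[char] = repl
            retval.append(repl)
        return ''.join(retval)

The dict lookup replaces the per-character section arithmetic and table indexing after each character's first occurrence, which is where the savings below come from.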
Speed tests
Basic case
Old code:
python -m timeit -s 'from unidecode import unidecode' 'unidecode("Hello world! ñ ŀ ¢")'
100000 loops, best of 3: 4.09 usec per loop
New code with char caching:
python -m timeit -s 'from unidecode import unidecode' 'unidecode("Hello world! ñ ŀ ¢")'
100000 loops, best of 3: 2.41 usec per loop
This case shows a ~41% reduction in run time (4.09 → 2.41 usec, roughly 1.7× faster).
Best-case scenario (all identical chars)
Old code:
python -m timeit -s 'from unidecode import unidecode' 'unidecode("aaaaa")'
1000000 loops, best of 3: 1.05 usec per loop
New caching code:
python -m timeit -s 'from unidecode import unidecode' 'unidecode("aaaaa")'
1000000 loops, best of 3: 0.848 usec per loop
That's a ~19% reduction (1.05 → 0.848 usec).
Bad-case scenario (random chars)
Old code:
python -m timeit -s 'from unidecode import unidecode; import random; import string' 'unidecode("".join(random.choice(string.ascii_letters) for x in range(10)))'
100000 loops, best of 3: 12.9 usec per loop
New code:
python -m timeit -s 'from unidecode import unidecode; import random; import string' 'unidecode("".join(random.choice(string.ascii_letters) for x in range(10)))'
100000 loops, best of 3: 12.2 usec per loop
Still a small improvement (~5%, 12.9 → 12.2 usec), even on random input.
Considerations
- I know the bad-case scenario I presented is not really the worst, since the worst would be transliterating different characters on every call. That's harder to test, but it's also extremely unlikely in practice. I believe most uses will transliterate a reduced set of characters, so the char cache will kick in and be faster. Bigger speedups probably correspond to transliterating non-ASCII chars. For example, transliterating "áéíóúñ", the most common non-ASCII chars in Spanish, takes less than 1 usec on average with caching and almost 3 usecs without it (a 3× speedup); see the command after this list.
- The character cache takes memory, and I haven't conducted memory tests because I don't know how. However, I believe that (a) a dictionary {char → char} shouldn't take a lot of memory and (b) most practical uses will transliterate a reduced set of characters (from a reduced set of languages), so the cache shouldn't grow too big. A rough way to estimate the footprint is sketched after this list.
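For reference, the Spanish-chars figure above can be reproduced with a command along these lines (the exact invocation is my reconstruction, not necessarily the one originally used):

python -m timeit -s 'from unidecode import unidecode' 'unidecode("áéíóúñ")'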
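As for the memory footprint, a rough upper bound can be taken with nothing but the standard library; a minimal sketch, assuming the hypothetical char_cache dict from the sketch above:

    import sys

    def cache_size_bytes(cache):
        # sys.getsizeof() covers the dict's hash table itself but does
        # not recurse into its contents, so the keys and values (short
        # strings) are added separately.
        total = sys.getsizeof(cache)
        for key, value in cache.items():
            total += sys.getsizeof(key) + sys.getsizeof(value)
        return total

For a cache holding a few dozen distinct characters this comes out to a few kilobytes, which supports point (a).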