
Cache char ASCII equivalence to speed-up transliteration

dukebody opened this issue 10 years ago

Trying to make the transliteration process faster without C compilation, I've come up with something that seems good enough.

Following the existing section caching, I've set up another dict cache that stores per-character equivalences.
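Roughly, the idea looks like this. This is a minimal sketch layered on top of the public API rather than a patch to the library internals; _char_cache and cached_unidecode are illustrative names, not the ones in my branch:

from unidecode import unidecode as _unidecode

_char_cache = {}  # maps a single character -> its ASCII replacement

def cached_unidecode(text):
    out = []
    for ch in text:
        if ch < '\x80':  # plain ASCII passes through unchanged
            out.append(ch)
            continue
        try:
            repl = _char_cache[ch]
        except KeyError:
            # First time we see this char: transliterate and remember it.
            repl = _char_cache[ch] = _unidecode(ch)
        out.append(repl)
    return ''.join(out)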

Speed tests

Basic case

Old code:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("Hello world! ñ ŀ ¢")'
100000 loops, best of 3: 4.09 usec per loop

New code with char caching:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("Hello world! ñ ŀ ¢")'
100000 loops, best of 3: 2.41 usec per loop

This represents a roughly 40% reduction in runtime (about 1.7x faster).

Best-case scenario (all identical chars)

Old code:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("aaaaa")'
1000000 loops, best of 3: 1.05 usec per loop

New caching code:

python -m timeit -s 'from unidecode import unidecode' 'unidecode("aaaaa")'
1000000 loops, best of 3: 0.848 usec per loop

Bad-case scenario (random ASCII chars)

Old code:

python -m timeit -s 'from unidecode import unidecode; import random; import string' 'unidecode("".join(random.choice(string.ascii_letters) for x in range(10)))'
100000 loops, best of 3: 12.9 usec per loop

New code:

python -m timeit -s 'from unidecode import unidecode; import random; import string' 'unidecode("".join(random.choice(string.ascii_letters) for x in range(10)))'
100000 loops, best of 3: 12.2 usec per loop

Considerations

  1. I know the bad-case scenario I presented is not really the worst, since the true worst case would be transliterating different characters on every call. That's harder to test, and it is extremely unlikely in practice: I believe most uses will transliterate a reduced set of characters, so the char cache will kick in and be faster. The bigger speedups probably come from transliterating non-ASCII chars. For example, transliterating "áéíóúñ", the most common non-ASCII chars in Spanish, takes less than 1 usec on average with caching and almost 3 usec without it (roughly a 3x speedup); the corresponding timeit command is shown after this list.
  2. The character cache takes memory, and I've not conducted memory tests because I don't know how. However, I believe that (a) a {char → char} dictionary shouldn't take much memory, and (b) most practical uses will transliterate a reduced set of characters (from a reduced set of languages), so the cache shouldn't grow too big; a crude way to estimate its footprint is sketched below.
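The Spanish-chars measurement from point 1 follows the same pattern as the benchmarks above (this is just the command form; the numbers quoted are from my machine):

python -m timeit -s 'from unidecode import unidecode' 'unidecode("áéíóúñ")'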
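As for point 2, here is a rough way to estimate the cache's memory footprint (an illustrative snippet, not a rigorous measurement; it assumes the cache is a plain dict mapping single-char strings to replacement strings, and cache_size_bytes is a made-up helper):

import sys

def cache_size_bytes(cache):
    # Shallow size of the dict plus the sizes of every key and value
    # string it holds. Ignores string interning and shared objects,
    # so treat the result as a ballpark figure, not an exact one.
    size = sys.getsizeof(cache)
    for key, value in cache.items():
        size += sys.getsizeof(key) + sys.getsizeof(value)
    return size

Each entry costs one dict slot plus two short string objects, so even a cache holding a few thousand distinct characters should stay well under a megabyte.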

dukebody · Oct 24 '15