strsim-rs icon indicating copy to clipboard operation
strsim-rs copied to clipboard

[Suggestion] performance improvement by avoiding character counts and string comparisons whenever this is possible

Open lovasoa opened this issue 6 years ago • 9 comments

In the implementation of the jaro distance, there is a useless loop on all the characters of both strings : https://github.com/dguo/strsim-rs/blob/master/src/lib.rs#L62-L63

string.chars().count() runs in O(n) on the length of string

lovasoa avatar Nov 01 '18 15:11 lovasoa

the a == b at the beginning is also O(n) on the length of the strings if they hay the same length.

lovasoa avatar Nov 01 '18 15:11 lovasoa

Yep, I'm aware that it's O(n). The reason we have to use it anyway is that string.len() returns the number of bytes, not the number of characters. See this StackOverflow question for more details.

dguo avatar Nov 01 '18 15:11 dguo

I think you do not have to have two loops here. Can't you do that in just one loop, updating both lengths at the same time ? And removing the a==b at the beginning ?

lovasoa avatar Nov 01 '18 15:11 lovasoa

I am now looking at the levenshtein implementation, and the same problem seems to be present. In order to implement levenshtein, you do not even need to now the length of both strings beforehand.

lovasoa avatar Nov 01 '18 16:11 lovasoa

In normalized_levenshtein, there is even a new iteration over both strings to recalculate both lengths (that were already computed in the levenshtein function.

lovasoa avatar Nov 01 '18 16:11 lovasoa

Just fyi, yesterday I started making a set of commits to push that I will push soon (when I've finished them and next got an internet connection). This happens to include some optimisations in this very area.

jnqnfe avatar Nov 02 '18 17:11 jnqnfe

I accidentally lost all of the work I'd done by deleting the directory I was working in, but I've redone it all and pushed (optimisations included): https://github.com/dguo/strsim-rs/pull/31

jnqnfe avatar Nov 04 '18 19:11 jnqnfe

I think that after removing a == b, the function generic_jaro doesn't handle properly the case when both sequences have length one: https://github.com/dguo/strsim-rs/blob/c4cdd9c35dfaf7fa4e5e023d22854180b114dd9c/src/lib.rs#L72

vladimirvrabely avatar Dec 12 '19 21:12 vladimirvrabely

@vvrably, thanks for the heads up. I released the fix as v0.9.3.

dguo avatar Dec 13 '19 02:12 dguo

We could still get rid of the useless count in normalized levenshtein (and I probably will when we update the generic function signatures). However this likely will not make a big difference. All other redundant iterations should now be gone, since we don't perform e.g. the equality checks anymore.

maxbachmann avatar Jan 04 '24 21:01 maxbachmann