strsim-rs [Suggestion] performance improvement by avoiding character counts and string comparisons whenever this is possible

[Suggestion] performance improvement by avoiding character counts and string comparisons whenever this is possible

Open lovasoa opened this issue 6 years ago • 9 comments

In the implementation of the jaro distance, there is a useless loop on all the characters of both strings : https://github.com/dguo/strsim-rs/blob/master/src/lib.rs#L62-L63

string.chars().count() runs in O(n) on the length of string

Nov 01 '18 15:11 lovasoa

the a == b at the beginning is also O(n) on the length of the strings if they hay the same length.

Nov 01 '18 15:11 lovasoa

Yep, I'm aware that it's O(n). The reason we have to use it anyway is that string.len() returns the number of bytes, not the number of characters. See this StackOverflow question for more details.

Nov 01 '18 15:11 dguo

I think you do not have to have two loops here. Can't you do that in just one loop, updating both lengths at the same time ? And removing the a==b at the beginning ?

Nov 01 '18 15:11 lovasoa

I am now looking at the levenshtein implementation, and the same problem seems to be present. In order to implement levenshtein, you do not even need to now the length of both strings beforehand.

Nov 01 '18 16:11 lovasoa

In normalized_levenshtein, there is even a new iteration over both strings to recalculate both lengths (that were already computed in the levenshtein function.

Nov 01 '18 16:11 lovasoa

Just fyi, yesterday I started making a set of commits to push that I will push soon (when I've finished them and next got an internet connection). This happens to include some optimisations in this very area.

Nov 02 '18 17:11 jnqnfe

I accidentally lost all of the work I'd done by deleting the directory I was working in, but I've redone it all and pushed (optimisations included): https://github.com/dguo/strsim-rs/pull/31

Nov 04 '18 19:11 jnqnfe

I think that after removing a == b, the function generic_jaro doesn't handle properly the case when both sequences have length one: https://github.com/dguo/strsim-rs/blob/c4cdd9c35dfaf7fa4e5e023d22854180b114dd9c/src/lib.rs#L72

Dec 12 '19 21:12 vladimirvrabely

@vvrably, thanks for the heads up. I released the fix as v0.9.3.

Dec 13 '19 02:12 dguo

We could still get rid of the useless count in normalized levenshtein (and I probably will when we update the generic function signatures). However this likely will not make a big difference. All other redundant iterations should now be gone, since we don't perform e.g. the equality checks anymore.

Jan 04 '24 21:01 maxbachmann

strsim-rs strsim-rs copied to clipboard

[Suggestion] performance improvement by avoiding character counts and string comparisons whenever this is possible

strsim-rs
strsim-rs copied to clipboard