strsim-rs
                                
                                 strsim-rs copied to clipboard
                                
                                    strsim-rs copied to clipboard
                            
                            
                            
                        [Suggestion] performance improvement by avoiding character counts and string comparisons whenever this is possible
In the implementation of the jaro distance, there is a useless loop on all the characters of both strings : https://github.com/dguo/strsim-rs/blob/master/src/lib.rs#L62-L63
string.chars().count() runs in O(n) on the length of string
the a == b at the beginning is also O(n) on the length of the strings if they hay the same length.
Yep, I'm aware that it's O(n). The reason we have to use it anyway is that string.len() returns the number of bytes, not the number of characters. See this StackOverflow question for more details.
I think you do not have to have two loops here. Can't you do that in just one loop, updating both lengths at the same time ? And removing the a==b at the beginning ?
I am now looking at the levenshtein implementation, and the same problem seems to be present. In order to implement levenshtein, you do not even need to now the length of both strings beforehand.
In normalized_levenshtein, there is even a new iteration over both strings to recalculate both lengths (that were already computed in the levenshtein function.
Just fyi, yesterday I started making a set of commits to push that I will push soon (when I've finished them and next got an internet connection). This happens to include some optimisations in this very area.
I accidentally lost all of the work I'd done by deleting the directory I was working in, but I've redone it all and pushed (optimisations included): https://github.com/dguo/strsim-rs/pull/31
I think that after removing a == b, the function generic_jaro doesn't handle properly the case when both sequences have length one: https://github.com/dguo/strsim-rs/blob/c4cdd9c35dfaf7fa4e5e023d22854180b114dd9c/src/lib.rs#L72
@vvrably, thanks for the heads up. I released the fix as v0.9.3.
We could still get rid of the useless count in normalized levenshtein (and I probably will when we update the generic function signatures). However this likely will not make a big difference. All other redundant iterations should now be gone, since we don't perform e.g. the equality checks anymore.