strsim-rs icon indicating copy to clipboard operation
strsim-rs copied to clipboard

avoid string copy in sorensen dice

Open maxbachmann opened this issue 1 year ago • 1 comments

Since we only need to iterate over the bigrams for each string once, we can create them lazily instead of collecting them into a string. This reduces the binary size by around 7%. In addition it reduces runtime in our current benchmark by around 11%.

For reference in my example binary this gives:

File  .text     Size Crate
6.0%  96.7% 275.8KiB std
0.1%   1.5%   4.3KiB strsim
0.0%   0.0%     124B rf_test
0.0%   0.0%     102B [Unknown]
6.3% 100.0% 285.3KiB .text section size, the file size is 4.5MiB

while previously it was:

File  .text     Size Crate
6.1%  96.6% 276.5KiB std
0.1%   1.6%   4.6KiB strsim
0.0%   0.0%     124B rf_test
0.0%   0.0%     102B [Unknown]
6.3% 100.0% 286.3KiB .text section size, the file size is 4.5MiB

maxbachmann avatar Jan 04 '24 11:01 maxbachmann

It should be possible to improve this quite a bit further by using the same hashmap used for the damerau-levenshtein implementation. I made a quick experiment which reduced runtime by another 64% and while reducing binary size by another 38%. This version was just a quick experiment and doesn't calculate the correct score yet. So it could just be faster + smaller since it's broken :man_shrugging:

maxbachmann avatar Jan 04 '24 12:01 maxbachmann