strsim-rs icon indicating copy to clipboard operation
strsim-rs copied to clipboard

various improvements

Open jnqnfe opened this issue 5 years ago • 2 comments

please don't request breaking them into separate pull requests, that would be horrendous :)

I have no yet found time to review with respect to https://github.com/dguo/strsim-rs/pull/30

jnqnfe avatar Nov 04 '18 19:11 jnqnfe

Hey, I'm sorry for taking so long to get to this. I do appreciate all the work you put into it. I know you requested that I not ask you to break it up into smaller pull requests, but it is just too big. Would you be open to me helping you do so? I could go through the commits and make a checklist of discrete changes. So at least you wouldn't have to do the work of figuring out how to split it up.

dguo avatar Apr 08 '19 23:04 dguo

Sure, a plan would be great :)

jnqnfe avatar Apr 11 '19 21:04 jnqnfe

It has been quite a while since back then and a few things have changed in the meantime. One of the major changes is that going forward we will focus on implementing algorithms with reasonable performance, while keeping the binary size small. People who care about the maximum performance will be recommended to use https://github.com/rapidfuzz/rapidfuzz-rs which focuses more on performance, but has a larger binary size.

From a quick look over the changes they can be grouped into a couple of buckets:

  • update formatting: we do now use cargo fmt to keep consistent formatting

  • move tests: in rust it appears to be common to have the tests next to the code, while only placing integration tests outside. We will need to look into where we want which parts in the future

  • improve tests: We certainly need to improve them. Currently the test output is not particularly helpful

  • improve algorithms:

    • avoid calculating lengths multiple times: that probably makes sense and is already done e.g. in https://github.com/dguo/strsim-rs/pull/61
    • remove common prefix: This is generally an improvement and for this reason is done in https://github.com/rapidfuzz/rapidfuzz-rs. In fact it is often possible to remove both common prefix and postfix. However this is likely nothing we want here, since it can affect the binary size quite a bit.
    • use inclusive ranges: I opened https://github.com/dguo/strsim-rs/issues/59 for this, since while this improves readability a bit, it can negatively affect the binary size.
  • some documentation improvements like mentioning that the algorithms are not based on grapheme clusters. This could be extended to mention how to match using grapheme cluster which should be possible when using the generic versions of the metrics

I agree with @dguo that if we want to have even part of this in, we would need to split this up into multiple separate PRs. In the current form it's simply not possible to review this properly.

@jnqnfe are you still interested in working on these things?

maxbachmann avatar Jan 01 '24 16:01 maxbachmann

Currently I have far too much on my plate to do any further work on this and I don't see that changing any time soon, sorry. It's taken me three days to even find time to provide this reply.

jnqnfe avatar Jan 05 '24 01:01 jnqnfe

That's completely understandable. I just wanted to give you the option to continue working on this, since I know you did put quite a bit of work into this back then.

I did look through the changes a bit more:

  • The algorithmic improvements are probably nothing we would want. Not because they are useless, but because the project does focus on binary size nowadays and they just do not provide enough of an improvement to warrant the binary size increase.
  • The tests move is nothing we would want. I opened https://github.com/dguo/strsim-rs/pull/67 which just improves the failure message, since the current one is not really helpful
  • the benchmark cleanup is something we would want to change. I opened https://github.com/dguo/strsim-rs/pull/66 just cleaning up the unnecessary bits. I will probably change this to use criterion soon though.
  • the documentation will need to be tackled separately, since we certainly need to improve it.

Not sure I missed anything small, but finding it would require this to be rebased which would be a significant amount of work. For this reason I will close this PR.

Thanks again for the work you did put into this. I know the review has taken us quite a while :sweat_smile:

maxbachmann avatar Jan 05 '24 07:01 maxbachmann