rust-tokenizers icon indicating copy to clipboard operation
rust-tokenizers copied to clipboard

Use AsRef<Path> instead of &str.

Open npatsakula opened this issue 1 year ago • 2 comments

Master PR: #76 Blocked by: #79

Motivation

  1. Not every FS path is a valid utf-8 string.
  2. AsRef<Path> is strictly more acceptable with same level of correctness: you can use &str, Path and PathBuf.

Implementation

  • [x] Use AsRef<Path> instead of &str.
  • [x] Remove unnecessary type casts and unwraps from tests.

@veta666, this one is also on you.

npatsakula avatar Sep 12 '22 13:09 npatsakula

Thank you @npatsakula for this PR.

I would be in favor of actually moving from &str to Path instead of AsRef<Path> for 2 reasons:

1. As you mention, not every filesystem path is a valid utf-8 string. If we decide to take the _path_ of correctness, it makes sense to expect a `Path`. The user can still convert the `&str` it may use to `Path` before passing it to the tokenizers and vocabs.

2. Using `Option` with generics with `AsRef` leads to awkward `None` arguments, requiring the user to pass `None::<Path>` instead of simply `None`. I feel this is a high price to pay for allowing (possibly incorrects) `&str` as path representation.

What do you think? The next few pull requests will introduce breaking changes anyway, so I think using Path makes sense here.

I believe this would be ready to merge before #79 (I am not fully bought in the idea of snafu) so it may make sense to merge this one first and revert the error handling commits on this branch.

  1. Yes, but &str is valid subset of filesystem paths (the reverse is not true). This mean that we obtain richer interface for no cost.
  2. We can use pre-turbofished interface for beginner rust programmers (link).

Now I replaced all AsRef<Path> with Path and broke all comment documentation + you can see how noisy it looks like.

npatsakula avatar Oct 12 '22 14:10 npatsakula

Hello @npatsakula ,

You are right I got confused into assuming that Path was a subset of &str, it does make sense to keep a richer interface. As the library now leverages the specialized new_with_special_token_mapping_path, the issue with the awkward None turbofish becomes even less of an issue.

I apologize - but you convinced me that your original choice was, indeed, the right one. I would be happy to go with AsRef<Path> (after causing some extra work on your end). If you're happy to do so I'd go back to your original implementation with AsRef<Path> but without the error handling rework for now.

I'm happy to help our revert the changes if you'd like me to do so, please let me know.

guillaume-be avatar Oct 15 '22 07:10 guillaume-be

Hello @guillaume-be! I reverted to AsRef<Path> (doc-comments too).

npatsakula avatar Oct 19 '22 17:10 npatsakula

Hello @guillaume-be! I reverted to AsRef<Path> (doc-comments too).

Thanks a lot! looks good to merge - sorry again for the unneeded iterations

guillaume-be avatar Oct 19 '22 17:10 guillaume-be