Selection of boundaries for the default implementation of `is_case`
@rutrum Hi! Firstly, thanks for such a useful crate!
I have found the following example which IMO has a not intuitive result.
#[test]
fn test_upper_snake() {
use convert_case::{Case, Casing};
assert!("UPPER_CASE_WITH_DIGIT1".is_case(Case::UpperSnake));
}
This test fails, even though UpperSnake case has only a single boundary _ and the string is in capital letters.
This happens because the implementation of is_case initializes the StateConverter with Boundary::defaults() which includes UpperDigit boundary.
Is this by design, or a bug?
If a bug, would the correct implementation of is_case be something like this?
fn is_case(&self, case: Case) -> bool {
&self.from_case(case).to_case(case) == self
}
Hi @r-bk . Thanks for reaching out. I'm glad you like the crate.
I agree with you, I think this is a non-intuitive result. I thought your suggestion made sense, but it failed some tests. Take the following:
assert!(!"kebab-case-string".is_case(Case::Snake))
Only splitting based on underscore in this case would be counter to the intent. We try generously to split on all boundaries, so we return false when we join back with the case-specific delimiter.
I think the unexpected part of this code is that UpperDigit boundary is being used by default. Digits are still an ongoing problem I'm trying to formalize (see http://stringcase.org, site WIP). Depending on context, they are sometimes boundaries, sometimes not.
An example brought to my attention from another use was "2d_transformation". This is case of digit first, then letter. In this case, I wouldn't want to split here. But for "Transformation2D" I might split on "n2". But then, the unlikely scenario of "TRANSFORMATION2D"...this doesn't feel common, and I don't even think it's obvious that you'd want to split on UpperDigit all the time.
I think I'll remove UpperDigit from the list of defaults. Do you think that makes sense? Should another digit boundary be removed from the defaults?
Hi @rutrum ! Thanks for the response and the explanation why my snippet doesn't work in all cases.
There are no easy choices :) So I am thinking loud here.
Intuitively removing LowerDigit too (for consistency) seems like a good idea.
But the example that you gave with Transformation2D kills it.
Maybe the definition of Acronym should be expanded in a way that would allow removal of LowerDigit too? But then there is an example of json2text. Now it becomes json_2_text, and it would become json2_text (which is wrong). But should it be split at all?
Specifically to fix the is_case function, maybe instead of splitting and comparing to self
it should (a) check if self conforms to case without splitting it and (b) check that self doesn't contain "other" delimiters ? Are there other conditions that must be checked?
I think you are correct, this isn't an issue of default boundaries (that might be a separate issue, however.) It's really that the mechanism of performing the transformation isn't good enough. This function would actually require manual checks.
I agree with your outline. The is_case function should first check what delimiters are present. If they are the correct delimiters for a certain case, then they can be used to split the identifier into words. Then, each word can be checked to see if they follow a particular "word case", which might be all lower, upper then lower, or all upper.
As I get back into this project, I will put this item on the todo list of improvements to make. I don't know if it will make it to 0.7, but this is clearly something that should be fixed. Thanks for bringing it to my attention.
Hi @r-bk . In preparation for a 1.0.0 release (I think) I'm working on fixing is_case. I think I've come up with an algorithm that works intuitively, and I wanted to share to see if you could crack it.
It's simple: first, we just remove all digits from the string. This means that whatever digits are present, whether they are intended to be at the border of a word or within a word, shouldn't change the interpretation. Then, we convert into the case. We compare the digitless string with the converted string. In code:
fn is_case(&self, case: Case) -> bool {
let digitless = self
.as_ref()
.chars()
.filter(|x| !x.is_ascii_digit())
.collect::<String>();
digitless == digitless.to_case(case)
}
This works for all the exceptions described above, as well as your original example. Do you have any ideas on how this might produce non-intuitive results?
Hi @rutrum ,
If we remove all digits from is_case, does UpperDigit (or any other boundary involving digits) have any further meaning?
My view so far was that is_case, instead of trying to convert the string to checked case, should check that the string doesn't violate case rules.
I don't know exactly which "snake_case standard" you'd like to implement, but if it's Rust RFC 430 I found another corner case.
In snake_case or SCREAMING_SNAKE_CASE, a "word" should never consist of a single letter unless it is the last "word". So, we have btree_map rather than b_tree_map, but PI_2 rather than PI2.
Currently d2 is not considered snake_case (d_2) but if you want to match that standard, d2 should be proper snake_case and d_2 should be considered improper.
@r-bk You're absolutely right. If is_case is going to act on the StateConverter struct (the internal struct that holds the state while you build the conversion with from_case) then it should use the boundaries defined by it. Filtering by digits would be the wrong design pattern.
@notpeter I'm not trying to enforce edge cases imposed by certain style guides. I want the behavior to follow a simple set of rules that are indifferent to the readability of the final result. Another example of this: some languages define when to capitalize acronyms or not. Kotlin's style guide says to uppercase 2-letter acronyms in Pascal case, but capitalize for 3+-letter acronyms: https://kotlinlang.org/docs/coding-conventions.html#choose-good-names
When using an acronym as part of a declaration name, follow these rules:
- For two-letter acronyms, use uppercase for both letters. For example, IOStream.
- For acronyms longer than two letters, capitalize only the first letter. For example, XmlFormatter or HttpInputStream.
Luckily, the library does allow for customization. In this case, you can define a custom Pattern that detects for single-letter words and changes the behavior.
Thinking about this more, the problem with this feature seems to always concern digits. I think I'm coming around to the idea that we don't split on digits by default, and users can opt into considering those boundaries if they want. It's easiest to just say "digits are ignored" and then allow users to use the .add_boundaries/.set_boundaries as needed.
As far as the is_case API, I'm considering removing it altogether. Its a shallow wrapper that seems to be harder to explain than just have the implementer perform the equality themselves.
Ideally, such an is_case function would be very clever in how it evaluates a string, going left to right and asserting that some conditions are never met, or else we short-circuit and declare "this is not snake case". Because of how boundaries are defined (that is, they are very flexible) I don't think its possible to do this most-optimal implementation. Not the mention, the library explicitly ignores most symbols, recently ignored trailing, leading, and double delimiters, and I'm now suggesting ignoring digits by default. The library can convert many strings into snake case but most of them won't look like it.
Lastly, I think the RFC 430 comment brings to light that a lot of users of Rust might be expecting this behavior, so I'll add an example of how you can implement this behavior using the Pattern struct. I might even add the implementation to the convert-case-extras library that was recently created, meant for more-specific functionality.
Let me know if anyone here has more ideas, I'm hoping to release a stable 1.0.0 API very soon.