Heuristics for "random" values to help with base-encoded typo false positives
This is a base58-encoded string from our codebase. Is there some heuristic the typo checker could use to recognize that such a long "random" string is not a word, and skip suggesting anything for it? This was part of a larger JSON string in a test.
error: `Wew` should be `We`
--> ./desc.rs:200:49
|
200 | "bytes_cid": "z177xERgbqgBdC97Y5GYXZWew1cFgkttqr5ipF2b8iCN17",
| ^^^
|
Here is another similar one, also from an embedded JSON string:
error: `nd` should be `and`
--> ./test.rs:31:221
|
31 | pub const JWK: &str = r#"{"alg":"sig","n":"wnI2iD6F7qAg0qKGpFQ6L7qYdGbPkHSUHzigaW3p89fWBbZRT-WawqdU4vu3vANL9whlXMGlzLsPNUwXsoDKu6CnzAUUO9pr7E6CukN9A1UN13L-ZRKHAGv33NkdygDpTsYXUVAoQLykPnjToNVDKA0ohy96kzPkT4vql9n_5ev7Dhy69nd79mI09QhHo62RGzZDDanjdjXRBLBFA3Hm-CKiu"]}"#;
| ^^
|
These are the last two major false positives we've been seeing in our codebase with typos; it works really well otherwise!
Yes, we have several issues related to hashes / base encodings of some sort:
- https://github.com/crate-ci/typos/issues/401
- https://github.com/crate-ci/typos/issues/413
- https://github.com/crate-ci/typos/issues/415
Having some kind of heuristic to discard hashes / base-encodings beyond a strict syntax check would be a big help. What that'd look like is the question though. To start off brainstorming,
- X numbers (groups of digits) in string
- X "words" (groups of letters) shorter than Y characters
- We can probably treat base encodings the same as hashes (i.e. no special heuristics for how "much" of a word exists between `-`, to protect against math between variables), since identifiers used in math will show up somewhere else in the code and get flagged there
Any other ideas for heuristics and for what the Xs and Ys should be?
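To make the brainstorm above concrete, here is a minimal sketch of the two counting heuristics. It is not how typos works today. The function name `looks_random`, the threshold values, and the choice to require both conditions at once are all assumptions for illustration; the real Xs and Ys are still the open question.

```rust
// Hypothetical thresholds standing in for the X and Y in the list above.
const MIN_DIGIT_GROUPS: usize = 3; // X numbers (groups of digits)
const MIN_SHORT_WORDS: usize = 3; // X "words" (groups of letters)
const SHORT_WORD_LEN: usize = 4; // "shorter than Y characters"

/// Returns true if `token` looks like a hash or base-encoded value.
/// Assumption: both heuristics must fire; either alone may also be reasonable.
fn looks_random(token: &str) -> bool {
    // Count maximal runs of ASCII digits.
    let digit_groups = token
        .split(|c: char| !c.is_ascii_digit())
        .filter(|group| !group.is_empty())
        .count();

    // Count maximal runs of ASCII letters shorter than SHORT_WORD_LEN.
    let short_words = token
        .split(|c: char| !c.is_ascii_alphabetic())
        .filter(|word| !word.is_empty() && word.len() < SHORT_WORD_LEN)
        .count();

    digit_groups >= MIN_DIGIT_GROUPS && short_words >= MIN_SHORT_WORDS
}
```

With these assumed thresholds, the base58 string from the report (8 digit groups, 5 short letter groups) would be skipped, while an ordinary identifier such as `sha256` would still be checked.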
https://github.com/crate-ci/typos/issues/316 has a list of alternative approaches. Feel free to share how useful or not those approaches would be in that issue.
I think the most important thing would be to have a way to opt out of tricky situations. For example, you may want to include a text string that contains typos in test code, and there will be cases that are difficult to detect properly, such as these base-encoded numbers, JSON strings, and the like.
So having a solution like the ones in #316 to opt out would be a great, robust fallback. For our particular use cases, a way to disable handling through comments that enable/disable the spell check would work.
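As a rough illustration of the comment-based opt-out, the sketch below picks out which lines would still be checked. The marker strings `spellcheck:off` and `spellcheck:on` and the function `checked_lines` are hypothetical and are not an existing typos feature; the actual syntax would be whatever gets settled in #316.

```rust
// Hypothetical marker comments; not an existing typos feature.
const OFF: &str = "spellcheck:off";
const ON: &str = "spellcheck:on";

/// Returns the 1-based numbers of the lines that should still be spell-checked.
/// Lines containing a marker are themselves never checked.
fn checked_lines(source: &str) -> Vec<usize> {
    let mut enabled = true;
    let mut checked = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        if line.contains(OFF) {
            enabled = false;
        } else if line.contains(ON) {
            enabled = true;
        } else if enabled {
            checked.push(idx + 1);
        }
    }
    checked
}
```

Wrapping the test constant in an off/on pair would then hide the embedded JSON string from the checker while leaving the rest of the file covered.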