Hashes/encodings below the heuristic limit are treated as typos

Open halkeye opened this issue 3 years ago • 17 comments

error: `Ba` should be `By`, `Be`
  --> ./content/blog/2017/08/2017-08-08-introducing-jenkins-minute.adoc:39:14
   |
39 | video::FhDomw6BaHU[youtube, width=852, height=480]

I can add it to [default.extend-identifiers] so it's not a blocker, but figured you'd like another test case.

halkeye avatar Jan 28 '22 21:01 halkeye

The challenge is being able to identify that as a hash. How do we tell a hash from an identifier?

Right now, we support

  • SHA detection: must be 32+ characters long and consistent case
  • Base64 detection: must be 90 characters long or contain + or /, and must have the padding bytes (though it is uncertain whether the padding requirement will stay, see #413)
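
The two rules above can be sketched roughly as follows. This is an illustration in Python, not typos' actual implementation: the function names are hypothetical, "90 characters long" is read as "at least 90", and "padding bytes" is read as one or two trailing = characters.

```python
import re

# Hex digits with consistent letter case (all lower or all upper).
HEX_RE = re.compile(r"[0-9a-f]+|[0-9A-F]+")
# Base64 alphabet followed by one or two '=' padding characters (assumed reading).
B64_RE = re.compile(r"[A-Za-z0-9+/]+={1,2}")


def looks_like_sha(token: str) -> bool:
    """Hypothetical check: 32+ hex characters in a consistent case."""
    return len(token) >= 32 and HEX_RE.fullmatch(token) is not None


def looks_like_base64(token: str) -> bool:
    """Hypothetical check: padded, and either long or containing '+' or '/'."""
    if B64_RE.fullmatch(token) is None:
        return False
    return len(token) >= 90 or "+" in token or "/" in token
```

Under these assumptions a short token such as FhDomw6BaHU passes neither check, so it falls through to ordinary spell-checking.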

epage avatar Jan 28 '22 22:01 epage

Yeah, that makes a lot of sense. I'm just starting to use the app and loving it so far. I've only had to whitelist two hashes that have ba in them, so it's not a big deal for me.

Maybe some sort of regex or something so I could whitelist video::[a-zA-Z0-9]\[

halkeye avatar Jan 28 '22 22:01 halkeye

I think it also causes some problems with Jupyter notebooks.

error: `ba` should be `by`, `be`
  --> jupyter.ipynb:661:11
    |
661 |    "id": "6ba7c279",
    |            ^^
    |
error: `ba` should be `by`, `be`
  --> jupyter.ipynb:784:15
    |
784 |    "id": "33088ba8",
    |                ^^
    |
error: `ba` should be `by`, `be`
  --> jupyter.ipynb:1029:10
     |
1029 |    "id": "ba6788ca",
     |           ^^
     |
error: `ba` should be `by`, `be`
  --> jupyter.ipynb:2029:10
     |
2029 |    "id": "ba638183",
     |           ^^
     |

pums974 avatar Mar 10 '22 17:03 pums974

Hello,

Git commit hashes tend to run in the range [0-9a-fA-F]{7,} so that would be a useful addition.

tspearconquest avatar Jun 15 '22 18:06 tspearconquest

@tspearconquest for shorter git commit hashes, we'll need to rely on a heuristic like the one discussed in #484, because shorter commit hashes could just as easily be words.

epage avatar Jun 15 '22 18:06 epage

How about adding a heuristic "word contains characters preceded by numbers" (where "word" is a whitespace-separated segment, not a case-separated segment)? I don't think I've ever seen an identifier be named foo1bar or 3foo, though foo3 or foo3_bar seem realistic (e.g. zip3, zip4).
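
A minimal sketch of that heuristic, assuming "characters preceded by numbers" means a letter that directly follows a digit. The function name is hypothetical and this is not how typos tokenizes input:

```python
import re

# A digit immediately followed by a letter, anywhere in the token.
DIGIT_THEN_LETTER = re.compile(r"[0-9][A-Za-z]")


def letter_after_digit(token: str) -> bool:
    """Hypothetical heuristic: True if any letter directly follows a digit."""
    return DIGIT_THEN_LETTER.search(token) is not None


for token in ["foo1bar", "3foo", "foo3", "foo3_bar", "zip3", "6ba7c279", "ba638183"]:
    print(token, letter_after_digit(token))
```

Under this reading, an id such as 6ba7c279 would be skipped, while an id such as ba638183, whose letters all come before its digits, would still be spell-checked.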

jplatte avatar Sep 01 '22 07:09 jplatte

sha1hash?

halkeye avatar Sep 01 '22 07:09 halkeye

Right, there's a few exceptions (I also remembered there being 2to3), but maybe it's still a good heuristic? Personally, I consider false positives a bigger issue than false negatives, and I think that matches typos' overall approach.

jplatte avatar Sep 01 '22 07:09 jplatte

There is also all the this2that and thing4stuff

pums974 avatar Sep 01 '22 08:09 pums974

@jplatte I'd probably refine your comment to be "any identifier that exclusively word splits due to numbers and not any other separator (be it case or _)".

The next question is the likelihood of a shortened SHA having no numbers. I probably didn't bring this up in the other thread about heuristics, but I suspect it is better to have something always complain than to have it complain in a way people no longer expect.

epage avatar Sep 01 '22 14:09 epage

In the case of a hex string, that would be (6/16) (since 6 out of the 16 possible chars are alphabetical) to the power of the string length. Git short hashes are the shortest hashes I see in practice, and they seem to start at 7 characters (longer in large repos), which puts the probability of such a hash having no digits at pretty much exactly 0.1%. Strings that contain no digits before any non-digit characters would be closer to 0.5% though (rough estimation, could also be >0.5%, but not <0.28%).
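
The no-digits estimate can be checked numerically. This sketch assumes every character of the hash is drawn uniformly and independently from the 16 hex characters, 6 of which are letters; the function name is illustrative only.

```python
# Probability that a random hex string of a given length contains no digits,
# assuming each character is drawn uniformly and independently.
LETTERS = 6     # a-f
ALPHABET = 16   # 0-9 plus a-f


def p_no_digits(length: int) -> float:
    """Chance that all `length` characters are letters: (6/16) ** length."""
    return (LETTERS / ALPHABET) ** length


for length in (6, 7, 8, 12):
    print(f"{length:2d} chars: {p_no_digits(length):.4%}")
```

For 7 characters this comes out at roughly 0.1%, and for 6 characters at roughly 0.28%.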

jplatte avatar Sep 01 '22 15:09 jplatte

FYI #695 provides a new workaround for false positives

epage avatar Mar 22 '23 20:03 epage

  --> ./content/n/rust-docker.md:52:44
   |
52 | hello         0.1.0                ac4e1a72ba05   2 minutes ago    1.38GB
   |                                            ^^
   |
error: `ba` should be `by`, `be`
  --> ./content/n/rust-docker.md:53:46
   |
53 | rust          1.52.1-slim-buster   61cb3c65a6ba   3 weeks ago      621MB
   |                                              ^^

The extend-ignore-re solved this issue.

[default]
extend-ignore-re = ["[0-9a-fA-F]{12}"]

I think we can safely close this issue.

azzamsa avatar Jan 28 '24 00:01 azzamsa

@azzamsa that regex is much too generic, it disables spell-checking for all 12-letter identifiers as well.
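
The over-matching is easy to reproduce. The sketch below uses Python's re module purely for illustration. BOUNDED is a hypothetical tighter variant, not a recommended typos setting, and the sample identifiers are made up.

```python
import re

# The pattern from the config above: any run of 12 hex characters, unanchored.
BROAD = re.compile(r"[0-9a-fA-F]{12}")
# Hypothetical tighter variant: the run must be a whole token.
BOUNDED = re.compile(r"\b[0-9a-fA-F]{12}\b")

samples = ["61cb3c65a6ba", "feedfacecafe_handler", "feedfacecafe"]
for text in samples:
    # Columns: sample, matched by BROAD, matched by BOUNDED
    print(text, bool(BROAD.search(text)), bool(BOUNDED.search(text)))
```

The unanchored pattern also matches the first 12 characters of feedfacecafe_handler, while the word-boundary variant does not. Neither can tell a 12-character token made only of the letters a-f from a real hash.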

jplatte avatar Jan 28 '24 07:01 jplatte

Description

The typos pre-commit hook fails on truncated commit hashes in CHANGELOG.md.

Environment

- repo: https://github.com/crate-ci/typos
  rev: v1.20.4
  hooks:
    - id: typos

Actual Behavior

$ pre-commit run --files CHANGELOG.md

typos....................................................................Failed
- hook id: typos
- exit code: 2

error: `ba` should be `by`, `be`
  --> CHANGELOG.md:100:28
    |
100 | - _(README)_ update - ([e84ba3e](https://github.com/DeadNews/firebirdsql-run/commit/e84ba3e8e2f72a8dcad43f8ac3c768527ca199bd))
    |                            ^^
    |
$ pre-commit run --files CHANGELOG.md

typos....................................................................Failed
- hook id: typos
- exit code: 2

error: `ba` should be `by`, `be`
  --> CHANGELOG.md:22:99
   |
22 | - update `mkdocs` config ([#127](https://github.com/DeadNews/encode-utils-cli/issues/127)) - ([c92ba20](https://github.com/DeadNews/encode-utils-cli/commit/c92ba2032ac0b492b390d45c50f7c57c2660df5c))
   |                                                                                                   ^^
   |
error: `ba` should be `by`, `be`
  --> CHANGELOG.md:62:43
   |
62 | - _(renovate)_ use shared config - ([693c3ba](https://github.com/DeadNews/encode-utils-cli/commit/693c3ba58822db45dd06a032ba1ce554db6deaf6))
   |                                           ^^
   |

original: https://github.com/crate-ci/typos/issues/982

DeadNews avatar Apr 04 '24 16:04 DeadNews

ba should be by, be

↑ This ba is in all examples. Maybe add it to the exceptions?

DeadNews avatar Apr 04 '24 16:04 DeadNews