pre-commit-hooks Add check for text file encodings

Refs https://github.com/pre-commit/pre-commit-hooks/issues/683

Nov 11 '21 17:11 scop

Don't know if I'm too happy with the platform dependent default, perhaps defaulting to utf-8 would be better. No strong opinions between those.

Nov 11 '21 17:11 scop

Also, incremental decoding instead of read() slurp would be nicer.

Can polish once we establish this will be accepted in the first place.

Nov 11 '21 17:11 scop

From https://github.com/pre-commit/pre-commit-hooks/issues/683#issuecomment-970279641

any decent programming language already presents encoding issues as syntax errors so I'm not sure where it would be useful

Sadly, not all developers run test suites or test their code before pushing code to repos. Not all repos even have test suites for that matter. It's useful to be able to catch errors locally in pre-commit before they hit CI, or even staging/production.

Anyway, I think the opinion what qualifies as a "decent programming language" belongs here.

No matter what one thinks of them, for example C and JavaScript and Perl code can be encoded let's say in ISO-8859-1 without any encoding markers. I don't think there even is a way to convey that at least for C code, not sure for JS; for Perl there is use utf8; but I don't know if something similar exists for other encodings, and there might be reasons why one wants to have files in some particular, other encoding. And such files compile and run just fine.

Chances are that the ISO-8859-1 in them is not the intention though and they now don't behave correctly, perhaps one wanted UTF-8 instead. Or perhaps one wanted code that is not that brittle wrt source encoding in the first place and for that reason wants to restrict everything to ASCII to avoid bugs of this class altogether.

Or maybe it's not a runtime issue, one just wanted some comments/API docs in them in UTF-8 but a badly configured editor did it in ISO-8859-1 or vice versa, and now docs are not being generated right. This check would help.

From https://github.com/pre-commit/pre-commit-hooks/issues/683#issuecomment-970334744

I understand but the vast majority of programming is writing code so I don't think it's that useful

What I'm trying to say is that just like pre-commit, this check can be useful for repos that don't have anything whatsoever to do with programming. Websites built with markdown, static HTML + CSS, other text content, including plain text, for whatever purpose.

Nov 16 '21 21:11 scop

any decent programming language already presents encoding issues as syntax errors so I'm not sure where it would be useful

Sadly, not all developers run test suites or test their code before pushing code to repos. Not all repos even have test suites for that matter. It's useful to be able to catch errors locally in pre-commit before they hit CI, or even staging/production.

yeah agreed, but I don't think it's our place to supply those checks when a language-specific tool can provide better diagnostic SyntaxErrors when it's a problem for them

my main concerns with this are:

it's not something I would ever see myself using, or recommending to others
it should default to UTF-8 and not platform specific
it's being proposed as a solution to bidi / homoglyphs / etc. of which I strongly believe it is not a proper solution to this
I think this particular "use X encoding for Y language" is better solved in language-specific tooling
if we accept this it'll be something more to maintain, I can already see the "please read my emacs comment headers and use that encoding" request which I really don't have an interest in supporting

Nov 16 '21 21:11 asottile

Some counterarguments:

No problem changing the default, UTF-8 probably is the better choice.
We can forget about the #683 considerations, and people who know what they're doing can do what they want with this.
I don't think language specific tooling is really possible for a variety of cases this applies to.
There already is a bunch of checks here that hardcode an encoding or do the right thing with some specific ones only, and there's no support for those comments headers or editorconfigs or whatnot in them either.

Nov 16 '21 22:11 scop

This would be useful in checking any kind of text files are encoded in the desired user-configurable encoding. Regardless of whether it treats or mitigates the CVE, this has value in tons of repositories.

Also, language-specific tooling means a lot of different things to a lot of people. I use vim and write a lot of terraform code and yaml for kubernetes deployments.

At least for Terraform, the cli will error out if, for example, you read a file with the file() function and that file "contains invalid UTF-8 sequences." However, that's not really what I'd call "language-specific tooling" because it depends on the user actually trying to run terraform apply and it operates on files that may exist locally in a CI system or the user's local filesystem where the terraform command is running, but external to the repo itself. So that doesn't confirm that the actual HCL config files for terraform are encoded in UTF8.

What I need is exactly this pre-commit hook to prevent files that have any encoding other than UTF-8 from being committed in the first place.

Respectfully, I request the addition of this hook.

Jan 29 '22 17:01 tspearconquest

pre-commit-hooks pre-commit-hooks copied to clipboard

Add check for text file encodings

pre-commit-hooks
pre-commit-hooks copied to clipboard