pre-commit-hooks
pre-commit-hooks copied to clipboard
Add check for text file encodings
Refs https://github.com/pre-commit/pre-commit-hooks/issues/683
Don't know if I'm too happy with the platform dependent default, perhaps defaulting to utf-8 would be better. No strong opinions between those.
Also, incremental decoding instead of read()
slurp would be nicer.
Can polish once we establish this will be accepted in the first place.
From https://github.com/pre-commit/pre-commit-hooks/issues/683#issuecomment-970279641
any decent programming language already presents encoding issues as syntax errors so I'm not sure where it would be useful
Sadly, not all developers run test suites or test their code before pushing code to repos. Not all repos even have test suites for that matter. It's useful to be able to catch errors locally in pre-commit before they hit CI, or even staging/production.
Anyway, I think the opinion what qualifies as a "decent programming language" belongs here.
No matter what one thinks of them, for example C and JavaScript and Perl code can be encoded let's say in ISO-8859-1 without any encoding markers. I don't think there even is a way to convey that at least for C code, not sure for JS; for Perl there is use utf8;
but I don't know if something similar exists for other encodings, and there might be reasons why one wants to have files in some particular, other encoding. And such files compile and run just fine.
Chances are that the ISO-8859-1 in them is not the intention though and they now don't behave correctly, perhaps one wanted UTF-8 instead. Or perhaps one wanted code that is not that brittle wrt source encoding in the first place and for that reason wants to restrict everything to ASCII to avoid bugs of this class altogether.
Or maybe it's not a runtime issue, one just wanted some comments/API docs in them in UTF-8 but a badly configured editor did it in ISO-8859-1 or vice versa, and now docs are not being generated right. This check would help.
From https://github.com/pre-commit/pre-commit-hooks/issues/683#issuecomment-970334744
I understand but the vast majority of programming is writing code so I don't think it's that useful
What I'm trying to say is that just like pre-commit, this check can be useful for repos that don't have anything whatsoever to do with programming. Websites built with markdown, static HTML + CSS, other text content, including plain text, for whatever purpose.
any decent programming language already presents encoding issues as syntax errors so I'm not sure where it would be useful
Sadly, not all developers run test suites or test their code before pushing code to repos. Not all repos even have test suites for that matter. It's useful to be able to catch errors locally in pre-commit before they hit CI, or even staging/production.
yeah agreed, but I don't think it's our place to supply those checks when a language-specific tool can provide better diagnostic SyntaxErrors when it's a problem for them
my main concerns with this are:
- it's not something I would ever see myself using, or recommending to others
- it should default to UTF-8 and not platform specific
- it's being proposed as a solution to bidi / homoglyphs / etc. of which I strongly believe it is not a proper solution to this
- I think this particular "use X encoding for Y language" is better solved in language-specific tooling
- if we accept this it'll be something more to maintain, I can already see the "please read my emacs comment headers and use that encoding" request which I really don't have an interest in supporting
Some counterarguments:
- No problem changing the default, UTF-8 probably is the better choice.
- We can forget about the #683 considerations, and people who know what they're doing can do what they want with this.
- I don't think language specific tooling is really possible for a variety of cases this applies to.
- There already is a bunch of checks here that hardcode an encoding or do the right thing with some specific ones only, and there's no support for those comments headers or editorconfigs or whatnot in them either.
This would be useful in checking any kind of text files are encoded in the desired user-configurable encoding. Regardless of whether it treats or mitigates the CVE, this has value in tons of repositories.
Also, language-specific tooling means a lot of different things to a lot of people. I use vim and write a lot of terraform code and yaml for kubernetes deployments.
At least for Terraform, the cli will error out if, for example, you read a file with the file()
function and that file "contains invalid UTF-8 sequences." However, that's not really what I'd call "language-specific tooling" because it depends on the user actually trying to run terraform apply
and it operates on files that may exist locally in a CI system or the user's local filesystem where the terraform command is running, but external to the repo itself. So that doesn't confirm that the actual HCL config files for terraform are encoded in UTF8.
What I need is exactly this pre-commit hook to prevent files that have any encoding other than UTF-8 from being committed in the first place.
Respectfully, I request the addition of this hook.