hsd
Add unicode lint - allow only ASCII symbols
Motivation
The motivation for this PR is the recent attack using Unicode:
- https://www.trojansource.codes/
- https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-42574
Approaches to addressing the issue vary; a recent example:
- https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html - where they went with deny-by-default for the specific Unicode symbols.
Another Unicode-based issue: https://certitude.consulting/blog/en/invisible-backdoor/
We can be even stricter with our lints: this PR is an example of allowing only ASCII symbols, with separate allowed lists kept for the exceptions that appear out of necessity. This is in case something similar is discovered with a different set of code points.
Details
Simplest form of the allowed list check:

```sh
grep $'[^\u0020-\u007e\t]' ./**/*.js | grep -v '^\./lib/hd/words/\(french\|italian\|japanese\|spanish\|chinese-traditional\|chinese-simplified\)\.js'
```
Simple form of the disallowed list check:

```sh
grep $'[\u202A-\u202E\u2066-\u2069]' ./**/*.js
```
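These checks can be wired into a failing lint step, roughly like the sketch below. This is only an illustration of the idea, not the actual script/unicode-lint.sh added in this PR:

```sh
#!/usr/bin/env bash
# Illustrative sketch only -- NOT the actual script/unicode-lint.sh from this PR.
# Fails when any .js file contains a character outside printable ASCII + tab,
# skipping the non-English mnemonic word lists that legitimately contain non-ASCII.

# Byte-oriented matching: every byte of a UTF-8 multi-byte sequence falls
# outside 0x20-0x7e, so the range below catches all non-ASCII characters.
export LC_ALL=C

allowed=$'[^\u0020-\u007e\t]'   # bash >= 4.2 expands \uXXXX inside $'...'
words='^\./lib/hd/words/\(french\|italian\|japanese\|spanish\|chinese-traditional\|chinese-simplified\)\.js'

if grep -rn "$allowed" --include='*.js' . | grep -v "$words"; then
  echo 'Disallowed (non-ASCII) symbols found, see matches above.' >&2
  exit 1
fi

echo 'unicode lint passed'
```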
Future considerations
There are still files that I had to ignore for this:
- Mnemonic words - any future PR that affects those files (there should not be any, though) or adds new language support needs to go through different checks; maybe we can have `disallowed_list` checks for those (a sketch follows below).
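For example, a minimal sketch of such a `disallowed_list` check scoped to the word lists could look like this. The control-character set simply mirrors the disallowed-list grep above; it is an illustration, not something this PR adds:

```sh
#!/usr/bin/env bash
# Illustrative sketch: the mnemonic word lists are exempt from the ASCII-only
# rule, but we can still reject bidirectional control characters in them.
# Requires a UTF-8 locale so the multi-byte characters in the class are
# matched as characters rather than as individual bytes.

disallowed=$'[\u202A-\u202E\u2066-\u2069]'   # bash >= 4.2 expands \uXXXX

if grep -Hn "$disallowed" ./lib/hd/words/*.js; then
  echo 'Bidirectional control characters found in mnemonic word lists.' >&2
  exit 1
fi
```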
Even though script/unicode-lint.sh itself is not ignored, changes to this file need a more careful review of what gets added to the allowed list, or the older unicode-lint should be used to lint the new one before executing it.
Other projects
This currently does not include other projects; the only thing it checks is node_modules, as of now. But we could potentially have this linter in a separate repo and add it to all other projects as well.
Note
This is a little bit hacky version that is pretty slow, and we can probably do better! Any contributions, or even a different implementation altogether, are welcome.
Pull Request Test Coverage Report for Build 1686961484
- 0 of 0 changed or added relevant lines in 0 files are covered.
- 1205 unchanged lines in 7 files lost coverage.
- Overall coverage increased (+0.2%) to 62.736%
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| lib/covenants/namestate.js | 13 | 89.5% |
| lib/dns/server.js | 38 | 86.53% |
| lib/wallet/rpc.js | 138 | 28.0% |
| lib/wallet/wallet.js | 185 | 74.48% |
| lib/blockchain/chain.js | 253 | 68.09% |
| lib/node/rpc.js | 283 | 30.76% |
| lib/mempool/mempool.js | 295 | 63.15% |
| Total: | 1205 | |
| Totals | |
|---|---|
| Change from base Build 1521306511: | 0.2% |
| Covered Lines: | 21334 |
| Relevant Lines: | 31814 |
💛 - Coveralls
fun fact: BSD `grep` on OSX doesn't work with this script, gotta install GNU `grep` with homebrew...
Oh, I will revert this back to find + grep then, or will check what needs to change. This is just much quicker.
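For reference, the find + grep form might look roughly like this; it is only a guess at the shape, not the code that ends up in the PR, and the character class still relies on bash's `$'\u...'` quoting:

```sh
# Rough guess at a find + grep form of the allowed-list check (not the final
# script): find enumerates the files instead of the ./**/*.js shell glob.
find . -type f -name '*.js' -print0 \
  | xargs -0 grep -Hn $'[^\u0020-\u007e\t]' \
  | grep -v '^\./lib/hd/words/\(french\|italian\|japanese\|spanish\|chinese-traditional\|chinese-simplified\)\.js'
```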
On both Ubuntu 20 and OSX the script is finding "not allowed symbols" in a ton of files... I'm trying to figure out why
Still getting these errors on both machines... I'm not sure how to proceed with this. Could be differences in shell versions or something? I'm starting to wonder if we should write a short program in C to scan all text files for unicode bytes...
OK I've got a great little C program for this:
https://github.com/pinheadmz/unicode-comb
It scans everything, including node modules! I've tested it by pasting one of the bidirectional characters into random files.