LangSec compliance
@howardwu Also tagging @bendyarm, who first brought up the Unicode bidi override issue for Leo a while ago.
This is my assessment for ASCII:
- 32 (space) is fine.
- 33-126 (visible chars) are fine.
- 10 (line feed) and 13 (carriage return) are fine.
- 9 (horizontal tab), 11 (vertical tab), 12 (form feed) should be fine, but I see no good reason to allow them.
- The others should be disallowed, particularly 8 (backspace — see above) and 127 (delete).
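
To make the list above concrete, here is a minimal sketch of it as a predicate in Rust. The name `is_allowed_ascii` and the `allow_extra_whitespace` flag are hypothetical, not existing Leo or snarkVM APIs. The flag only exists because the list leaves open whether 9, 11, and 12 should be accepted.

```rust
/// Hypothetical helper mirroring the ASCII assessment above.
/// `allow_extra_whitespace` decides the open question for codes 9, 11 and 12.
fn is_allowed_ascii(code: u8, allow_extra_whitespace: bool) -> bool {
    match code {
        // Space.
        32 => true,
        // Visible characters.
        33..=126 => true,
        // Line feed and carriage return.
        10 | 13 => true,
        // Horizontal tab, vertical tab, form feed: harmless, but not clearly needed.
        9 | 11 | 12 => allow_extra_whitespace,
        // Everything else, including 8 (backspace), 127 (delete) and non-ASCII bytes.
        _ => false,
    }
}
```

Writing it as an exhaustive `match` with a final catch-all that rejects means any code not explicitly listed is disallowed by default.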
For Unicode, the bidi overrides that this PR excludes are well-known dangers, but I don't know offhand whether other characters pose similar issues. I believe we should assess all non-ASCII Unicode characters (well, by ranges, not one by one 😄), similarly to the above list for ASCII, and only allow what we have good reasons to allow.
This is an instance of the general security principle to prefer whitelists to blacklists, i.e. thinking explicitly of what we want to allow and why, rather than saying "anything but...", since the latter may lead to unexpected vulnerabilities, cf. CWEs such as this general one.
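
As a rough illustration of this principle, here is a sketch of a checker that rejects any character not explicitly allowed. All names here (`Disallowed`, `ALLOWED_NON_ASCII`, `is_allowed`, `check_input`) are hypothetical. The non-ASCII list is deliberately empty as a placeholder for whatever ranges the assessment ends up justifying, and tab, vertical tab, and form feed are left out under the stricter reading of the ASCII list.

```rust
/// The first character in the input that is not explicitly allowed.
#[derive(Debug, PartialEq)]
struct Disallowed {
    byte_offset: usize,
    ch: char,
}

/// Hypothetical allowlist of inclusive non-ASCII ranges.
/// Empty here: each range should be added only with a stated reason.
const ALLOWED_NON_ASCII: &[(char, char)] = &[];

/// Allowlist check: a character passes only if it is explicitly listed.
fn is_allowed(ch: char) -> bool {
    match ch {
        // Space and visible ASCII (32..=126), line feed, carriage return.
        ' '..='~' | '\n' | '\r' => true,
        // Anything else must fall inside an explicitly allowed range.
        _ => ALLOWED_NON_ASCII
            .iter()
            .any(|&(lo, hi)| lo <= ch && ch <= hi),
    }
}

/// Scans the whole input and reports the first disallowed character, if any.
fn check_input(text: &str) -> Result<(), Disallowed> {
    match text.char_indices().find(|&(_, ch)| !is_allowed(ch)) {
        Some((byte_offset, ch)) => Err(Disallowed { byte_offset, ch }),
        None => Ok(()),
    }
}
```

With this shape, a bidi override such as U+202E is rejected without ever being named, and so is any other problematic character nobody has thought of yet. A blacklist would have to enumerate each one.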
While we are on this topic, I'd like to advocate a LangSec (= language-theoretic security) approach (also mentioned in the above CWE) for our handling of all inputs to our tools, not just for "proper" languages like Leo and Aleo instructions, but also for other input data. We use 3rd-party crates for parsing things like TOML and YAML, which I think is fine for expediency, but I'm not sure how well those crates follow LangSec principles (maybe they do, but it's something we should look into).
Originally posted by @acoglio in https://github.com/AleoHQ/snarkVM/issues/975#issuecomment-1218260751
As a follow-up, the above should apply to all our parsers of text files: code files, input files, configuration files, etc. Every input to the system should be thoroughly checked.