Security concern with Unicode bidirectional characters
Code Project highlighted this article in its Daily News bulletin today. Although it's nothing new - the Go community have been aware of it since 2017 - I thought it was worth bringing it to everyone's attention.
The following simple example, which uses the Unicode right-to-left override character (U+202E), illustrates the concern:
var v = "my-text.\u202ecod.exe"
System.print(v)
This actually prints as my-text.exe.doc, because everything after the override character is displayed in reverse order, making it look like a doc file when in reality it's an exe!
I'm not suggesting we should try and do something about this in Wren itself - I don't know what we could do anyway. We'll just have to leave it to the host compiler and/or tools.
However, one thing is clear. If anyone has any lingering doubts about the wisdom of using unrestricted Unicode identifiers (#948), this is yet another reason why it would be a bad idea.
Rust has disallowed the use of those code points in source code unless they are escaped since 1.56.1, which IMHO is a fairly good solution to the problem.
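To make that idea concrete, here is a minimal sketch of such a check written as a standalone host-side tool. It is not Rust's actual implementation; the function name `find_bidi_controls` and the report format are hypothetical. It flags only raw bidirectional control characters, so a string that spells the character as a `\u` escape, as the example at the top of this thread does, would still pass.

```python
# Hypothetical lint sketch: flag raw bidirectional control characters in source text.
# The nine code points covered are U+202A..U+202E and U+2066..U+2069.
BIDI_CONTROLS = frozenset(
    chr(cp) for cp in list(range(0x202A, 0x202F)) + list(range(0x2066, 0x206A))
)


def find_bidi_controls(source):
    """Return (line, column, 'U+XXXX') for each raw bidi control character."""
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, "U+%04X" % ord(ch)))
    return hits


# Here Python turns \u202e into the real character, simulating a source file
# in which the override was typed (or pasted) raw.
raw = 'var v = "my-text.\u202ecod.exe"'
print(find_bidi_controls(raw))

# A raw Python string keeps the backslash escape as plain ASCII text,
# which is what a source file using the escaped form would contain.
escaped = r'var v = "my-text.\u202ecod.exe"'
print(find_bidi_controls(escaped))
```

A tool like this could run in CI or as a pre-commit hook without any change to the language itself.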
Not sure if my comment is relevant, but while it is something to be aware of, I think it is first of all a problem of usage context.
While it can have some impact on users of right-to-left languages, I don't think every text context should allow mixing text directions. At a minimum, identifiers (and by extension filenames) should not allow direction-control characters for security reasons, and those characters should be displayed raw/escaped rather than in their rendered UTF form. So basically it is more of an editor problem than a language concern.
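As an illustration of the "display them raw/escaped" idea, the sketch below shows how an editor or file listing might render direction-control characters as visible escapes instead of applying them. The helper name `escape_bidi` and the `\uXXXX` output style are assumptions chosen for the example, not an existing editor API.

```python
# Hypothetical display helper: show bidi control characters as visible escapes.
BIDI_CONTROLS = frozenset(
    chr(cp) for cp in list(range(0x202A, 0x202F)) + list(range(0x2066, 0x206A))
)


def escape_bidi(text):
    """Replace each bidi control character with a literal \\uXXXX escape."""
    return "".join(
        "\\u%04x" % ord(ch) if ch in BIDI_CONTROLS else ch for ch in text
    )


# The file name from the example at the top of the thread.
name = "my-text.\u202ecod.exe"

# The escaped form exposes the override and keeps the real ".exe" suffix visible.
print(escape_bidi(name))
```

Text without any of these characters passes through unchanged, so legitimate right-to-left text itself is not affected by this sketch.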
It is indeed a trivial solution, though I don't find it a very civilized one...
Well the simple solutions are often the best ones and, if we were to do something, I think that the Rust solution is well worth considering.
Now that the problem has been publicized, it will be interesting to see whether some sort of consensus emerges amongst the major languages on how best to deal with it.
Well, the biggest issue is that this Unicode character (and probably a few others, like the BOM and accent modifiers) invalidates UTF as a character encoding and makes it a format encoding. So to me, it is more of an editor problem. Any sane editor should display formatting modifiers rather than render them (or at least allow switching between the two modes). At the end of the day, UTF as a binary format is successful, but the interpretation of the information it transports is becoming a failure as time passes.
This is probably right, and editors (some, at least) do provide options to control that, but the reality is that many tools (GitHub reviews, for example) don't, and not everyone changes these settings. So, the question is: do we want to expose our users to risk?
It's interesting to read here what the Rust team have actually said and are doing about this issue.
With Rust already beginning to nibble at their lunch, the C/C++ standards committees may feel that they should be seen to be doing something about this too.
Apparently there are 9 code points involved with text direction, in two blocks of 5 (U+202A to U+202E) and 4 (U+2066 to U+2069), but I wonder how many coders actually know this and, if their editor allows it, would examine imported source code to see if they're present? I didn't realize these code points even existed until the issue came up in Go four years ago.
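For reference, the nine code points can be listed with their official Unicode names using Python's standard `unicodedata` module. This is only a lookup sketch; the names `BIDI_BLOCKS` and `describe_bidi_controls` are made up for the example.

```python
import unicodedata

# The two blocks of direction-control code points: 5 + 4 = 9 in total.
BIDI_BLOCKS = [range(0x202A, 0x202F), range(0x2066, 0x206A)]


def describe_bidi_controls():
    """Return ('U+XXXX', official Unicode name) for each of the nine code points."""
    return [
        ("U+%04X" % cp, unicodedata.name(chr(cp)))
        for block in BIDI_BLOCKS
        for cp in block
    ]


for code, name in describe_bidi_controls():
    print(code, name)
```

The first block holds the embeddings, the overrides, and their terminator; the second holds the isolates and their terminator.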
However, one thing is clear. If anyone has any lingering doubts about the wisdom of using unrestricted Unicode identifiers (#948), this is yet another reason why it would be a bad idea.
Actually, it is not clear. What has that to do with identifiers? You used a string as an example, referring to a file name outside of the program. That has nothing to do with identifiers.
The issue here is about allowing more characters in identifiers. Since we don't have a UTF-8 library dependency, we can't rely on one to classify characters. So that is mostly the real problem.
I'm going to close this issue as I personally don't think it's something we should try to address from Wren - it's more a problem for the host.
If anyone strongly disagrees, then I'll reopen it again.