Security concern with Unicode bidirectional characters

Open PureFox48 opened this issue 3 years ago • 7 comments

Code Project highlighted this article in its Daily News bulletin today. Although it's nothing new - the Go community have been aware of it since 2017 - I thought it was worth bringing it to everyone's attention.

The following simple example, which uses the Unicode right-to-left override character (U+202E), illustrates the concern:

var v = "my-text.\u202ecod.exe"
System.print(v)

This is actually displayed as my-text.exe.doc, making it look like a doc file when in reality it's an exe!
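
To see why the display is misleading, it can help to look at the string in its stored (logical) order rather than as rendered. The following is only a minimal sketch in Python, used here as a neutral scripting language and not part of Wren; the helper name `real_extension` is hypothetical.

```python
import os

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE

# Same content as the Wren string above, in stored (logical) order.
name = "my-text." + RLO + "cod.exe"

def real_extension(filename):
    """Return the extension as stored, regardless of how the name is rendered."""
    return os.path.splitext(filename)[1]

# ascii() escapes non-ASCII characters, which exposes the hidden override.
print(ascii(name))           # 'my-text.\u202ecod.exe'
print(real_extension(name))  # .exe
```

The stored name still ends in .exe; only the rendering is reversed after the override character.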

I'm not suggesting we should try and do something about this in Wren itself - I don't know what we could do anyway. We'll just have to leave it to the host compiler and/or tools.

However, one thing is clear. If anyone has any lingering doubts about the wisdom of using unrestricted Unicode identifiers (#948), this is yet another reason why it would be a bad idea.

PureFox48 avatar Nov 02 '21 18:11 PureFox48

Rust has disallowed the unescaped use of those codepoints in code since 1.56.1, which IMHO is a fairly good solution to the problem.
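
As a rough illustration of that kind of check (not Rust's actual implementation, and not anything Wren does today), a tool could scan source text for raw directional controls and report them, while letting escape sequences through. The sketch below is Python; the names `BIDI_CONTROLS` and `find_bidi_controls` are hypothetical, and the set is assumed to be the embedding, override and isolate controls U+202A to U+202E and U+2066 to U+2069.

```python
# Assumed set: explicit directional embedding, override and isolate controls.
BIDI_CONTROLS = frozenset(
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
)

def find_bidi_controls(source):
    """Return (line, column, 'U+XXXX') for each raw bidi control in source."""
    hits = []
    # Split on "\n" only, so line numbers match what most editors show.
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col_no, "U+%04X" % ord(ch)))
    return hits

# The Wren escape sequence is plain ASCII in the source file, so it passes.
escaped = 'var v = "my-text.\\u202ecod.exe"'
# Here Python's own escape yields the raw character, standing in for a
# source file that has the control embedded directly.
raw = 'var v = "my-text.\u202ecod.exe"'

print(find_bidi_controls(escaped))  # []
print(find_bidi_controls(raw))      # [(1, 18, 'U+202E')]
```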

ChayimFriedman2 avatar Nov 02 '21 23:11 ChayimFriedman2

Not sure if my comment is relevant, but while it is something to be aware of, I think it is first and foremost a context-of-use problem.

While it can have some impact on users of right-to-left languages, I don't think every text context should allow mixed text directions. At a minimum, identifiers (and, by extension, filenames) should not allow them, for security reasons, and such characters should be shown raw/escaped rather than in their rendered UTF form. So basically it is more an editor problem than a language concern.

mhermier avatar Nov 02 '21 23:11 mhermier

It is indeed a trivial solution, though I don't find it a really civilized one...

mhermier avatar Nov 03 '21 00:11 mhermier

Well, the simple solutions are often the best ones and, if we were to do something, I think the Rust solution is well worth considering.

Now that the problem has been publicized, it will be interesting to see whether some sort of consensus emerges amongst the major languages on how best to deal with it.

PureFox48 avatar Nov 03 '21 01:11 PureFox48

Well, the biggest issue is that this Unicode character (and probably a few others, like the BOM and accent modifiers) invalidates UTF as a character encoding and makes it a format encoding. So to me, it is more an editor problem. Any sane editor should display formatting modifiers rather than render them (or at least allow switching between the two modes). At the end of the day, UTF as a binary format is successful, but the interpretation of the information it transports is becoming a failure as time passes.
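
A "display, don't render" mode of this kind could be approximated as follows. This is only a sketch in Python under one stated assumption: that "formatting modifiers" means characters in Unicode general category Cf (which covers the bidi controls and the BOM, but not combining accents, which are category Mn). The function name `show_format_chars` is hypothetical.

```python
import unicodedata

def show_format_chars(text):
    """Replace every format character (general category Cf) with a visible tag."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            out.append("<U+%04X>" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

print(show_format_chars("my-text.\u202ecod.exe"))  # my-text.<U+202E>cod.exe
```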

mhermier avatar Nov 03 '21 07:11 mhermier

This is probably right, and some editors, at least, do provide options to control that, but the reality is that many (GitHub reviews, for example) don't, and not everyone changes these settings. So the question is: do we want to expose our users to risk?

ChayimFriedman2 avatar Nov 03 '21 08:11 ChayimFriedman2

It's interesting to read here what the Rust team have actually said and are doing about this issue.

With Rust already beginning to nibble at their lunch, the C/C++ standards committees may feel that they should be seen to be doing something about this too.

Apparently there are 9 code points involved with text direction, in two blocks of 5 and 4 (U+202A to U+202E and U+2066 to U+2069), but I wonder how many coders actually know this and, if their editor allows it, would examine imported source code to see whether they're present. I didn't realize these code points even existed until the issue came up in Go four years ago.
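
For reference, the two blocks can be listed with their official names using Python's standard `unicodedata` module. This is just a small sketch; it assumes the nine code points meant are U+202A to U+202E and U+2066 to U+2069, and the list name `DIRECTION_CONTROLS` is hypothetical.

```python
import unicodedata

# Assumed to be the nine code points in question: a block of 5 and a block of 4.
DIRECTION_CONTROLS = list(range(0x202A, 0x202F)) + list(range(0x2066, 0x206A))

for cp in DIRECTION_CONTROLS:
    # unicodedata.name() returns the character's official Unicode name.
    print("U+%04X  %s" % (cp, unicodedata.name(chr(cp))))
```

The first block holds the embedding and override controls and the second holds the isolate controls.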

PureFox48 avatar Nov 03 '21 09:11 PureFox48

However, one thing is clear. If anyone has any lingering doubts about the wisdom of using unrestricted Unicode identifiers (#948), this is yet another reason why it would be a bad idea.

Actually, it is not clear. What has that to do with identifiers? You used a string as an example, referring to a file name outside of the program. That has nothing to do with identifiers.

aosenkidu avatar Jan 31 '23 03:01 aosenkidu

The issue here is about allowing more characters in identifiers. Since we don't have a UTF-8 library dependency, we can't rely on one to classify characters. So this is mostly the real problem.
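
To illustrate the point about classification: an ASCII-only identifier rule can be checked byte by byte with no Unicode tables at all, whereas anything wider needs per-character data. The sketch below is Python, not Wren's lexer; it assumes the rule "a letter or underscore, followed by letters, digits or underscores", and the name `is_ascii_identifier` is hypothetical.

```python
def is_ascii_identifier(raw):
    """Check raw UTF-8 bytes against [A-Za-z_][A-Za-z0-9_]* using byte ranges only."""
    def is_name_start(b):
        # '_' or 'A'..'Z' or 'a'..'z'
        return b == 0x5F or 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A

    def is_digit(b):
        # '0'..'9'
        return 0x30 <= b <= 0x39

    if not raw or not is_name_start(raw[0]):
        return False
    # Any byte >= 0x80 (part of a multi-byte UTF-8 sequence) fails both tests.
    return all(is_name_start(b) or is_digit(b) for b in raw[1:])

print(is_ascii_identifier(b"my_var1"))                # True
print(is_ascii_identifier("a\u202eb".encode("utf-8")))  # False
```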

mhermier avatar Jan 31 '23 06:01 mhermier

I'm going to close this issue as, personally, I don't think it's something we should try to address in Wren itself; it's more a problem for the host.

If anyone strongly disagrees, then I'll reopen it again.

PureFox48 avatar Mar 05 '23 12:03 PureFox48