stdlib icon indicating copy to clipboard operation
stdlib copied to clipboard

Prevent control codes actioning in string.inspect

Open Michael-Mark-Edu opened this issue 1 year ago • 13 comments
trafficstars

Fixes #600

Unfortunately, Gleam doesn't recognize \b or \v as valid escape codes, which I believe makes it impossible to create a test for these changes. I'll likely open a new issue about making those valid escapes.

It should also be noted that some escape codes are "invisible" but still count for string equality, resulting in the amusing bug of "abc123" != "abc123" because of null characters in one of the strings. However, this is arguably intentional behavior and I'll let a maintainer decide if anything should be done about that.

Michael-Mark-Edu avatar May 22 '24 23:05 Michael-Mark-Edu

Erlang also recognizes \e (ASCII code 27), should that also be added? (And potentially bell (\a) (ASCII code 7)--maybe not, Erlang doesn't use this one, but some languages do.

mooreryan avatar May 23 '24 01:05 mooreryan

Oh right, \e. I tested it and found that it was actioning, but I forgot to patch it in the commit. Going to add a new commit now

Michael-Mark-Edu avatar May 23 '24 01:05 Michael-Mark-Edu

@lpil alright, tests have been added (using \u) and they are passing. You can try to comment out the lines added to gleam_stdlib.erl and the tests will fail.

Michael-Mark-Edu avatar May 23 '24 19:05 Michael-Mark-Edu

I added code that generalizes everything <32 that isn't already explicitly on the list. Also changed it to use \u{xxxx} syntax.

Michael-Mark-Edu avatar May 25 '24 00:05 Michael-Mark-Edu

I'm not necessarily saying that characters from [0, 31] should not be escaped in this way, but I do think it makes it a bit confusing why other control characters (eg some of these) are not escaped in the same way.

The previous behavior could be explained by saying, "we only escape valid gleam control character syntax" or something like that, and I suppose this behavior could be explained by stating that ascii control characters are now also escaped, but it still opens the question as to why the other control characters are not escaped. (See the comment here: https://github.com/gleam-lang/stdlib/issues/600#issuecomment-2130343645).

Regardless, I think it bears discussing at least.

mooreryan avatar May 25 '24 00:05 mooreryan

Does UTF-8 have escape characters beyond outside of 7-bit ASCII range? I've been operating on the assumption that everything after 127 is unicode and that no escape characters exist inside the unicode set.

I'm looking at UTF-8 character sets right now, and it seems 0x80-0x9F may be escape codes that I didn't realize existed. If they are, then they should probably be added to the \u list.

Michael-Mark-Edu avatar May 25 '24 01:05 Michael-Mark-Edu

Alright, it uses io_lib:format now. Also, control characters 128-159 are now being handled.

Michael-Mark-Edu avatar May 25 '24 02:05 Michael-Mark-Edu

Does UTF-8 have escape characters beyond outside of 7-bit ASCII range? I've been operating on the assumption that everything after 127 is unicode and that no escape characters exist inside the unicode set.

I'm looking at UTF-8 character sets right now, and it seems 0x80-0x9F may be escape codes that I didn't realize existed. If they are, then they should probably be added to the \u list.

I'm not sure if they are generally "escaped" by languages (I think not in the sense of \n or something), but they would probably be better represented as \u{XXXX}.

mooreryan avatar May 25 '24 02:05 mooreryan

I'm not sure if they are generally "escaped" by languages (I think not in the sense of \n or something), but they would probably be better represented as \u{XXXX}.

I think they should. According to the website you linked, the codes between 0-31 and 127-159 consist of exactly all the control codes in UTF-8. I haven't tested them, but I presume the second set of codes has the same effects of the first (invisible graphemes, actioning) and so should be handled with in the same way too.

Michael-Mark-Edu avatar May 25 '24 02:05 Michael-Mark-Edu

I think they should.

I think we are saying the same things, but using different terms.

For example, I mean the unicode codepoint U+000D can be represented by this escape sequence \u{000D} as well as this one \r. When you mentioned "Does UTF-8 have escape characters beyond outside of 7-bit ASCII range?" I thought you were meaning does there exist other characters commonly represented by the "classic" c-style escape sequences like \r, \t, etc, that are outside of the range of 0-127. To that, I'm not sure, but there are definitely control characters that probably most naturally would be good to represent with the \u{XXXX} style escape sequences. (Whether that belongs in this PR is potentially a separate question.)

The other thing I'm interested in is the fact there there are many more "invisible" characters way outside the normal ascii range (eg the En Space. Should these be printed as-is, or should the also be translated to escape sequences? And what about the example of the "e" with accent? (There are many like that as well.) What I'm not sure of is how far to go, so to speak, with the escaping, either in this PR, future PRs, or maybe beyond the scope of the stdlib in a separate library.

mooreryan avatar May 25 '24 03:05 mooreryan

When you mentioned "Does UTF-8 have escape characters beyond outside of 7-bit ASCII range?" I thought you were meaning does there exist other characters commonly represented by the "classic" c-style escape sequences like \r, \t, etc, that are outside of the range of 0-127.

I was referring to the mere presence of control characters outside of the 7-bit range. I wasn't thinking about if any of them were common enough to have a C-escape sequence (according to Wikipedia none of them do).

The other thing I'm interested in is the fact there there are many more "invisible" characters way outside the normal ascii range (eg the En Space.

As of now, I'd argue to keep them because the UTF-8/Unicode standard defines those characters to be the way they are and to be distinct, even if I personally disagree with the standard. It's unlike the invisible graphemes where it's impossible to detect them without viewing binary data.

Potentially, there could be a configuration flag to gleeunit to "ASCII-ify" debug output and escape all characters that are >127. This would be optional because it limits functionality, but it would be potentially useful as an optional flag. What this flag would actually be (CLI option? gleeunit.main() parameter? gleam.toml?) isn't something I'm sure of, but I think it's an idea worth considering.

Michael-Mark-Edu avatar May 25 '24 03:05 Michael-Mark-Edu

Could we please get a workflow approval for this PR? I want to see ~~green~~ nvm the JS test failed.

Michael-Mark-Edu avatar May 25 '24 03:05 Michael-Mark-Edu

Looks like you have a failing test when running on the JavaScript target

lpil avatar May 25 '24 10:05 lpil

Looks like JavaScript target does things differently when it comes to escape codes compared to Erlang. This seems to arise from the additional tests I added, as the only modifications I made to the original source were to the .erl file.

The specific failure of the JavaScript test was "\"\\b\"" should equal "\"\\u{0008}\"" so it seems the JavaScript target is automatically converting \u{0008} into \b. \b is not anywhere in the test file so this must be JavaScript's doing. Since \b is not valid Gleam syntax, this seems like a unrelated bug with the JavaScript target.

Since this PR was originally meant to fix a bug with the Erlang target, I am unsure if patching this JavaScript bug is within the scope of this PR.

Michael-Mark-Edu avatar May 25 '24 23:05 Michael-Mark-Edu

I did some grepping in the stdlib and couldn't find any mention of \b related to JavaScript, so I think this might be functionality related to JavaScript/NodeJS. If so, then we will likely have to override whatever function is generating the \b on the JavaScript side.

Michael-Mark-Edu avatar May 25 '24 23:05 Michael-Mark-Edu

It also seems that unrecognized unicode gets converted into \uxxxx instead of \u{xxxx} format when targeting JavaScript, so none of the other tests I added should be passing either.

Michael-Mark-Edu avatar May 25 '24 23:05 Michael-Mark-Edu

Going down the chain of nested function calls, I eventually came across JSON.stringify(v) which seems to be where the \b and \u000B are coming from. Testing it in the node CLI, I'm getting the non-Gleam escape sequences. I think we may need to replace JSON.stringify(v) with our own function here if we don't want parity issues between JavaScript and Erlang, or JavaScript producing invalid Gleam strings.

Michael-Mark-Edu avatar May 26 '24 00:05 Michael-Mark-Edu

Alright, I bit the bullet and decided to do the JS implementation myself. Now, the output of string.inspect should be the same between Erlang and JS.

I'm not experienced with JS at all, so I am requesting code review on commit 1cec04a.

Michael-Mark-Edu avatar May 26 '24 03:05 Michael-Mark-Edu

@lpil this PR should be ready for another workflow test (doing the commands in the workflow on local, everything passes) and code review.

Michael-Mark-Edu avatar May 27 '24 00:05 Michael-Mark-Edu

Sorry, you'll need to rebase on main to have the tests run due to the merge conflict

lpil avatar May 29 '24 10:05 lpil

Alright, merge conflict should be resolved. Seems like git got confused and triggered a conflict even though the changelog didn't have any serious conflicts.

Michael-Mark-Edu avatar May 29 '24 10:05 Michael-Mark-Edu

It seems like some upstream changes are causing gleam format --check to fail on unrelated files through no fault of my own. I'm on my phone right now so I can't commit a format commit right now.

Michael-Mark-Edu avatar May 29 '24 10:05 Michael-Mark-Edu

Hmm I'm not sure why rebasing the branch to remove the merge commit closed the PR. Perhaps something to do with it being called main? I think I made a mistake somewhere. I'll rebase these onto main manually, thank you

lpil avatar May 29 '24 11:05 lpil