stdlib Escape backspace etc graphemes in string.inspect

trafficstars

Fixed by #602

Edited 5/24, 8:00 PM PST

I was having issues related to accidentally interpreting binary data as text, resulting in the following error output from gleeunit:

Eventually, I figured out that there's a 0x08 character in there, which is causing the first " to get deleted, then a 0x40 (@) immediately following it takes its place, resulting in strange behavior that looked just close enough to intentional to stump me for a bit.

Ideally, 0x08 should not be doing anything here, since it can easily destroy the output and make it unreadable. Other ASCII escape character like 0x0B (escape) have similar effects here. This could be done by having io.print-like functions replace these sequences with hex (0x08) or escape codes (\b, \u{0008}).

This is most severe for gleeunit, where the developer depends on its output to properly debug their program, but I was also having this happen with io.debug and io.print as well (where this behavior might be intentional).

Theoretically, to recreate this you just need to write 0x08 to a file, load it (such as with the file_streams package), then get it to print to screen (either by io.debug or using gleeunit should.equal to get it to print on test fail) and weird things will happen. I imagine having several 0x08s in the string will cause even more damage to the output.

May 22 '24 10:05 Michael-Mark-Edu

Thank you

May 22 '24 20:05 lpil

Here are all the escape codes that are relevant to this: gleam_issue600

(code 127 is in the "invisible" category, codes 128+ are UTF-8 and shouldn't have special functionality)

May 22 '24 22:05 Michael-Mark-Edu

This also seems to be an Erlang-specific issue, since Javascript just turns everything into unicode

May 22 '24 22:05 Michael-Mark-Edu

Shouldn't form feed also be green in your list? It seems to escape fine:

import gleam/io
import gleam/list
import gleam/string

fn cp(n: Int) -> UtfCodepoint {
  let assert Ok(cp) = string.utf_codepoint(n)
  cp
}

pub fn main() {
  [34, 92, 13, 10, 9, 12]
  |> list.map(cp)
  |> string.from_utf_codepoints
  |> string.inspect
  |> io.println
}

// prints: "\"\\\r\n\t\f"

See the erlang code here: https://github.com/gleam-lang/stdlib/blob/fe517817bd4a58904e67fd462de817fccaaba92e/src/gleam_stdlib.erl#L490

May 22 '24 22:05 mooreryan

I suppose fixing may be as simple as adding more control characters that should potentially be escaped there in that function.

May 22 '24 22:05 mooreryan

I opened a PR that should hopefully fix this issue. Null characters affecting string comparison is untouched because it's arguably intentional behavior.

May 22 '24 23:05 Michael-Mark-Edu

@mooreryan The \f patch is unreleased, and I originally did testing on the public build. I tested it lightly in the unreleased build and the \f patch does seem to work.

May 22 '24 23:05 Michael-Mark-Edu

I'm not sure what you mean by the \f patch. The pull request you link doesn't change behavior with the form feed character, as it is already properly escaped in the code.

Edit: oh, I think you mean this patch.

May 23 '24 00:05 mooreryan

This comment (https://github.com/gleam-lang/stdlib/pull/602#discussion_r1613688568) has me thinking, should string.inspect handle more of the first 32 non-printable ascii characters?

The pull request #602 adds handling for \b \v and \e, however, it may be useful to also show the Gleam escape syntax for other non-printable characters. Going back to the original motivating example, if more of the non-printable characters were handled by string.inspect, then the diff would look something like this:

expected: Ok("abc123")
     got: Ok("\u{0008}abc123")

which seems more helpful.

May 24 '24 17:05 mooreryan

Oh that's clever. We could use that syntax for any invisible grapheme- is that what you're saying? How would we identify which ones are invisible?

May 24 '24 18:05 lpil

I made a chart with the invisible ones earlier in this thread. Theoretically, anything not being converted and has a value <32 is invisible (and 127). The conversion list just does the first enter found (or maybe not, I don't know Erlang well), so maybe we can just add a conversion rule to <32 at the end of the list. And 127 too.

I prefer \b over \u{08} when possible, but I don't think a lot of these invisible ones have a well defined C escape so it may be inevitable.

May 24 '24 20:05 Hyperion-21

Oh that's clever. We could use that syntax for any invisible grapheme- is that what you're saying? How would we identify which ones are invisible?

Yes that would be the general idea. The "simplest" solution may be as @Hyperion-21 says, and just convert the beginning of the ascii table. However, the point of identifying which are invisible is trickier than just taking ascii values < 32. For one thing, there are many "non printing" things outside of that range when you consider unicode...check it out:

import gleeunit
import gleeunit/should

pub fn main() {
  gleeunit.main()
}

pub fn a_test() {
  let x = "a b"
  let y = "a\u{0020}b"

  should.equal(x, y)
}

pub fn b_test() {
  let x = "a b"
  let y = "a\u{00A0}b"

  should.equal(x, y)
}

pub fn c_test() {
  let x = "a\u{0020}b"
  let y = "a\u{00A0}b"

  should.equal(x, y)
}

which yields:

Failures:

  1) invisible_chars_test.b_test
     Values were not equal
     expected: "a b"
          got: "a b"
     output: 

  2) invisible_chars_test.c_test
     Values were not equal
     expected: "a b"
          got: "a b"
     output:

Those all look like spaces, but they're not the same. So, the "ideal" string.inspect function may somehow account for that. But it is getting trickier, and so maybe should be left to some 3rd party library? (not sure about that).

Second, you could imagine going beyond "invisible" characters. Check out this classic example:

pub fn e_accent_test() {
  let e1 = "\u{00E9}"
  let e2 = "\u{0065}\u{0301}"

  should.equal(e1, e2)
}

and that yields this:

  3) invisible_chars_test.e_accent_test
     Values were not equal
     expected: "é"
          got: "é"
     output:

Which both look like the same e with accute accent.

Both of the outputs shown in those three failures could be considered pretty unhelpful, and worth treating, but, it is complicated, so I'm not sure how complex the string.inspect should be. It should probably be examined what some common other languages do.

My point the semantics of string.inspect could get tricky, and I'm not sure how far the escaping should be taken, though it could be potentially useful.

I prefer \b over \u{08} when possible, but I don't think a lot of these invisible ones have a well defined C escape so it may be inevitable.

It's true that \b is nice, but it is not valid gleam syntax.

May 24 '24 20:05 mooreryan

Is there anything stopping Gleam from supporting an extended set of escape codes? I found an old line in the compiler's changelog saying "Gleam now only supports \r, \n, \t, \", and \\ string escapes" which makes me think this is an intentional decision... but why? It seems like an arbitrary decision.

I'll update #602 momentarily to match the \u syntax. I'll also see if I can get it to show the invisible graphemes.

May 24 '24 22:05 Michael-Mark-Edu

Alright, that's done.

May 25 '24 00:05 Michael-Mark-Edu

Edited the parent post of this thread to better represent the current state of the issue/pr.

May 25 '24 03:05 Michael-Mark-Edu

stdlib stdlib copied to clipboard

Escape backspace etc graphemes in string.inspect

stdlib
stdlib copied to clipboard