Write escaped string into a buffer

Open lopopolo opened this issue 4 years ago • 4 comments

Hi @BurntSushi,

I'm using bstr to turn a Vec<u8>-like structure into debug strings and error messages. Specifically, I'm working on a Ruby implementation. In Ruby, a String is a Vec<u8> with a default UTF-8 encoding and no guarantee that the bytes are actually valid UTF-8.

bstr is the means by which I interpret these byte vectors as UTF-8 as best I can.

The fmt::Debug implementation on &BStr is very close to what I'd like, but I cannot use it because it wraps the escaped string in quotes. I need control of the output since these strings are being put into error messages.

I've put together this function for writing the escaped representation to an arbitrary fmt::Write (cribbing heavily from the fmt::Debug impl on &BStr).

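/// Write `string` to `f`, escaping valid UTF-8 with `char::escape_debug`
/// and each invalid byte as a `\x` hex escape, without surrounding quotes.
pub fn escape_unicode<T>(mut f: T, string: &[u8]) -> Result<(), WriteError>
where
    T: fmt::Write,
{
    let buf = bstr::B(string);
    for (start, end, ch) in buf.char_indices() {
        // `char_indices` yields U+FFFD for invalid UTF-8. A literal U+FFFD in
        // the input also takes this branch and is written as hex escapes.
        if ch == '\u{FFFD}' {
            for byte in buf[start..end].as_bytes() {
                write!(f, r"\x{:X}", byte)?;
            }
        } else {
            write!(f, "{}", ch.escape_debug())?;
        }
    }
    Ok(())
}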

Here's an example usage:

let mut message = String::from("undefined group name reference: \"");
string::escape_unicode(&mut message, name)?;
message.push('"');
Err(Exception::from(IndexError::new(interp, message)))

I'm trying to generate a message like this:

$ ruby -e 'm = /(.)/.match("a"); m["abc-\xFF"]'
Traceback (most recent call last):
	1: from -e:1:in `<main>'
-e:1:in `[]': undefined group name reference: "abc-\xFF" (IndexError)

Is this patch something you would consider upstreaming?

lopopolo avatar Jan 30 '20 04:01 lopopolo
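
The function in the opening comment depends on bstr and on the author's own WriteError type. As a rough, dependency-free sketch of the same idea (this is not bstr's API), the version below uses only std: it relies on `std::str::from_utf8` and the `valid_up_to` and `error_len` methods of `Utf8Error` to separate valid text from invalid bytes. The names `escape_bytes_to` and `write_escaped` are hypothetical, and the two-digit `{:02X}` padding is a choice made for this sketch.

```rust
use std::fmt::{self, Write};

/// Hypothetical helper: write `bytes` to `out`, escaping valid UTF-8 with
/// `char::escape_debug` and every invalid byte as `\xNN`. No quotes are added.
fn escape_bytes_to<W: Write>(out: &mut W, mut bytes: &[u8]) -> fmt::Result {
    while !bytes.is_empty() {
        match std::str::from_utf8(bytes) {
            Ok(valid) => {
                // Everything that remains is valid UTF-8.
                write_escaped(out, valid)?;
                break;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // `valid` was just validated, so this second decode cannot fail.
                write_escaped(out, std::str::from_utf8(valid).unwrap())?;
                // `error_len` is `None` when the input ends mid-sequence;
                // in that case all remaining bytes are treated as invalid.
                let bad = err.error_len().unwrap_or(rest.len());
                for byte in &rest[..bad] {
                    write!(out, r"\x{:02X}", byte)?;
                }
                bytes = &rest[bad..];
            }
        }
    }
    Ok(())
}

/// Hypothetical helper: write each char of `s` in its `escape_debug` form.
fn write_escaped<W: Write>(out: &mut W, s: &str) -> fmt::Result {
    for ch in s.chars() {
        write!(out, "{}", ch.escape_debug())?;
    }
    Ok(())
}
```

Because `String` implements `fmt::Write`, a call such as `escape_bytes_to(&mut message, b"abc-\xFF")` appends `abc-\xFF` to an existing message, which is the shape of the Ruby error text above.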

This looks reasonableish, yes. I'd like to see its API cleaned up a bit. Namely:

  1. It looks like it should be named escape_debug instead of escape_unicode, since escape_unicode in std converts every character to a Unicode escape.
  2. I think it should be named escape_debug_to since it writes to a fmt::Write. This leaves the door open to adding escape_debug implementations that mirror std, but this doesn't need to be in the initial PR.
  3. Add docs along with an example, consistent with the rest of the API. :-)

Thanks for the good idea!

BurntSushi avatar Jan 30 '20 18:01 BurntSushi
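
To make the naming point concrete, here is a small sketch of how the two std methods differ on the same input. The wrapper names `all_unicode` and `debug_style` are hypothetical and exist only to keep the example short; `str::escape_unicode` and `str::escape_debug` are the std methods being compared.

```rust
/// Escape every character as `\u{...}`, which is what `str::escape_unicode` does.
fn all_unicode(s: &str) -> String {
    s.escape_unicode().to_string()
}

/// Escape only what needs escaping, which is what `str::escape_debug` does.
fn debug_style(s: &str) -> String {
    s.escape_debug().to_string()
}
```

For the input `"a\n"`, `all_unicode` returns `\u{61}\u{a}` while `debug_style` returns `a\n`, so the function proposed in this issue behaves like the latter.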

Thanks. I’ll work on a PR tonight.

lopopolo avatar Jan 30 '20 20:01 lopopolo

Apologies for leading you down the wrong path here, but as noted in #37, I think we should add APIs that mirror std for this as closely as possible. In particular, we should be able to have an escape_debug method that returns an iterator of char values corresponding to the escaped output. The iterator itself can implement fmt::Write for ergonomics.

This is harder to implement, but I think looking at std should give some inspiration. Note that there is an important difference between bstr and std here. std has an escape_debug impl for char, and since a str is just a sequence of encoded chars, its str::escape_debug method can simply defer to the char implementation. We can't really do that in bstr, so the implementation will need to be a bit different.

BurntSushi avatar May 10 '20 12:05 BurntSushi
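
One possible shape for such an iterator, sketched with std only and under several assumptions: `EscapeBytes` and `escape_bytes` are hypothetical names, the sketch buffers escaped chars in a `VecDeque` and allocates for hex escapes (a real implementation would likely avoid both), and it implements `fmt::Display` the way std's `EscapeDebug` does. It decodes one char at a time from the bytes because, as noted above, there is no per-char implementation to defer to.

```rust
use std::collections::VecDeque;
use std::fmt::{self, Write as _};

/// Hypothetical iterator over the chars of a byte string's escaped form.
#[derive(Clone)]
struct EscapeBytes<'a> {
    rest: &'a [u8],
    /// Escaped chars produced for the current item but not yet yielded.
    pending: VecDeque<char>,
}

/// Hypothetical constructor, standing in for an `escape_debug` method.
fn escape_bytes(bytes: &[u8]) -> EscapeBytes<'_> {
    EscapeBytes { rest: bytes, pending: VecDeque::new() }
}

impl<'a> Iterator for EscapeBytes<'a> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        if self.pending.is_empty() && !self.rest.is_empty() {
            let rest = self.rest;
            // A UTF-8 sequence is at most 4 bytes, so a 4-byte window is enough
            // to decode the next char or to detect that it is invalid.
            let window = &rest[..rest.len().min(4)];
            let (ch, bad) = match std::str::from_utf8(window) {
                Ok(s) => (s.chars().next(), 0),
                Err(e) => {
                    // The prefix up to `valid_up_to` is known to be valid.
                    let valid = std::str::from_utf8(&window[..e.valid_up_to()]).unwrap();
                    (valid.chars().next(), e.error_len().unwrap_or(window.len()))
                }
            };
            let consumed = match ch {
                Some(ch) => {
                    self.pending.extend(ch.escape_debug());
                    ch.len_utf8()
                }
                None => {
                    // The window starts with invalid bytes: hex-escape each one.
                    for b in &window[..bad] {
                        self.pending.extend(format!(r"\x{:02X}", b).chars());
                    }
                    bad
                }
            };
            self.rest = &rest[consumed..];
        }
        self.pending.pop_front()
    }
}

impl fmt::Display for EscapeBytes<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Iterate over a clone so that formatting does not consume `self`.
        self.clone().try_for_each(|c| f.write_char(c))
    }
}
```

With this shape, callers can either collect the chars themselves or format the iterator directly with `{}`, which covers the error-message use case from the opening comment without a separate `_to` function.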

I'm sharing this because I believe it's a step towards @BurntSushi's proposed solution (it just needs a mapping from DebugItem -> Iterator<Item=char>), and it is also useful for those who want a non-escaped debug string.

  enum DebugItem<'a> {
      NullByte,
      Escaped(core::char::EscapeDebug),
      HexedChar(char),
      HexedBytes(&'a [u8]),
  }

  impl<'a> std::fmt::Display for DebugItem<'a> {
      fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
          match self {
              DebugItem::NullByte => write!(f, "\\0"),
              DebugItem::Escaped(escaped) => write!(f, "{}", escaped),
              DebugItem::HexedBytes(bytes) => {
                  for &b in bytes.as_bytes() {
                      write!(f, r"\x{:02X}", b)?;
                  }
                  Ok(())
              },
              DebugItem::HexedChar(ch) => write!(f, "\\x{:02x}", *ch as u32),
          }
      }
  }

  fn iter_debug_items<'a>(debug_str: &'a BStr) -> impl Iterator<Item = DebugItem<'a>> {
      debug_str.char_indices()
          .map(|(s, e, ch)| {
              match ch {
                  '\0' => DebugItem::NullByte,
                  '\u{FFFD}' => {
                      let bytes = debug_str[s..e].as_bytes();
                      if bytes == b"\xEF\xBF\xBD" {
                          DebugItem::Escaped(ch.escape_debug())
                      } else {
                          DebugItem::HexedBytes(bytes)
                      }
                  }
                  // ASCII control characters except \0, \n, \r, \t
                  '\x01'..='\x08'
                  | '\x0b'
                  | '\x0c'
                  | '\x0e'..='\x1f'
                  | '\x7f' => {
                      DebugItem::HexedChar(ch)
                  }
                  '\n' | '\r' | '\t' | _ => {
                      DebugItem::Escaped(ch.escape_debug())
                  }
              }
          })
  }

  impl fmt::Debug for BStr {
      #[inline]
      fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
          write!(f, "\"")?;
          for item in iter_debug_items(self) {
              write!(f, "{}", item)?;
          }
          write!(f, "\"")?;
          Ok(())
      }
  }

Michael-J-Ward avatar Sep 30 '22 16:09 Michael-J-Ward
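
The remaining step mentioned in the last comment, going from Display items to an iterator of char, could be sketched roughly as follows. This is an assumption about one way to do it, not code from the thread: `display_chars` is a hypothetical adapter, and `Item` is a cut-down stand-in for `DebugItem` so that the block has no dependency on bstr. It formats each item into a temporary `String`, which a real implementation would probably want to avoid.

```rust
use std::fmt;

/// Hypothetical adapter: flatten any sequence of `Display` items into the
/// chars of their formatted output.
fn display_chars<I>(items: I) -> impl Iterator<Item = char>
where
    I: Iterator,
    I::Item: fmt::Display,
{
    // Each item is formatted into a temporary `String` and then split into chars.
    items.flat_map(|item| item.to_string().chars().collect::<Vec<char>>())
}

/// Cut-down stand-in for the `DebugItem` enum (an assumption for this sketch).
enum Item {
    Escaped(std::char::EscapeDebug),
    HexedByte(u8),
}

impl fmt::Display for Item {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Item::Escaped(escaped) => write!(f, "{}", escaped),
            Item::HexedByte(b) => write!(f, r"\x{:02X}", b),
        }
    }
}
```

Since the adapter only requires `Display`, it would apply unchanged to the full `DebugItem` enum, at the cost of one allocation per item.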