txtorcon

Python3 bytes/str and descriptors

Open meejah opened this issue 8 years ago • 3 comments

atagar tells me in #tor-dev that "contact" and "platform" in descriptors can take arbitrary bytes. Everything else from tor-control is ASCII.

But those two special cases can leak through if you do GETINFO desc/* commands. Which means the current release-1.x scheme of treating all Tor-sent data as ASCII will work "nearly all" the time -- but we could still run into stupid decoding issues in this one edge case.

Possible solutions:

  • just set "ignore errors" on decoding, and decode as ASCII all the time anyway
  • ???
  • do not profit
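
For illustration only, here is roughly what the first option does to a "contact" line that contains non-ASCII bytes. The sample bytes are made up, and this is plain Python `bytes.decode`, not txtorcon code:

```python
# Hypothetical "contact" line with a UTF-8 encoded e-acute in it (made-up data).
raw = b'contact Caf\xc3\xa9 Admin <admin@example.com>'

# Strict ASCII decoding is what blows up on this edge case.
try:
    raw.decode('ascii')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='ignore' silently drops every byte that is not ASCII.
as_ignore = raw.decode('ascii', errors='ignore')

# errors='replace' keeps a U+FFFD marker for each undecodable byte instead.
as_replace = raw.decode('ascii', errors='replace')

print(strict_failed)
print(as_ignore)
print(as_replace)
```

With "ignore" the bad bytes vanish without a trace; with "replace" you at least see where something was lost.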

meejah avatar Apr 10 '16 19:04 meejah

FWIW, Stem treats everything as utf-8 and asks for "replace" from the decoder (but does provide an escape-hatch for just get_info): https://gitweb.torproject.org/stem.git/tree/stem/control.py#n1045
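
A minimal sketch of that general approach, assuming a hypothetical helper called `decode_reply` with a hypothetical `raw` flag as the escape hatch. This is not Stem's or txtorcon's actual API, just the idea in a few lines:

```python
def decode_reply(data: bytes, raw: bool = False):
    """Decode control-port bytes as UTF-8, replacing anything undecodable.

    `decode_reply` and `raw` are illustrative names, not a real library API.
    With raw=True the bytes are handed back untouched (the escape hatch).
    """
    if raw:
        return data
    return data.decode('utf-8', errors='replace')


# Valid UTF-8 survives intact.
print(decode_reply(b'contact Caf\xc3\xa9'))

# An invalid byte becomes U+FFFD instead of raising.
print(decode_reply(b'platform Tor \xff'))

# Callers who really need the bytes can still get them.
print(decode_reply(b'platform Tor \xff', raw=True))
```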

meejah avatar Apr 10 '16 19:04 meejah

For more context, it's not even usual to have "full" descriptors (vs. microdescriptors) anyway. This is sounding veeeeeery edge-case to me. Perhaps a survey of current descriptors to see if this is even used is worth it? Also: I think it's better to have one or two descriptors break with a "?" in them rather than force everyone to all-the-time deal with bytes rather than str.
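
A rough sketch of what such a survey could look like, assuming you already have a blob of full descriptors as bytes (for example from a cached descriptor file; where the bytes come from is up to you). The function name `non_ascii_fields` and the sample data are made up:

```python
def non_ascii_fields(descriptor_blob: bytes):
    """Return (keyword, raw_value) pairs for contact/platform lines
    whose value contains any non-ASCII byte. Hypothetical helper."""
    hits = []
    for line in descriptor_blob.split(b'\n'):
        keyword, _, value = line.partition(b' ')
        # bytes.isascii() needs Python 3.7 or newer.
        if keyword in (b'contact', b'platform') and not value.isascii():
            hits.append((keyword.decode('ascii'), value))
    return hits


# Made-up sample: one clean "platform" line, one "contact" line with a Latin-1 byte.
sample = (
    b'router example 192.0.2.1 9001 0 0\n'
    b'platform Tor 0.2.7.6 on Linux\n'
    b'contact J\xf6rg <j@example.com>\n'
)

print(non_ascii_fields(sample))
```

If that list comes back empty (or nearly so) across a real set of descriptors, that would support treating this as an edge case.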

meejah avatar Apr 10 '16 20:04 meejah

More context: the control-spec says a QuotedString or a CString can contain any 8-bit value but doesn't have an explicit encoding. So ... "presume UTF-8 and ignore/stomp any encoding errors and #yolo" I guess?

meejah avatar Mar 22 '17 18:03 meejah