txtorcon
Python3 bytes/str and descriptors
atagar tells me in #tor-dev that "contact" and "platform" in descriptors can take arbitrary bytes. Everything else from tor-control is ASCII.
But those two special cases can leak through if you do GETINFO desc/* commands. Which means the current release-1.x scheme of treating all Tor-sent data as ASCII will work "nearly all" the time -- but we could still run into stupid decoding issues in this one edge case.
Possible solutions:
- just set "ignore errors" on decoding, and decode as ASCII all the time anyway
- ???
- do not profit
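To make the first option concrete, here is a minimal sketch of "decode as ASCII and ignore errors". The helper name `decode_ascii_lenient` and the sample "contact" line are made up for illustration; this is not txtorcon's actual code, only Python's built-in `bytes.decode` with the `ignore` error handler.

```python
def decode_ascii_lenient(raw: bytes) -> str:
    """Decode control-port bytes as ASCII, silently dropping non-ASCII bytes.

    Hypothetical helper for illustration only, not part of txtorcon.
    """
    return raw.decode("ascii", errors="ignore")


# A made-up "contact" line containing UTF-8 encoded non-ASCII text.
sample = b"contact J\xc3\xbcrgen"

# Pure-ASCII data round-trips unchanged; the two non-ASCII bytes are dropped.
print(decode_ascii_lenient(b"platform Tor on Linux"))
print(decode_ascii_lenient(sample))
```

The trade-off is that the offending bytes vanish without a trace, so the caller cannot tell that anything was lost.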
FWIW, Stem treats everything as utf-8 and asks for "replace" from the decoder (but does provide an escape-hatch for just get_info): https://gitweb.torproject.org/stem.git/tree/stem/control.py#n1045
For more context, it's not even usual to have "full" descriptors (vs. microdescriptors) anyway. This is sounding veeeeeery edge-case to me. Perhaps a survey of current descriptors, to see if this is even used, is worth it? Also: I think it's better to have one or two descriptors break with a "?" in them than to force everyone to deal with bytes rather than str all the time.
More context: the control-spec says a QuotedString or a CString can contain any 8-bit value but doesn't specify an explicit encoding. So ... "presume UTF-8 and ignore/stomp any encoding errors and #yolo" I guess?
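A rough sketch of what "presume UTF-8 and stomp errors" could look like, using Python's built-in `replace` error handler. The function name `decode_control_text` and the sample bytes are assumptions for illustration, not txtorcon or Stem code. Note that Python substitutes U+FFFD (the Unicode replacement character) when decoding, rather than a literal "?".

```python
def decode_control_text(raw: bytes) -> str:
    """Decode control-port bytes as UTF-8, replacing undecodable bytes.

    Hypothetical helper for illustration only; each invalid byte becomes
    U+FFFD instead of raising UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors="replace")


# Valid UTF-8 survives intact (made-up sample data).
valid = decode_control_text(b"contact J\xc3\xbcrgen")

# Arbitrary bytes that are not valid UTF-8 are replaced, not dropped.
broken = decode_control_text(b"contact \xff\xfe")

print(valid)
print(broken.count("\ufffd"))
```

Compared with ignoring errors, the replacement character at least leaves a visible marker where data was mangled, while callers still only ever see str.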