
String length expressed as byte or character count for bencode

Open trantor opened this issue 3 years ago • 3 comments

Hello.

First of all, thanks a lot for the tool. I am, however, encountering problems when dealing with data encoded with bencode. It's a problem I've come across time and again, and hopefully one you can address. From what I've seen, faq interprets the string length as the number of bytes in the encoded string, which should be fine. However, I guess because the original specs of the format, if we can call them that, were less than crystal clear about what string length meant, there are many implementations around that interpret it as the character count, i.e. in Unicode terms the number of code points in the string. Could you add a variant of the bencode format supported by faq that uses this alternative interpretation of string length? It would make my life a lot easier when dealing with these sorts of systems.

Just for reference, faq would encode (arguably correctly) the JSON { "a": "à" } as the bencoded d1:a2:àe, while the variant format would encode it as d1:a1:àe, assuming UTF-8 encoded strings.
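To make the difference concrete, here is a minimal Python sketch of the two length conventions. It is not faq's code; the function names and the `count_chars` flag are made up for illustration, and it only handles a flat dictionary of strings.

```python
def bencode_str(text: str, count_chars: bool = False) -> bytes:
    """Encode one string as a bencode string (hypothetical helper).

    count_chars=False: the length prefix is the UTF-8 byte count (standard reading).
    count_chars=True:  the length prefix is the code point count (the variant).
    """
    payload = text.encode("utf-8")
    length = len(text) if count_chars else len(payload)
    return str(length).encode("ascii") + b":" + payload


def bencode_flat_dict(d: dict, count_chars: bool = False) -> bytes:
    """Encode a flat str-to-str dict, with keys sorted by their raw bytes."""
    items = sorted(d.items(), key=lambda kv: kv[0].encode("utf-8"))
    body = b"".join(
        bencode_str(k, count_chars) + bencode_str(v, count_chars)
        for k, v in items
    )
    return b"d" + body + b"e"


doc = {"a": "\u00e0"}  # "à" as a single precomposed code point, 2 bytes in UTF-8
standard = bencode_flat_dict(doc)                   # d1:a2:àe
variant = bencode_flat_dict(doc, count_chars=True)  # d1:a1:àe
```

For ASCII-only strings both conventions produce identical output, so the two only diverge once a multi-byte character shows up.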

Thanks in advance.

trantor, May 19 '21 15:05

Hey, that's a pretty interesting problem that I haven't personally run into, even having worked on a pretty widely deployed bencode implementation (on a completely unrelated project).

Do you know of a way to reliably determine whether a file should be read using the variant interpretation or the standard one? Also, do you have any examples of bencode implementations that support this (even if they're in other languages)?

jzelinskie, May 20 '21 19:05

Hello @jzelinskie. Well, apart from falling back on the variant format (or vice versa) when encoding/decoding with the other one fails, I don't think there's a reliable way to distinguish the two. After all, they contain the same data, and they diverged only because of a different/mistaken interpretation of the format.
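A rough Python sketch of that fallback idea: try the byte-count reading first, and only if that fails retry with the code point reading. All names here are hypothetical, it assumes UTF-8 text throughout, and it is a heuristic rather than a reliable detector, since ASCII-only input parses the same way under both readings.

```python
def _parse(data, pos, to_text):
    """Parse one bencode value from data at pos; return (value, next_pos).

    data is either bytes or str. String lengths are counted in elements of
    that sequence: bytes for bytes input, code points for str input.
    """
    is_bytes = isinstance(data, bytes)
    colon, end = (b":", b"e") if is_bytes else (":", "e")
    tag = data[pos:pos + 1]
    if is_bytes:
        tag = tag.decode("latin-1")  # compare tags as str in both modes

    if tag == "i":  # integer: i<digits>e
        stop = data.index(end, pos)
        return int(data[pos + 1:stop]), stop + 1

    if tag in ("l", "d"):  # list or dictionary, both terminated by "e"
        items, pos = [], pos + 1
        while data[pos:pos + 1] != end:
            item, pos = _parse(data, pos, to_text)
            items.append(item)
        if tag == "l":
            return items, pos + 1
        keys, values = items[0::2], items[1::2]
        if len(keys) != len(values) or not all(isinstance(k, str) for k in keys):
            raise ValueError("malformed dictionary")
        return dict(zip(keys, values)), pos + 1

    # string: <length>:<payload>
    sep = data.index(colon, pos)
    length = int(data[pos:sep])
    start, stop = sep + 1, sep + 1 + length
    if length < 0 or stop > len(data):
        raise ValueError("bad string length")
    return to_text(data[start:stop]), stop


def decode(data: bytes, count_chars: bool):
    """Decode a whole document under one length convention."""
    if count_chars:
        text = data.decode("utf-8")  # index by code point
        value, pos = _parse(text, 0, lambda s: s)
        total = len(text)
    else:
        value, pos = _parse(data, 0, lambda b: b.decode("utf-8"))
        total = len(data)
    if pos != total:
        raise ValueError("trailing data after top-level value")
    return value


def decode_with_fallback(data: bytes):
    """Try byte lengths first, then code point lengths; report which one worked."""
    try:
        return decode(data, count_chars=False), "bytes"
    except ValueError:  # also covers UnicodeDecodeError
        return decode(data, count_chars=True), "chars"
```

In the byte-count pass, a length that cuts a multi-byte character in half surfaces as a `UnicodeDecodeError`, which is what triggers the retry.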

As for a practical example of software using the interpretation I was referring to, you can look at https://github.com/Zimbra/zm-mailbox/blob/develop/common/src/java/com/zimbra/common/util/BEncoding.java for the serialization functions used by the Zimbra Collaboration Suite in its Java code, i.e. the source of my annoyance ;D . As for a non-internal implementation dealing with this variation on the theme, I've used a Perl module to deal with it, but I think it works only because of the "flexible" way Perl lets you treat a string scalar as a character string unless you specifically force it to be a byte string.

trantor, May 22 '21 14:05

Following up on this, my problem ended up being with an implementation that expresses string length as the count of UTF-16 code units used to represent the string. Pretty far removed from the standard implementation, yet it exists. In the end, given my urgency and the other Bencode implementation problems I found in faq and reported in #93, I threw myself into the deep end of the pool and wrote a modular Bencode decoder/encoder for jq that implements alternative string length algorithms, which proved interesting although pretty mind-wracking (or wrecking even). To anyone who might need it, the code in question is here.
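For anyone wondering how the three length conventions differ in practice, here is a small Python sketch. It is not the jq code mentioned above; the function names and `unit` values are invented for illustration, and it assumes the payload itself is still written as UTF-8 in every variant.

```python
def string_length(text: str, unit: str) -> int:
    """Return the bencode length prefix for text under one convention."""
    if unit == "bytes":        # UTF-8 byte count (the standard reading)
        return len(text.encode("utf-8"))
    if unit == "codepoints":   # Unicode code point count
        return len(text)
    if unit == "utf16":        # UTF-16 code units, 2 bytes each in UTF-16-LE
        return len(text.encode("utf-16-le")) // 2
    raise ValueError(f"unknown unit: {unit}")


def encode_string(text: str, unit: str = "bytes") -> bytes:
    """Encode one string with the chosen length convention (payload as UTF-8)."""
    prefix = str(string_length(text, unit)).encode("ascii")
    return prefix + b":" + text.encode("utf-8")


grave = "\u00e0"         # "à": inside the Basic Multilingual Plane
emoji = "\U0001F600"     # outside the BMP, needs a surrogate pair in UTF-16
lengths = {
    unit: (string_length(grave, unit), string_length(emoji, unit))
    for unit in ("bytes", "codepoints", "utf16")
}
```

Characters inside the Basic Multilingual Plane give the same count for code points and UTF-16 code units, so the UTF-16 convention only becomes distinguishable once a character outside it (such as an emoji) appears.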

trantor, Aug 28 '21 16:08