
Binary strings

Open nicowilliams opened this issue 1 year ago • 89 comments

In the past I've wanted to support binary blobs by treating them as arrays of small integers. I started a small experiment today and it looks to me like adding a sub-type of string that is binary and behaves like a string is much more natural than a sub-type of string that behaves like an array, especially if we were to have the ability to use .[] to iterate (which would give us a streaming version of explode).

The goal is to be able to work with (a) binary, non-text data, and (b) mangled UTF-8, such as WTF-8. As an example of (a), one could try to write a codec for CBOR and other binary JSON formats, or ASN.1 DER, or protocol buffers, or flat buffers, etc.

I'd like to add the fewest possible command-line options, possibly none.

So here's the rough idea, which this PR currently only sketches:

  • binary data should be a sub-type of string
  • maybe there should be multiple sub-types of string where the sub-type denotes a) the kind of content, b) what to do on output:
    • we could have binary that is output in base64
    • we could have binary that is an error to output
    • we could have WTF-8 that is output as-is
    • we could have @Maxdamantus' WTF-8b that attempts to encode 0x80-0xff as overlong UTF-8 sequences
  • add tobinary/1 which makes a binary out of a stream of bytes that will be an error to output if it's not valid UTF-8
  • add tobinary/0 which makes a binary out of a string (this may seem silly, but .[] on strings should output a stream of Unicode codepoints, while .[] on binary strings should output a stream of bytes)
  • add an encodeas/1 which sets the encoding for the given value (currently only for strings and binary strings) to one of "UTF-8", "base64", or "bytearray"
  • add encoding/0 which outputs the output encoding of its input string/binary value
  • make tostring/0 work with binary strings of all types doing the usual bad codepoint replacement thing
  • add a family of builtins that are like input and inputs, but which let one read raw inputs, JSON w/ WTF-8, etc.
  • add a command-line option(s) for input forms
    • one option to read raw input as binary
    • one option to read raw input as binary delimited by some byte value
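
To make the codepoint/byte split concrete, here is a rough Python analogy for the iteration semantics sketched above (the jq builtins themselves are still hypothetical):

```python
# Iterating a text string yields codepoints; iterating its binary
# (UTF-8 byte) form yields bytes -- the split proposed for .[] above.
s = "fooÿ"                       # 4 codepoints, but 5 bytes in UTF-8
codepoints = [ord(c) for c in s]
assert codepoints == [102, 111, 111, 255]

b = s.encode("utf-8")            # the "binary string" flavor
byte_values = list(b)
assert byte_values == [102, 111, 111, 195, 191]   # ÿ is 0xC3 0xBF in UTF-8
```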

The current state of this PR is pretty poor -- just a sketch, really. Here's the TODO list:

  • ~[ ] meld with the JVP_FLAGS thing done for numbers?~
    • ~[ ] make string kinds (UTF-8, binary) and output encoding flags (base64, array of bytes, ...) JVP_FLAGS, or~
    • ~[ ] move JVP_FLAGS to the pad_ char field of jv that would now be called flags or subkind~
  • [x] add jv_binary_*() functions
  • [x] add a jv_get_string_kind()
  • [x] let jv_string_concat() and others work with binary
  • [x] let .[] iterate the codepoints in a string
  • [x] let .[] iterate the bytes in a binary string
  • ~[x] let .[$index] address the $indexth codepoint in . if it's a string~ (see commentary below)
  • [x] let .[$index] address the $indexth byte in . if it's a binary blob
  • [x] implement JSON encoder options
    • [x] base64 (encodeas("base64"), the default)
    • [x] hex (encodeas("hex"))
    • [x] array of byte values (encodeas("bytearray"))
      • [x] properly indent these arrays (currently they're always compact)
    • [x] convert to UTF-8 with bad character mappings
    • [x] no encoding in --raw-output-binary mode
    • ~[ ] WTF-8?~ (punt for now)
    • ~[ ] WTF-8b?~ (punt for now)
    • ~[ ] other encodings?~ (punt for now)
  • [x] support flattening of arrays of bytes and arrays of arrays of bytes (e.g., [0,[[[1,2],3],4],5]) when converting to binary
  • [x] add stringtype/0
  • [x] add tobinary/0
  • [x] add encodeas/1
  • [x] add encoding/0
  • [x] add tobinary/1
  • ~[ ] add towtf8/1~
  • ~[ ] add towtf8/0~ (punt for now)
  • ~[ ] add tobase64/0~ (punt for now)
  • [x] @base64d base64 decoder should produce binary as if by tobinary_utf8
  • ~[ ] add a frombase64/0 that only produces binary to avoid having to check if the result is valid UTF-8~
  • [x] make tostring/0 accept binary strings and do bad codepoint replacement as usual
  • ~[ ] add a family of functions like input and inputs, but w/ caller control over the input formats (this is pretty ambitious, possibly not possible)~ (let's leave this for later)
  • ~[ ] add binary literal forms (b"<hex-encoded>", b64"<base64-encoded>")? (not strictly needed, since one could use "<base64-encoded>"|frombase64 or some such, and we could even make the compiler constant-fold that)~ (let's leave this for later)
  • [x] --raw-output-binary mode
  • [ ] --raw-input-mode BLOCK-SIZE mode that produces binary input and inputs, with the default output encoding, reading raw binary strings of up to BLOCK-SIZE bytes (and if --slurp is given, concatenate all the blocks and run the jq program on the one slurped input)
  • [x] add docs
    • [x] add mantests of tobinary and encodeas
    • [x] add mantests of string codepoint iteration and indexing, and binary string byte iteration and indexing
  • [x] add shtests

Questions:

  • is this a bad idea?
  • is .[] for strings a bad idea? (A: Apparently yes. See commentary below.) EDIT: We already have string slicing. Adding string indexing and iteration seems to complete the picture.
  • what's missing?
    • A binary input mode.

nicowilliams avatar Jul 20 '23 03:07 nicowilliams

adding a sub-type of string that is binary and behaves like a string is much more natural than a sub-type of string that behaves like an array

I'm concerned that the former approach (making blobs behave like strings) will be troublesome or confusing, basically because "strings" are already troublesome enough. (JSON? Raw? Sequences of codepoints? Valid UTF-8? Invalid?)

Consider for example .[]. You have envisioned string[] as yielding a stream of strings, which is useful and intuitive, but if binary data is represented in a way that makes it "behave like a string", wouldn't that mean that however blob[] is defined, it will be problematic in one way or another? If it iterates bytes as you envision, then string[] would have to iterate the codepoints.

Well, maybe that wouldn't be so bad, but consider another example: length. The length of a blob would surely just be the length of the corresponding array.

If the main goal of supporting blobs is to have a highly compact way of storing large arrays of small integers in a way that allows for various operations and transformations to be implemented efficiently, then your original intuition seems to me correct.

No doubt I'm missing something important, but this does seem like a fine opportunity to plug for string[i] as shorthand for string[i: i+1] :-)

pkoppstein avatar Jul 20 '23 06:07 pkoppstein

add binary literal forms (b"<hex-encoded>", b64"<base64-encoded>")? (not strictly needed, since one could use "<base64-encoded>"|frombase64 or some such, and we could even make the compiler constant-fold that)

As a possible alternative to the hex notation, I think supporting "\xHH" notation in string literals would be useful (I didn't want to add it in my PR because I wanted to avoid adding new features). I suspect it shouldn't be allowed in actual JSON string literals (since the notation is not allowed in JSON [0]), but it could be allowed in jq string literals.

[0] Though I have wondered if it could make sense to have a flag to allow it and also to emit it, instead of emitting the illegal UTF-8 bytes. This is actually my biggest gripe against JSON: the fact that it has a notation for representing arbitrary sequences of UTF-16 code units but not arbitrary sequences of UTF-8 code units. Perhaps with some hindsight, there could have been an expectation for UTF-16 systems to interpret "\xHH" sequences as UTF-8 just as UTF-8 systems interpret "\uHHHH" sequences as UTF-16.
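
Python's json module illustrates the asymmetry described here: JSON's escape syntax can spell an arbitrary (even unpaired) UTF-16 code unit, but there is no escape that names an arbitrary UTF-8 byte:

```python
import json

# "\uD800" is a lone UTF-16 surrogate -- JSON's escape syntax permits it.
s = json.loads('"\\ud800"')
assert s == "\ud800"

# But the result is not encodable as valid UTF-8, and JSON has no "\xHH"
# escape that could name a raw UTF-8 byte directly.
try:
    s.encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```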

Maxdamantus avatar Jul 20 '23 09:07 Maxdamantus

what's missing?

While tobinary would be catering to 8-bit string processing (where the indexing operations work as in C, Go, Rust), it might also be worth adding something that caters to 16-bit string processing (where the indexing operations work as in JavaScript or Java). This would probably be a matter of adding tobinary16 in parallel (tobinary could actually be tobinary8 if we want to be extremely clear).

("💩" | length) == 1 # like in Python
("💩" | tobinary8 | length) == 4 # like in C
("💩" | tobinary16 | length) == 2 # like in JavaScript
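
The three length conventions can be checked in Python, which counts codepoints for str and bytes for an encoded form (UTF-16 code units via an explicit encode):

```python
s = "💩"
assert len(s) == 1                            # codepoints, like Python
assert len(s.encode("utf-8")) == 4            # bytes, like C, Go, Rust
assert len(s.encode("utf-16-le")) // 2 == 2   # 16-bit code units, like JavaScript
```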

Maxdamantus avatar Jul 20 '23 10:07 Maxdamantus

adding a sub-type of string that is binary and behaves like a string is much more natural than a sub-type of string that behaves like an array

I'm concerned that the former approach (making blobs behave like strings) will be troublesome or confusing, basically because "strings" are already troublesome enough. (JSON? Raw? Sequences of codepoints? Valid UTF8? Invalid?)

I'm not following.

Consider for example .[]. You have envisioned string[] as yielding a stream of strings, which is useful and intuitive, but if binary data is represented in a way that makes it "behave like a string", wouldn't that mean that however blob[] is defined, it will be problematic in one way or another? If it iterates bytes as you envision, then string[] would have to iterate the codepoints.

Yes, $string[] should, will, and in this draft PR as it stands now does indeed iterate codepoints -- just like explode, but streaming.

$blob[] would iterate bytes.

[$blob[]] and [$string[]] would be like explode.

Well, maybe that wouldn't be so bad, but consider another example: length. The length of a blob would surely just be the length of the corresponding array.

The length of a blob would be the number of bytes in it, not codepoints or anything else. A binary datum being, colloquially, an array of bytes, counting bytes is the only natural thing to do.

If the main goal of supporting blobs is to have a highly compact way of storing large arrays of small integers in a way that allows for various operations and transformations to be implemented efficiently, then your original intuition seems to me correct.

Certainly there's nothing unnatural about representing blobs as arrays of bytes. But I think there's nothing unnatural about representing them as non-UTF-8 strings too, and in terms of what I would have to do to src/jv.c, I think the latter is better than the former.

Ah, that's another thing, we currently have string slice syntax "foo"[0:1] ("f"), but we have neither string iteration ("foo"[]) nor string indexing ("foo"[1]). Indexing of strings would have to be as for string slices: by codepoint number, not by byte number. Indexing of binary blobs would have to be by byte number.

If we had .[], .[$index], and .[$start:$end] for strings and blobs then they would feel a lot like arrays. The only thing is that iterating/indexing strings cannot be a path expression (and in the current state of this PR it happens to not be a path expression, so that's good and done).

So I think simply adding iteration and indexing support for strings and blobs is enough to get the semantics I'd originally had in mind for binary as arrays of small integers, but with the benefit that there would be no concerns like "what happens if you have a binary (array of bytes) and try to append or set a value that is not a byte value?".

Also, thinking about it, representing binary blobs as arrays of bytes would have presented difficulties w.r.t. path expressions. Since string iteration/indexing wouldn't contribute to path expressions, I now think it's more natural to represent binary as a sub-type of strings. Also, in other languages binary is typically string-like, at least as to literal value syntax.
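
For comparison, Python draws the same line for indexing: str indexes by codepoint, bytes by byte, and slices work on both (a sketch of the semantics discussed here, not jq code):

```python
s = "héllo"
assert s[1] == "é"               # string indexing is by codepoint
assert s[0:2] == "hé"            # string slicing likewise

b = s.encode("utf-8")            # the binary flavor
assert b[1] == 0xC3              # binary indexing is by byte (é is 0xC3 0xA9)
assert b[1:3] == "é".encode("utf-8")
```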

No doubt I'm missing something important, but this does seem like a fine opportunity to plug for string[i] as shorthand for string[i: i+1] :-)

Yes! I was missing that. I'll add it.

nicowilliams avatar Jul 20 '23 16:07 nicowilliams

what's missing?

While tobinary would be catering to 8-bit string processing (where the indexing operations work as in C, Go, Rust), it might also be worth adding something that caters to 16-bit string processing (where the indexing operations work as in JavaScript or Java). This would probably be a matter of adding tobinary16 in parallel (tobinary could actually be tobinary8 if we want to be extremely clear).

("💩" | length) == 1 # like in Python
("💩" | tobinary8 | length) == 4 # like in C
("💩" | tobinary16 | length) == 2 # like in JavaScript

UTF-16 is proof that time machines don't exist (or at least it very strongly suggests that they don't, never will, are/will be too expensive to use, or that fear of paradoxes will limit their use to mere observation).

UTF-16 needs to die in a fire, and if jq not supporting it helps it die, so much the better!

Now, more seriously, if we had a byteblob binary type, we would also then be able write jq code that uses that to implement UTF-16. Having a string sub-type that is UTF-16 might have some value, but I would like first to get experience with byte blobs before we add UTF-16 support.

nicowilliams avatar Jul 20 '23 16:07 nicowilliams

As a possible alternative to the hex notation, I think supporting "\xHH" notation in string literals would be useful (I didn't want to add it in my PR because I wanted to avoid adding new features). I suspect it shouldn't be allowed in actual JSON string literals (since the notation is not allowed in JSON [0]), but it could be allowed in jq string literals.

Indeed, I'm not interested in innovating in JSON. Having participated in IETF threads with thousands of posts about publishing RFC 7159, I'm not inclined to believe that we could alter JSON to support binary, and I do not relish the thought of repeating that experience.

[0] Though I have wondered if it could make sense to have a flag to allow it and also to emit it, instead of emitting the illegal UTF-8 bytes. This is actually my biggest gripe against JSON: the fact that it has a notation for representing arbitrary sequences of UTF-16 code units but not arbitrary sequences of UTF-8 code units. Perhaps with some hindsight, there could have been an expectation for UTF-16 systems to interpret "\xHH" sequences as UTF-8 just as UTF-8 systems interpret "\uHHHH" sequences as UTF-16.

With string sub-types indicating output options we could certainly allow oddball, not-quite-JSON formats like JSON w/ WTF-8, but for true binary I am only interested in either emitting errors or auto-base64-encoding for now. Eventually something like WTF-8b would indeed allow encoding of binary as something very close to UTF-8, if not actually UTF-8 (like, if we used private use codepoints to represent the WTF-8 encoding of broken surrogates then the result could be true UTF-8 rather than WTF-8). But even here we'd be stepping on the Unicode Consortium's toes -- it would be much much better, but also much much harder, to get the UC to allocate 128 codepoints for this purpose and then define something like a Unicode encoding of binary data.

So you can see I'm reluctant to innovate on the JSON side and the Unicode fronts. I'm not resolutely opposed to it though: we could have command-line options to enable these for input/output, and we could label them experimental. But I'd like to get something a bit more standards-compliant done first.

nicowilliams avatar Jul 20 '23 17:07 nicowilliams

I now see that this approach and my old "binary as array of small integers" idea are... remarkably similar. The differences are:

  • what type is reported by type ("array" vs "string")
  • how binary blobs are encoded on output (array of bytes vs several options possibly including array of bytes)

As long as we add .[] and .[$index] for both, UTF-8 strings and binary strings, binary blobs as array of bytes or binary blobs as strings will work very much the same way.

nicowilliams avatar Jul 21 '23 02:07 nicowilliams

... remarkably similar.

Hmmm. That's largely what I was trying to say :-)

But let me outline two radical variations of the "blob as array of bytes" idea.

For brevity, I'll use $ab to signify a JSON array of integers in range(0;256).

The two variants are:

  1. jq adopts a convention such as identifying JSON objects having the form {"class": "blob", "value": $ab} with elements of what we can think of as "class blob". This would allow for efficient handling of blobs, and provide a model for handling of non-JSON "types" in future.
  2. jq manages a quasi-hidden "is-a-blob" flag on arrays, and provides a bunch of new filters, e.g. for reporting whether an array is an $ab, and for transforming an $ab to other representations. (Many of these new filters would raise an error, e.g. if the input is expected to be an $ab but isn't, or if there's something about the $ab that prevents the requested transformation.)

Of course, both techniques can be used if one wants to support both binary8 and binary16.

pkoppstein avatar Jul 21 '23 07:07 pkoppstein

But even here we'd be stepping on the Unicode Consortium's toes -- it would be much much better, but also much much harder, to get the UC to allocate 128 codepoints for this purpose and then define something like a Unicode encoding of binary data.

It's not really possible to correctly do it this way. The only correct way to encode invalid Unicode in such a way that valid Unicode is passed through unchanged (ie, all valid UTF-8 strings have the same meaning in WTF-8) is to encode the ill-formed Unicode sequences into ill-formed Unicode sequences.

WTF-8 works by encoding ill-formed UTF-16 (unpaired surrogates) into invalid[0] UTF-8 (invalid "generalised UTF-8" encodings of UTF-16 surrogate code points). Any valid UTF-16 already has a corresponding valid UTF-8 encoding, and vice versa—these encodings can't be reused.

The "WTF-8b" extension additionally encodes ill-formed UTF-8 bytes as other invalid UTF-8 bytes. This includes all WTF-8-specific and WTF-8b-specific sequences (it's fundamentally not possible for this process to be idempotent, since it should not be possible to generate encoded UTF-16 errors from UTF-8 binary data).

If ill-formed Unicode is encoded as valid Unicode, it won't be distinguishable from previously valid Unicode. It would be particularly incorrect to emit invalid Unicode in response to certain valid Unicode (eg, text that happens to contain these 128 hypothetical code points—they would still be Unicode scalar values, so they can appear in valid UTF-8 or UTF-16 text ... or binary data that just happens to look like such UTF-8 text).

[0] I'm distinguishing here between "ill-formed" and "invalid", where "invalid" bytes would be an ill-formed sequence that never occurs as a substring of valid Unicode text—these forms are particularly useful in WTF-8/WTF-8b since they can not be generated accidentally through string concatenation
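
Python's surrogateescape error handler (PEP 383) is an existing example of this principle: ill-formed UTF-8 bytes are encoded as other ill-formed Unicode (unpaired low surrogates), so valid text passes through unchanged and the original bytes round-trip losslessly:

```python
raw = b"foo\xff"                 # 0xFF is an ill-formed UTF-8 byte
s = raw.decode("utf-8", errors="surrogateescape")
assert s == "foo\udcff"          # error byte -> unpaired surrogate U+DCFF
assert s.encode("utf-8", errors="surrogateescape") == raw   # lossless round-trip
```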

Maxdamantus avatar Jul 21 '23 08:07 Maxdamantus

It's not really possible to correctly do it this way. The only correct way to encode invalid Unicode in such a way that valid Unicode is passed through unchanged (ie, all valid UTF-8 strings have the same meaning in WTF-8) is to encode the ill-formed Unicode sequences into ill-formed Unicode sequences.

That works provided other systems understand it. Encoding non-UTF-8 as valid UTF-8 with special codepoints also only works if other systems understand how to decode that, but it has the advantage that other systems that do not know how to decode it will pass it through unmolested.

nicowilliams avatar Jul 21 '23 14:07 nicowilliams

That works provided other systems understand it

I think the purpose of any such encoding should only be for internal use. Except for debugging purposes, it should preferably not be possible to observe the internal byte representation of these strings.

I think a reasonable way of exposing the error bytes/surrogates would be as negative code points when iterating, eg: ("foo\xFF\uD800💩" | explode_raw) == [102, 111, 111, -255, -55296, 128169] This way the errors can be detected with a simple . < 0 check, and they can also be passed as-is to an inverse implode_raw operation. Come to think of it, maybe my PR should also be doing this using the internal WTF-8b iteration function (it currently only uses negative code points for denoting the UTF-8 errors, not the UTF-16 errors).
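
A minimal Python sketch of this negative-codepoint convention (explode_raw is hypothetical; surrogateescape stands in for WTF-8b's error-byte encoding):

```python
def explode_raw(s):
    """Hypothetical sketch of the explode_raw semantics proposed above:
    valid scalar values come out non-negative; error bytes and unpaired
    surrogates come out negative, so a simple `. < 0` check finds them."""
    out = []
    for ch in s:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:       # surrogateescape'd UTF-8 error byte
            out.append(-(cp - 0xDC00))   # recover the original byte, negated
        elif 0xD800 <= cp <= 0xDFFF:     # unpaired UTF-16 surrogate
            out.append(-cp)
        else:
            out.append(cp)
    return out

# Reproduces the example above: "foo\xFF\uD800💩"
s = b"foo\xff".decode("utf-8", errors="surrogateescape") + "\ud800💩"
assert explode_raw(s) == [102, 111, 111, -255, -55296, 128169]
```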

Maxdamantus avatar Jul 21 '23 14:07 Maxdamantus

@leonid-s-usov I'm trying to understand the jvp flags thing you added. What is the intent regarding adding new flags? Why not use the pad_ field for flags and leave the kind field alone?

nicowilliams avatar Jul 21 '23 15:07 nicowilliams

1. jq adopts a convention such as identifying JSON objects having the form {"class": "blob", "value": $ab} with elements of what we can think of as "class blob".  This would allow for efficient handling of blobs, and provide a model for handling of non-JSON "types" in future.

jq will only ever support JSON types -- new value types can't be added because they can't be represented in JSON. jq could add a typing mechanism that amounts to JSON schemas for data, and maybe typing for jq functions too (so we could do typechecking), but this is all way beyond the scope of this PR, and I don't think binary data support should wait for any of that.

If we had to have a notion of "class" I'd do it a bit like Perl5: provide a way to bless a JSON object (and maybe arrays too) with a "class" and add a way to find out the class of one, but with the addition of a JSON schema and validation. Obviously a lot can be debated there, but definitely jq cannot add new value types.

2. jq manages a quasi-hidden "is-a-blob" flag on arrays, and provides a bunch of new filters, e.g. for reporting whether an array is an $ab, and for transforming an $ab to other representations. (Many of these new filters would raise an error, e.g. if the input is expected to be an $ab but isn't, or if there's something about the $ab that prevents the requested transformation.)

String or array makes little difference now, but I much prefer string now. Again, the only real difference now would be what type reports. Internally (i.e., in the jv API, and in the implementation of the EACH, EACH_OPT, INDEX, and INDEX_OPT instructions) though I am now convinced that binary should be a flavor of string not of array.

Maybe someone can make a convincing argument that allowing .[] and .[idx] for strings breaks backwards compatibility seriously enough that we shouldn't have type for binary return "string". Certainly one could make an argument that it breaks backwards compatibility. For example one could use the fact that "string"[] raises an error as a way to check whether the type of a value is an iterable, but I wouldn't find that example convincing because we do provide type.

Of course, both techniques can be used if one wants to support both binary8 and binary16.

There's no reason that 16-bit word strings couldn't be a flavor of "string" too. If it's UTF-16 then it can be converted to UTF-8 on output. If it's not UTF-16 then it can be base64-encoded or encoded as an array of 16-bit unsigned integers just like 8-bit binary.

So the only arguments I see here are about a) which is more natural for type to report for a binary blob ("string" or "array"), and b) whether it's OK to add .[] and .[index] for values whose type is "string". I don't think (a) is very interesting but I now prefer the answer to be "string" if nothing else because .[] and .[index] on strings will not be path expressions, but they are path expressions for array value inputs, and it'd be rather strange to have some arrays for which they are not path expressions. I do think (b) is mildly interesting, but I don't have examples of how adding .[] and .[index] for strings would be an unacceptable change.

nicowilliams avatar Jul 21 '23 16:07 nicowilliams

@nicowilliams wrote:

jq will only every support JSON types

Precisely. That's the whole point of my two variations. You used the term "flavor", so by all means go with that if you prefer.

To summarize: The first variation basically involves new filters and a convention about JSON objects. These can both be ignored entirely by the user; and if the user ignores them, there will be no impact on the user.

The second variation is even less visible, as there is no convention, just some new filters and some behind-the-scenes stuff.

pkoppstein avatar Jul 21 '23 21:07 pkoppstein

@pkoppstein you might want to kick the tires on this. It's starting to be usable!

: ; ./jq -cn '"foob"|tobinary|[type,stringtype]'
["string","binary"]
: ; ./jq -cn '"foob"|tobinary|256+.'
jq: error (at <unknown>): number (256) and string ("Zm9vYg==") cannot be added
: ; ./jq -cn '"foob"|tobinary|.+256'
jq: error (at <unknown>): string ("Zm9vYg==") and number (256) cannot be added because the latter is not a valid byte value
: ; ./jq -cnr '"foob"|tobinary|.+255' | base64 -d | od -t x1
0000000 66 6f 6f 62 ff
0000005
: ; ./jq -cnr '"foob"|tobinary_bytearray|.+255'
[102,111,111,98,255]
: ; ./jq -cnr '"foob"|tobinary_utf8|.+255'
foob�
: ; ./jq -cn '"foob"|tobinary_utf8'
"foob"
: ; ./jq -cn '["foob"|tobinary|(.+255)[]]'
["f","o","o","b","ÿ"]
: ; ./jq -cn '["foob"|tobinary|tostring[]]'
["Z","m","9","v","Y","g","=","="]

Conversions to base64, byte array, or UTF-8 (w/ bad character mapping) happen on output or on tostring.
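
The output encodings shown in the transcript can be reproduced in Python for the same value (base64 "Zm9vYv8=" here differs from "Zm9vYg==" above because this value includes the appended 0xFF byte):

```python
import base64

blob = b"foob\xff"               # "foob" with 255 appended, as in the transcript
assert base64.b64encode(blob).decode() == "Zm9vYv8="          # encodeas("base64")
assert blob.hex() == "666f6f62ff"                             # encodeas("hex")
assert list(blob) == [102, 111, 111, 98, 255]                 # encodeas("bytearray")
assert blob.decode("utf-8", errors="replace") == "foob\ufffd" # tostring's bad-codepoint replacement
```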

nicowilliams avatar Jul 21 '23 22:07 nicowilliams

I might punt on WTF-8 and let @Maxdamantus implement that on top of this when this is done :)

nicowilliams avatar Jul 21 '23 22:07 nicowilliams

I'm getting three compilation errors:

Try again now?

nicowilliams avatar Jul 22 '23 02:07 nicowilliams

Try again now

Yay! [Unless you strenuously object, I propose deleting completely useless and ephemeral messages in this thread (and potentially others, too).]

I noticed that you're proposing to extend + to allow both:

tobinary|.+255   #1 

and

tobinary_bytearray|.+255 #2

The other day, you were warning about the perils of polymorphism, so I'm a bit concerned about both for that kind of reason; more particularly, though, since you want "binary" to be string-like, you'd expect something like:

tobinary | . + ([255]|tobinary)  #1'

or at least:

tobinary | . + (255|tobinary)  #1''

More importantly, #2 seems quite wrong from a jq-perspective: since a bytearray prints as an array of integers, one would expect to have to write:

tobinary_bytearray|.+[255]   #2'

pkoppstein avatar Jul 22 '23 03:07 pkoppstein

Kicking the tires...

What might be done about the proliferation of unwieldy names?

Since you've introduced tobinary and allow tobinary|tostring, there's also an element of inconsistency with having to write tobinary_bytearray and tobinary_utf8 (with more to come?).

Agreed, "tobytearray" and "toutf8" are unreadable at best and unacceptable at worst, so I was wondering what alternatives there might be. An underscore? camelCase? Or better, something with a tad more extensibility, such as defining to/1 so we'd write to("bytearray") or to("utf8"), etc.

pkoppstein avatar Jul 22 '23 03:07 pkoppstein

Try again now

Yay! [Unless you strenuously object, I propose deleting completely useless and ephemeral messages in this thread (and potentially others, too).]

I noticed that you're proposing to extend + to allow both:

tobinary|.+255   #1 

Yes. Maybe that should be addition of binary and array of numbers. But it was easier to code binary and number.

NOTE: Addition of non-binary string and number is still not allowed.

and

tobinary_bytearray|.+255 #2

The other day, you were warning about the perils of polymorphism, so I'm a bit concerned about both for that kind of reason; more particularly, though, since you want "binary" to be string-like, you'd expect something like:

Indeed! I've been wondering whether to add a function to append bytes to binary strings. Maybe we should start with a function now and add an operator later. Or maybe we could introduce || as an operator for this (but, that would be confusable, so best not), or something else.

tobinary | . + ([255]|tobinary)  #1'

or at least:

tobinary | . + (255|tobinary)  #1''

More importantly, #2 seems quite wrong from a jq-perspective: since a bytearray prints as an array of integers, one would expect to have to write:

tobinary_bytearray|.+[255]   #2'

Agreed.

This is the kind of kicking the tires I was looking for, thank you.

nicowilliams avatar Jul 22 '23 04:07 nicowilliams

@nicowilliams -

With regard to all the names you're introducing, perhaps it would be worthwhile stepping back a moment to think about other contexts where one might want to add "flavors". Here are two such contexts:

(1) integers (as a flavor of "number")

(2) complex numbers (as a flavor of "object", having in mind a schema such as {r, i})

So ... rather than stringtype, numbertype, and objecttype, we could have some arity-0 filter to emit the "flavor" (e.g. "binary", "integer", "complex") and generally the "least superclass".

Some names that come to mind for this arity-0 filter are: isa, supertype, owner, flavor, ...

pkoppstein avatar Jul 22 '23 04:07 pkoppstein

Agreed, "tobytearray" and "toutf8" are unreadable at best and unacceptable at worst, so I was wondering what alternatives there might be. An underscore? camelCase? Or better, something with a tad more extensibility, such as defining to/1 so we'd write to("bytearray") or to("utf8"), etc.

My preference would probably be to use tostring8 (for binary data that gets printed as text (like a normal string), but iterates over 8-bit code units (bytes)) and tobinary8 (for binary data that gets printed as base64, but iterates over 8-bit code units (bytes)).

If in the future we want to add string indexing that works like in JavaScript/Java, that would naturally involve adding a tostring16 function (for binary data that gets printed as text (like a normal string), but iterates over 16-bit code units). Admittedly, the analogy seems to end there, because I'm not aware of a base64 analogue for 16-bit data (so there's no obvious meaning for a tobinary16 function).

To be honest, I'm not sure what the purpose of the distinct tobinary_bytearray function is, since it seems like it would be equivalent to tobinary | .[] or tostring8 | .[]. If it's meant to be an optimised version of these, would it not be possible to handle the .[] operation as a special case on these binary types?

Maxdamantus avatar Jul 22 '23 04:07 Maxdamantus

@nicowilliams -

With regard to all the names you're introducing, perhaps it would be worthwhile stepping back a moment to think about other contexts where one might want to add "flavors". Here are two such contexts:

(1) integers (as a flavor of "number")

More flavors == more branches. For binary I think the branches I'm adding are not a big deal. Though let's face it, no one will be doing numerical analysis with jq...

(2) complex numbers (as a flavor of "object", having in mind a schema such as {r, i})

Possible.

So ... rather than stringtype, numbertype, and objecttype, we could have some arity-0 filter to emit the "flavor" (e.g. "binary", "integer", "complex") and generally the "least superclass".

Possibly.

Some names that come to mind for this arity-0 filter are: isa, supertype, owner, flavor, ...

I'm a bit leery of limiting our future ability to add better types (always as schema on top of JSON types).

nicowilliams avatar Jul 22 '23 04:07 nicowilliams

What might be done about the proliferation of unwieldy names?

Since you've introduced tobinary and allow tobinary|tostring, there's also an element of inconsistency with having to write tobinary_bytearray and tobinary_utf8 (with more to come?).

Where's the inconsistency? They're all strings.

Agreed, "tobytearray" and "toutf8" are unreadable at best and unacceptable at worst, so I was wondering what alternatives there might be. An underscore? camelCase? Or better, something with a tad more extensibility, such as defining to/1 so we'd write to("bytearray") or to("utf8"), etc.

Yes, I had a similar thought, except I greatly dislike camel case and would much rather use underscores. A /1 would fit with what's in jq already, though under the covers I'd still have a bunch of /0s.

nicowilliams avatar Jul 22 '23 04:07 nicowilliams

My preference would probably be to use tostring8 (for binary data that gets printed as text (like a normal string), but iterates over 8-bit code units (bytes)) and tobinary8 (for binary data that gets printed as base64, but iterates over 8-bit code units(bytes)).

I'm not opposed to the 8 suffix. Might as well call it tou8 then and save some typing and resemble recent trends in programming languages. EDIT: Er, toi8, not tou8.

If in the future we want to add string indexing that works like in JavaScript/Java, that would naturally involve adding a tostring16 function (for binary data that gets printed as text (like a normal string), but iterates over 16-bit code units). Admittedly, the analogy seems to end there, because I'm not aware of a base64 analogue for 16-bit data (so there's no obvious meaning for a tobinary16 function).

A tobinary16 would produce a string of 16-bit code units that is base64-encoded when output by jq to stdout/stderr, but that during the jq program's execution is actually a string of 16-bit code units, just as tobinary is now for 8-bit code units. The base64 thing is just about the representation when serializing to JSON and when applying tostring to a binary.

To be honest, I'm not sure what the purpose of the distinct tobinary_bytearray function is, since it seems like it would be equivalent to tobinary | .[] or tostring8 | .[]. If it's meant to be an optimised version of these, would it not be possible to handle the .[] operation as a special case on these binary types?

Yes, the byte array representation can be produced by just doing [$some_binary[]] -- having that happen automatically on output is just an optimization, an illusion: the whole time the string is just a buffer of bytes, not an array of numeric jvs. With sizeof(jv) being 16, not converting binary strings to byte arrays until they must be output is quite an optimization!
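A small Python sketch of the equivalence being described (Python as a stand-in, since the proposed builtins aren't in any released jq): iterating the raw byte buffer yields the same list that the byte-array output form would show.

```python
# Python stand-in for the proposed jq behaviour: iterating a binary
# string yields its bytes; collecting them gives the byte-array form.
poop = "💩".encode("utf-8")       # the binary: a compact byte buffer
byte_array = list(poop)           # analogue of [$some_binary[]]
assert byte_array == [240, 159, 146, 169]
```

The buffer itself stays compact; the array form materializes one boxed value per byte, which is why deferring the conversion until output is worthwhile.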

nicowilliams avatar Jul 22 '23 04:07 nicowilliams

@nicowilliams wrote:

Since you've introduced tobinary and allow tobinary|tostring, there's also an element of inconsistency with having to write tobinary_bytearray and tobinary_utf8 (with more to come?).

Where's the inconsistency? They're all strings.

Following the tobinary|tostring model, you'd allow tobinary|tobytearray.

Following the tobinary_bytearray model, you'd allow tobinary_string or some such.

Yes, I realize that you can justify the (apparent) inconsistency by distinguishing between JSON types and jq flavors, but that still leaves cumbersomeness.

In any case, using to/1 for jq flavors (and why not JSON types as well?) would have its advantages.

pkoppstein avatar Jul 22 '23 05:07 pkoppstein

A tobinary16 would produce a string of 16-bit code units which when output by jq to stdout/stderr would be base64-encoded, but which during the jq program's execution would actually be a string of 16-bit code units, just like tobinary is now for 8-bit code units. The base64 thing is just about the representation when serializing to JSON and when applying tostring to a binary.

Ah right, I guess base64 could still make sense there, though I would expect it would be a base64 encoding of the 16-bit data, rather than the 8-bit data (meaning there would be a question about UTF-16BE vs. UTF-16LE), e.g.:

("a\uD800" | tobinary16 | tojson | fromjson) == "YQAA2A=="
("a\x00\x00\xD8" | @base64) == "YQAA2A=="

Note that there is no particular encoding of "\uD800" as bytes, so it would have to be turned into a replacement character when converting from 8-bit data to base64. In the 16-bit example above, "\uD800" is represented as <00 D8> in UTF-16LE, so the data is preserved.
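The base64 value quoted here can be checked with a short Python sketch (Python as a stand-in, since tobinary16 is hypothetical): encoding "a\uD800" as UTF-16LE and base64-encoding the result reproduces "YQAA2A==", and a lone surrogate indeed has no UTF-8 encoding.

```python
import base64

# "a" plus the unpaired surrogate U+D800; surrogatepass lets Python
# emit the lone surrogate's code unit as raw UTF-16LE.
u16 = "a\ud800".encode("utf-16-le", errors="surrogatepass")
assert u16 == b"a\x00\x00\xd8"

# base64 of the 16-bit data matches the value quoted above.
assert base64.b64encode(u16).decode("ascii") == "YQAA2A=="

# A lone surrogate has no valid UTF-8 encoding, hence the need for a
# replacement character in the 8-bit case.
try:
    "a\ud800".encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```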

Maxdamantus avatar Jul 22 '23 05:07 Maxdamantus

I think I might go for tobinary is as in this PR right now, and add a encodeas/1 to replace the other tobinary_*.

nicowilliams avatar Jul 22 '23 16:07 nicowilliams

2.5 cheers for encodeas!

But the manual.yml entry for encodeas/1 is not quite right.

The current text reads:

      This function sets the encoding of any binary string input
      to the given `$encoding`, which must be one of `"UTF-8"`
      (apply bad character mappings), `"$base64"` (encode binary
      in base64), or `"$bytearray"` (encode binary as an array of
      unsigned byte values).  The result will be encoded
      accordingly when when passed to `tostring` or when finally
      output by jq to `stdout` or `stderr`.

(1) The "$" in "$base64" and "$bytearray" should be removed:

$ ./jq -n '"abc" | tobinary | encodeas("bytearray")'
[97,98,99]

(2) "when when" => "when"

You might also mention that the inverse of tobinary, as a map from JSON strings, is (or at least effectively is) encodeas("UTF-8").
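What that inverse amounts to can be sketched in Python (a hedged stand-in: Python's "replace" error handler plays the role of jq's bad-character mapping, with U+FFFD assumed as the replacement):

```python
# Round trip: the byte array of "💩" decodes back to the string...
assert bytes([240, 159, 146, 169]).decode("utf-8") == "💩"

# ...and an invalid byte gets the usual U+FFFD replacement, which is
# presumably what "apply bad character mappings" amounts to.
assert bytes([0xFF]).decode("utf-8", errors="replace") == "\ufffd"
```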


Also, I don't see how on the one hand:

./jq -n '"💩" | tobinary | encodeas("bytearray")'
[240,159,146,169]

but on the other:

./jq -n '"💩" | tobinary | encodeas("bytearray") | length'
1

Why isn't the length of the bytearray equal to 4? Note also that applying [] yields a stream with 4 items, as one would expect:

./jq -n '"💩" | tobinary | encodeas("bytearray")[]'
"ð"
"Ÿ"
"’"
"©"
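A plausible explanation for the glyphs shown, sketched in Python: they are the four raw UTF-8 bytes of the input, rendered by the terminal as Windows-1252 (this rendering is an assumption about the transcript, not something jq itself does):

```python
# The four one-byte strings above are the raw UTF-8 bytes of "💩";
# decoding each byte as Windows-1252 yields exactly the glyphs shown.
raw = "💩".encode("utf-8")
glyphs = [bytes([b]).decode("cp1252") for b in raw]
assert glyphs == ["ð", "Ÿ", "’", "©"]
```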

pkoppstein avatar Jul 22 '23 21:07 pkoppstein

2.5 cheers for encodeas!

Yeah, I like it too :) Inspired by your commentary.

Also, I don't see how on the one hand:

./jq -n '"💩" | tobinary | encodeas("bytearray")'
[240,159,146,169]

but on the other:

./jq -n '"💩" | tobinary | encodeas("bytearray") | length'
1

Why isn't the length of the bytearray equal to 4? Note also that applying [] yields a stream with 4 items, as one would expect:

Because the encoding as a byte array doesn't happen until the value is passed to tostring or output to stdout/stderr. The manual explains that, though perhaps it's not obvious enough and needs further clarification.

nicowilliams avatar Jul 22 '23 23:07 nicowilliams