[RFC] Add properties to `List` for byte-based encodings/hashes
The primary motivation here is to allow Pkl to properly work with arbitrary binary data. String can be abused to some extent, but because text encoding (and unicode normalization) is involved, what you see isn't what you get:
❯ pkl eval pkl:base -x 'let (_ = trace(0xff.toRadixString(16))) trace(0xff.toChar())' | hexdump
pkl: TRACE: 0xff.toRadixString(16) = "ff" (repl:text)
pkl: TRACE: 0xff.toChar() = "ÿ" (repl:text)
0000000 bfc3
0000002
I'm most focused on base64 here, but I don't see an reason to omit the hashes either.
This can actually be implemented in-langage, but it's far from performant:
const local base64Chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".chars
const function base64Bytes(bytes: List<UInt8>|Listing<UInt8>): String =
let (padding = if (bytes.length % 3 == 0) "" else "=".repeat(3 - (bytes.length % 3)))
bytes.fold(List(List()), (acc, elem) ->
if (acc.last.length == 3) acc + List(List(elem))
else acc.dropLast(1) + List(acc.last + List(elem))
)
.map((tuple) ->
let (n = tuple[0].shl(16) + (tuple.getOrNull(1)?.shl(8) ?? 0) + (tuple.getOrNull(2) ?? 0))
"\(base64Chars[n.ushr(18).and(63)])\(base64Chars[n.ushr(12).and(63)])\(base64Chars[n.ushr(6).and(63)])\(base64Chars[n.and(63)])"
)
.join("") + padding
Here's a test encoding a 16kB buffer:
❯ time ../pkl/pkl-cli/build/executable/jpkl eval test.pkl > /dev/null
../pkl/pkl-cli/build/executable/jpkl eval test.pkl > /dev/null 19.55s user 0.36s system 125% cpu 15.862 total
And here's the exact same evaluation switched to use the List.base64 property introduced in this PR:
❯ time ../pkl/pkl-cli/build/executable/jpkl eval test.pkl > /dev/null
../pkl/pkl-cli/build/executable/jpkl eval test.pkl > /dev/null 4.22s user 0.19s system 325% cpu 1.356 total
This is roughly a 12x speedup!
Methods on generic types that only work for some type arguments don't really exist in Pkl yet. I considered an entirely separate stdlib class (possibly even a List subclass?) that forces the element type to UInt8 I'm interested in hearing thoughts on how to best approach this, but this was the most expedient path to prove the concept.
As for why I'm doing this at all, here's a teaser:
Another solution, which might be cleaner, is to have a pkl:bytes standard library module, which has:
external function base64(bytes: List<UInt8>)
Overall, I think we should think about how Pkl should be representing binary data first. Having a .base64 etc for binary data would naturally be part of the binary data API.
Some thoughts about using List<UInt8> for binary data:
Good parts:
- We can probably optimize this by representing
List<UInt8>asbyte[]within our truffle implementation. - No new type introduced.
Concerns:
- Ideally we'd want to export this as the corresponding byte array type in codegen (
byte[]in Java,[]bytein Go,ByteArrayin Kotlin,[UInt8]in Swift). But there's nothing special aboutList<UInt8> - Would also need to special-case messagepack serialization? (should represent this using the
binarymessage format, instead of thearraymessage format).
Submitted a SPICE for byte arrays, see https://github.com/apple/pkl-evolution/pull/14
Superseded by #1019