pkl icon indicating copy to clipboard operation
pkl copied to clipboard

[RFC] Add properties to `List` for byte-based encodings/hashes

Open HT154 opened this issue 11 months ago • 3 comments

The primary motivation here is to allow Pkl to properly work with arbitrary binary data. String can be abused to some extent, but because text encoding (and unicode normalization) is involved, what you see isn't what you get:

❯ pkl eval pkl:base -x 'let (_ = trace(0xff.toRadixString(16))) trace(0xff.toChar())' | hexdump                 
pkl: TRACE: 0xff.toRadixString(16) = "ff" (repl:text)
pkl: TRACE: 0xff.toChar() = "ÿ" (repl:text)
0000000 bfc3                                   
0000002

I'm most focused on base64 here, but I don't see an reason to omit the hashes either.

This can actually be implemented in-langage, but it's far from performant:

const local base64Chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".chars
const function base64Bytes(bytes: List<UInt8>|Listing<UInt8>): String =
  let (padding = if (bytes.length % 3 == 0) "" else "=".repeat(3 - (bytes.length % 3)))
    bytes.fold(List(List()), (acc, elem) ->
      if (acc.last.length == 3) acc + List(List(elem))
      else acc.dropLast(1) + List(acc.last + List(elem))
    )
      .map((tuple) ->
      let (n = tuple[0].shl(16) + (tuple.getOrNull(1)?.shl(8) ?? 0) + (tuple.getOrNull(2) ?? 0))
        "\(base64Chars[n.ushr(18).and(63)])\(base64Chars[n.ushr(12).and(63)])\(base64Chars[n.ushr(6).and(63)])\(base64Chars[n.and(63)])"
    )
      .join("") + padding

Here's a test encoding a 16kB buffer:

❯ time ../pkl/pkl-cli/build/executable/jpkl eval test.pkl > /dev/null
../pkl/pkl-cli/build/executable/jpkl eval test.pkl > /dev/null  19.55s user 0.36s system 125% cpu 15.862 total

And here's the exact same evaluation switched to use the List.base64 property introduced in this PR:

❯ time ../pkl/pkl-cli/build/executable/jpkl eval test.pkl > /dev/null
../pkl/pkl-cli/build/executable/jpkl eval test.pkl > /dev/null  4.22s user 0.19s system 325% cpu 1.356 total

This is roughly a 12x speedup!

Methods on generic types that only work for some type arguments don't really exist in Pkl yet. I considered an entirely separate stdlib class (possibly even a List subclass?) that forces the element type to UInt8 I'm interested in hearing thoughts on how to best approach this, but this was the most expedient path to prove the concept.

As for why I'm doing this at all, here's a teaser: Screenshot 2025-02-01 at 16 58 38

HT154 avatar Feb 02 '25 01:02 HT154

Another solution, which might be cleaner, is to have a pkl:bytes standard library module, which has:

external function base64(bytes: List<UInt8>)

bioball avatar Feb 03 '25 15:02 bioball

Overall, I think we should think about how Pkl should be representing binary data first. Having a .base64 etc for binary data would naturally be part of the binary data API.

Some thoughts about using List<UInt8> for binary data:

Good parts:

  • We can probably optimize this by representing List<UInt8> as byte[] within our truffle implementation.
  • No new type introduced.

Concerns:

  • Ideally we'd want to export this as the corresponding byte array type in codegen (byte[] in Java, []byte in Go, ByteArray in Kotlin, [UInt8] in Swift). But there's nothing special about List<UInt8>
  • Would also need to special-case messagepack serialization? (should represent this using the binary message format, instead of the array message format).

bioball avatar Feb 03 '25 19:02 bioball

Submitted a SPICE for byte arrays, see https://github.com/apple/pkl-evolution/pull/14

bioball avatar Mar 06 '25 21:03 bioball

Superseded by #1019

HT154 avatar Jul 22 '25 03:07 HT154