rfcs Add stateless alternative to buffered's Reader class

The Reader class of the buffered package contains functions that are useful to read common basic type (such as F32, F64, U32, etc..) from a byte array. Unfortunately, those functions are not stateless and cannot be used in some wider context (my use case was a Array[U8 val] ref). It would be great to add stateless functions that Reader can be built on and Pony's users can reused too. They would have a signature similar to this one:

primitive BigEndian
primitive LittleEndian
type EndianMode is (BigEndian | LittleEndian)
fun box peek_u32(array: Array[U8] box, endianMode: EndianMode, offset: USize) : U32 ?

I suggest that it could be a great addition to add endian-less function that would pick automatically one of the two method depending on which architecture it is running to avoid code like this:

let x = 
  if is_running_le() then 
    peek_u32(my_array, 0, LittleEndian)?
  else
    peek_u32(my_array, 0, BigEndian)?
  end

Nov 29 '17 09:11 codec-abc

I definitely agree with the general idea, but I'd prefer to see a signature like this:

fun box peek_u32_be(seq: ReadSeq[U8] box, offset: USize): U32 ?
fun box peek_u32_le(seq: ReadSeq[U8] box, offset: USize): U32 ?

Using ReadSeq instead of Array is more flexible.
I'm not sure of the use case for a call site that need to use the endian-ness as an argument because it doesn't know the choice statically, and I think that pretty much the entire method body inside the method would be a match block that uses a different strategy based on the argument. In other words, I don't think it benefits the caller or the method implementation to do this, so I think it's an over-generalization of the abstraction. If we really wanted to proceed with this though, I'd prefer to see it as a type parameter rather than a type argument so that we aren't paying a runtime cost for the match and the argument.

I suggest that it could be a great addition to add endian-less function that would pick automatically one of the two method depending on which architecture

I don't think this is actually a common use case? I think pretty much everyone using a feature like this in Pony is using it to read either a network protocol or a file format, which is designed either as big-endian or little-endian as part of the protocol/format specification, regardless of the system architecture it's running on. So I'd be really surprised to see any code "in the wild" that doesn't make a static choice for big-endian vs little-endian.

Nov 29 '17 15:11 jemc

Totally agree with 1. For 2, I mostly agree and I think too that in most cases the choice between BigEndian and LittleEndian is known statically and it is better. From a theoretical point of view, it may however happen that a user wish to parse the same data either with BigEndian or LittleEndian depending on a factor known at runtime. It seems that TIFF files can be encoded in both format and a flag tell which one to use. So I guess it may be better that the functions in the std lib can handle that case too. By the way I didn't understood what you mean exactly by:

I'd prefer to see it as a type parameter rather than a type argument

About

I suggest that it could be a great addition to add endian-less function that would pick automatically one of the two method depending on which architecture

To give some context, this is a case where I would have needed it if I wanted to support multiple architecture for the same Pony app:

I am doing a wrapper for SDL2. To deal with events (and emulate polymorphism in C) SDL2 returns a union of different structs. The first byte, give the event type and can be used to cast the union to the correct structure. What I did (by doing the same thing that the Rust wrapper) is generate a byte array big enough to fit all the structs of the union and gave it to SDL2 via FFI. Then, once SDL2 has filled it, I read the first byte to know the event type and then read the content using functions like peek_u32 to retrieve the meaningful event's data from the byte array. Now, If I wanted the same application to run on LittleEndian (say x86/64) and BigEndian (ARM) I would need to check each time on which architecture the code is running to call the correct peek function. However, I do agree that this case is rather uncommon.

To sump up, I think the following functions covers all cases while maintaining a simple API (except maybe for the functions' names) with good performance for the common cases.

// The functions you probably want to use in 99% of cases.
fun box peek_u32_be(seq: ReadSeq[U8] box, offset: USize): U32 ?
fun box peek_u32_le(seq: ReadSeq[U8] box, offset: USize): U32 ?

// A one that just call one of the upper 2 depending of a parameter
type EndianMode is (BigEndian | LittleEndian)
fun box peek_u32(array: ReadSeq[U8] box, endianMode: EndianMode, offset: USize) : U32 ?

// And basically a one that retrieve the processor EndianMode and call the previous one
fun box peek_u32_processor_endian_mode(seq: ReadSeq[U8] box, offset: USize): U32 ?

Nov 29 '17 17:11 codec-abc

@codec-abc - Thanks for giving those concrete examples (TIFF and SDL2) to justify the motivation. You convinced me :wink:.

I like the idea you shared of providing both variants so that those who know the choice statically pay no extra perf cost, and get a more succinct syntax.

:+1:

I didn't understood what you mean exactly by:

I'd prefer to see it as a type parameter rather than a type argument

Whoops, that was a bit of a typo. I should have said "I'd prefer to see it as a type parameter rather than a runtime parameter". Something like this:

fun box peek_u32[E: EndianMode](array: Array[U8] box, offset: USize) : U32 ?

However, I'm okay with the approach you shared above in your latest comment, so I don't think that's necessary.

Nov 29 '17 17:11 jemc

Since this issue is focused on a very small point it shouldn't be too hard to write a proper RFC (and even the implementation after that) which I will try to do in the near future. But before writing a proper one, I would like your opinion -and anyone interested too- on the following points:

Do theses functions belong to the Buffered Reader package or should be put in another package (or file)?
What about the names of those functions? Should the word peek be in it? Moreover, peek_u32_processor_endian_mode is rather long but I fear shortening it will reduce readability.
A boolean can be used to choose between Big and Little Endian, but I don't like. I prefer the type EndianMode is (BigEndian | LittleEndian) approach because even without named parameters it is clear what is going on at the call site. Plus, using a boolean just saves a very few characters.
What about order of parameters and their names? Personally, I like having the offset at the end with a default value of 0.

Dec 02 '17 12:12 codec-abc

It's unclear to me what the actual changes are. I use Reader in a lot of high performance code so I'm very interested in the actual changes to it and the possible performance ramifications.

I'd rather see this does as a separate package/class that could be eventually brought together if we think that is appropriate.

Ideally, I'd like to see these exist as a 3rd party library first. Evolve some there (as I suspect they are going to be prone to some API changes early on) and then moved over for inclusion in the standard library.

Dec 02 '17 16:12 SeanTAllen

The changes are to add the following stateless functions to the standard library

fun peek_X_be(seq: ReadSeq[U8] box, offset: USize): X?
fun peek_X_le(seq: ReadSeq[U8] box, offset: USize): X?
fun peek_X[E: EndianMode](array: ReadSeq[U8] box, offset: USize): X?
fun peek_X_processor_endian_mode(seq: ReadSeq[U8] box, offset: USize): X?

for any X in

(U16 | I16 | U32 | I32 | U64 | I64 | U128 | I128 | F32 | F64 | String)

and nothing more. Obviously, The Reader class could use it internally (as it provide more abstraction about reading data from a chunk of bytes) but it is not even necessary.

Dec 02 '17 18:12 codec-abc

Reader is stateful. If you are proposing adding stateless methods to it, while leaving the rest of Reader stateful, I think that's a confusing API. I think stateless methods such as these would make sense in a primitive of their own.

Dec 02 '17 19:12 SeanTAllen

I do agree. Also I think it makes sense to add the reverse ones at the same time, ie

fun x_to_U8_be(x: X): ReadSeq[U8]
fun x_to_U8_le(x: X): ReadSeq[U8]
fun x_to_U8[E: EndianMode](x: X): ReadSeq[U8]
fun x_to_U8_processor_endian_mode(x: X): ReadSeq[U8]

Dec 02 '17 19:12 codec-abc

@SeanTAllen - As I understood it, the proposal was not to change the way buffered.Reader works, but to provide a new primitive that has the read-bytes-into-data-types features of buffered.Reader, but without the actual stateful buffering features that buffered.Reader provides.

That is, the idea is that you should be able to, for example, read a big-endian I32 from a particular offset in a ReadSeq[U8] without doing any additional allocations related to buffering and interceding for multiple chunks like buffered.Reader does.

Dec 04 '17 20:12 jemc

rfcs rfcs copied to clipboard

Add stateless alternative to buffered's Reader class

rfcs
rfcs copied to clipboard