rfcs icon indicating copy to clipboard operation
rfcs copied to clipboard

Add stateless alternative to buffered's Reader class

Open codec-abc opened this issue 7 years ago • 9 comments

The Reader class of the buffered package contains functions that are useful to read common basic type (such as F32, F64, U32, etc..) from a byte array. Unfortunately, those functions are not stateless and cannot be used in some wider context (my use case was a Array[U8 val] ref). It would be great to add stateless functions that Reader can be built on and Pony's users can reused too. They would have a signature similar to this one:

primitive BigEndian
primitive LittleEndian
type EndianMode is (BigEndian | LittleEndian)
fun box peek_u32(array: Array[U8] box, endianMode: EndianMode, offset: USize) : U32 ?

I suggest that it could be a great addition to add endian-less function that would pick automatically one of the two method depending on which architecture it is running to avoid code like this:

let x = 
  if is_running_le() then 
    peek_u32(my_array, 0, LittleEndian)?
  else
    peek_u32(my_array, 0, BigEndian)?
  end

codec-abc avatar Nov 29 '17 09:11 codec-abc

I definitely agree with the general idea, but I'd prefer to see a signature like this:

fun box peek_u32_be(seq: ReadSeq[U8] box, offset: USize): U32 ?
fun box peek_u32_le(seq: ReadSeq[U8] box, offset: USize): U32 ?
  1. Using ReadSeq instead of Array is more flexible.
  2. I'm not sure of the use case for a call site that need to use the endian-ness as an argument because it doesn't know the choice statically, and I think that pretty much the entire method body inside the method would be a match block that uses a different strategy based on the argument. In other words, I don't think it benefits the caller or the method implementation to do this, so I think it's an over-generalization of the abstraction. If we really wanted to proceed with this though, I'd prefer to see it as a type parameter rather than a type argument so that we aren't paying a runtime cost for the match and the argument.

I suggest that it could be a great addition to add endian-less function that would pick automatically one of the two method depending on which architecture

I don't think this is actually a common use case? I think pretty much everyone using a feature like this in Pony is using it to read either a network protocol or a file format, which is designed either as big-endian or little-endian as part of the protocol/format specification, regardless of the system architecture it's running on. So I'd be really surprised to see any code "in the wild" that doesn't make a static choice for big-endian vs little-endian.

jemc avatar Nov 29 '17 15:11 jemc

Totally agree with 1. For 2, I mostly agree and I think too that in most cases the choice between BigEndian and LittleEndian is known statically and it is better. From a theoretical point of view, it may however happen that a user wish to parse the same data either with BigEndian or LittleEndian depending on a factor known at runtime. It seems that TIFF files can be encoded in both format and a flag tell which one to use. So I guess it may be better that the functions in the std lib can handle that case too. By the way I didn't understood what you mean exactly by:

I'd prefer to see it as a type parameter rather than a type argument

About

I suggest that it could be a great addition to add endian-less function that would pick automatically one of the two method depending on which architecture

To give some context, this is a case where I would have needed it if I wanted to support multiple architecture for the same Pony app:

I am doing a wrapper for SDL2. To deal with events (and emulate polymorphism in C) SDL2 returns a union of different structs. The first byte, give the event type and can be used to cast the union to the correct structure. What I did (by doing the same thing that the Rust wrapper) is generate a byte array big enough to fit all the structs of the union and gave it to SDL2 via FFI. Then, once SDL2 has filled it, I read the first byte to know the event type and then read the content using functions like peek_u32 to retrieve the meaningful event's data from the byte array. Now, If I wanted the same application to run on LittleEndian (say x86/64) and BigEndian (ARM) I would need to check each time on which architecture the code is running to call the correct peek function. However, I do agree that this case is rather uncommon.

To sump up, I think the following functions covers all cases while maintaining a simple API (except maybe for the functions' names) with good performance for the common cases.

// The functions you probably want to use in 99% of cases.
fun box peek_u32_be(seq: ReadSeq[U8] box, offset: USize): U32 ?
fun box peek_u32_le(seq: ReadSeq[U8] box, offset: USize): U32 ?

// A one that just call one of the upper 2 depending of a parameter
type EndianMode is (BigEndian | LittleEndian)
fun box peek_u32(array: ReadSeq[U8] box, endianMode: EndianMode, offset: USize) : U32 ?

// And basically a one that retrieve the processor EndianMode and call the previous one
fun box peek_u32_processor_endian_mode(seq: ReadSeq[U8] box, offset: USize): U32 ?

codec-abc avatar Nov 29 '17 17:11 codec-abc

@codec-abc - Thanks for giving those concrete examples (TIFF and SDL2) to justify the motivation. You convinced me :wink:.

I like the idea you shared of providing both variants so that those who know the choice statically pay no extra perf cost, and get a more succinct syntax.

:+1:


I didn't understood what you mean exactly by:

I'd prefer to see it as a type parameter rather than a type argument

Whoops, that was a bit of a typo. I should have said "I'd prefer to see it as a type parameter rather than a runtime parameter". Something like this:

fun box peek_u32[E: EndianMode](array: Array[U8] box, offset: USize) : U32 ?

However, I'm okay with the approach you shared above in your latest comment, so I don't think that's necessary.

jemc avatar Nov 29 '17 17:11 jemc

Since this issue is focused on a very small point it shouldn't be too hard to write a proper RFC (and even the implementation after that) which I will try to do in the near future. But before writing a proper one, I would like your opinion -and anyone interested too- on the following points:

  • Do theses functions belong to the Buffered Reader package or should be put in another package (or file)?
  • What about the names of those functions? Should the word peek be in it? Moreover, peek_u32_processor_endian_mode is rather long but I fear shortening it will reduce readability.
  • A boolean can be used to choose between Big and Little Endian, but I don't like. I prefer the type EndianMode is (BigEndian | LittleEndian) approach because even without named parameters it is clear what is going on at the call site. Plus, using a boolean just saves a very few characters.
  • What about order of parameters and their names? Personally, I like having the offset at the end with a default value of 0.

codec-abc avatar Dec 02 '17 12:12 codec-abc

It's unclear to me what the actual changes are. I use Reader in a lot of high performance code so I'm very interested in the actual changes to it and the possible performance ramifications.

I'd rather see this does as a separate package/class that could be eventually brought together if we think that is appropriate.

Ideally, I'd like to see these exist as a 3rd party library first. Evolve some there (as I suspect they are going to be prone to some API changes early on) and then moved over for inclusion in the standard library.

SeanTAllen avatar Dec 02 '17 16:12 SeanTAllen

The changes are to add the following stateless functions to the standard library

fun peek_X_be(seq: ReadSeq[U8] box, offset: USize): X?
fun peek_X_le(seq: ReadSeq[U8] box, offset: USize): X?
fun peek_X[E: EndianMode](array: ReadSeq[U8] box, offset: USize): X?
fun peek_X_processor_endian_mode(seq: ReadSeq[U8] box, offset: USize): X?

for any X in

(U16 | I16 | U32 | I32 | U64 | I64 | U128 | I128 | F32 | F64 | String)

and nothing more. Obviously, The Reader class could use it internally (as it provide more abstraction about reading data from a chunk of bytes) but it is not even necessary.

codec-abc avatar Dec 02 '17 18:12 codec-abc

Reader is stateful. If you are proposing adding stateless methods to it, while leaving the rest of Reader stateful, I think that's a confusing API. I think stateless methods such as these would make sense in a primitive of their own.

SeanTAllen avatar Dec 02 '17 19:12 SeanTAllen

I do agree. Also I think it makes sense to add the reverse ones at the same time, ie

fun x_to_U8_be(x: X): ReadSeq[U8]
fun x_to_U8_le(x: X): ReadSeq[U8]
fun x_to_U8[E: EndianMode](x: X): ReadSeq[U8]
fun x_to_U8_processor_endian_mode(x: X): ReadSeq[U8]

codec-abc avatar Dec 02 '17 19:12 codec-abc

@SeanTAllen - As I understood it, the proposal was not to change the way buffered.Reader works, but to provide a new primitive that has the read-bytes-into-data-types features of buffered.Reader, but without the actual stateful buffering features that buffered.Reader provides.

That is, the idea is that you should be able to, for example, read a big-endian I32 from a particular offset in a ReadSeq[U8] without doing any additional allocations related to buffering and interceding for multiple chunks like buffered.Reader does.

jemc avatar Dec 04 '17 20:12 jemc