npy-rs Supporting 1 rust type <-> many npy types

Supporting 1 rust type <-> many npy types

Open ExpHP opened this issue 5 years ago • 1 comments

There are many datatypes which can appear in NPY files which currently cannot be supported by the design of the Serialize trait. To name a few:

Anything that's big-endian.
Datetimes and timedeltas (<M8[ns], <m8[ns]), which can have a variety of units.
The |S format for bytes (python 2: str) objects, the <U format for str (python 2: unicode) objects, and the |V format for binary blobs, all of which can have arbitrary data sizes.

In order to support any of these, we need to get rid of the "one type string <--> one type" way of thinking. Reading and writing needs to depend on the type string:

// was:
pub trait Serializable : Sized {
    fn dtype() -> DType;
    fn n_bytes() -> usize;
    fn read(c: &[u8]) -> Self;
    fn write<W: Write>(&self, writer: &mut W) -> Result<()>;
}

// must now instead be something closer to:
pub trait Serializable : Sized {
    fn recommended_dtype() -> DType;
    fn read(&Dtype, &[u8]) -> Self;
    fn write<W: Write>(&self, &Dtype, &mut W) -> Result<()>;
}
impl DType {
    pub fn n_bytes(&self) -> usize;
}

Of course, having every read or write dispatch on the DType could incur a performance loss, so the above is not precisely the design we want.

I've been prototyping a solution myself, which I'll write about in a separate comment.

Jun 03 '19 00:06 ExpHP

I am currently prototyping replacement traits for Serializable in my own fork.

At the moment, they are completely unused by the crate (I've built them from the bottom up in order to see what was feasible, and I plan to integrate them into the crate later). The bulk of the changes are in these files:

In just a couple of days they've already had to undergo numerous broad changes. At the time of writing, they look like this:

// implemented by i32, u64, Vec<u8>... but not [u8]
pub trait Deserialize: Sized {
    type Reader: TypeRead<Value=Self>;

    fn reader(type_str: &TypeStr) -> Result<Self::Reader, TypeStrUnsupportedError>;
}

// implemented by i32, u64, [u8]...
pub trait Serialize {
    type Writer: TypeWrite<Value=Self>;

    fn writer(type_str: &TypeStr) -> Result<Self::Writer, TypeStrUnsupportedError>;
}

// Not implemented for Vec<u8>, which can become either `|Sn` or `|Vn`.
pub trait AutoSerialize: Serialize {
    fn default_type_str() -> TypeStr;
}

// basically Fn(&[u8]) -> T with some overloadable methods for efficiency
pub trait TypeRead {
    ...
}

// sort of like:  for<W: io::Write> Fn(W, &T) -> io::Result<()>,
// if that type was expressible in rust
pub trait TypeWrite {
    ...
}

Some things that the new design can do:

Serializing i32 to an endianness of choice
Serializing &[u8] to |S or |V, and reading them back as Vec<u8>.
Reading/writing np.datetime64 as u64 and np.timedelta64 as i64.

(it also used to support e.g. reading |u4 as u64 or even i64 (types that impl From<u32>) but I couldn't find any way to make it easy to maintain, so I ripped it out)

It's reaching the point where I need to start figuring out how to fit it into the existing design, supporting the current usage of arrays and #[derive].

If the performance impact ends up being significant, then under this design it should be possible to introduce wrapper types like Le<i32> that only support a single endianness, and then add field annotations like #[npy(le)] to let them be used invisibly by the derives:

#[derive(npy::Serialize, npy::Deserialize)]
struct MyStruct {
    #[npy(le)] foo: i32,
    #[npy(le)] bar: i64,
}

...though I'm hoping it doesn't come to that.

Any thoughts, suggestions, or critique is welcome.

Jun 03 '19 00:06 ExpHP

npy-rs npy-rs copied to clipboard

Supporting 1 rust type <-> many npy types

npy-rs
npy-rs copied to clipboard