npy-rs
npy-rs copied to clipboard
Supporting 1 rust type <-> many npy types
There are many datatypes which can appear in NPY files which currently cannot be supported by the design of the Serialize
trait. To name a few:
- Anything that's big-endian.
- Datetimes and timedeltas (
<M8[ns]
,<m8[ns]
), which can have a variety of units. - The
|S
format forbytes
(python 2:str
) objects, the<U
format forstr
(python 2:unicode
) objects, and the|V
format for binary blobs, all of which can have arbitrary data sizes.
In order to support any of these, we need to get rid of the "one type string <--> one type" way of thinking. Reading and writing needs to depend on the type string:
// was:
pub trait Serializable : Sized {
fn dtype() -> DType;
fn n_bytes() -> usize;
fn read(c: &[u8]) -> Self;
fn write<W: Write>(&self, writer: &mut W) -> Result<()>;
}
// must now instead be something closer to:
pub trait Serializable : Sized {
fn recommended_dtype() -> DType;
fn read(&Dtype, &[u8]) -> Self;
fn write<W: Write>(&self, &Dtype, &mut W) -> Result<()>;
}
impl DType {
pub fn n_bytes(&self) -> usize;
}
Of course, having every read
or write
dispatch on the DType could incur a performance loss, so the above is not precisely the design we want.
I've been prototyping a solution myself, which I'll write about in a separate comment.
I am currently prototyping replacement traits for Serializable
in my own fork.
At the moment, they are completely unused by the crate (I've built them from the bottom up in order to see what was feasible, and I plan to integrate them into the crate later). The bulk of the changes are in these files:
In just a couple of days they've already had to undergo numerous broad changes. At the time of writing, they look like this:
// implemented by i32, u64, Vec<u8>... but not [u8]
pub trait Deserialize: Sized {
type Reader: TypeRead<Value=Self>;
fn reader(type_str: &TypeStr) -> Result<Self::Reader, TypeStrUnsupportedError>;
}
// implemented by i32, u64, [u8]...
pub trait Serialize {
type Writer: TypeWrite<Value=Self>;
fn writer(type_str: &TypeStr) -> Result<Self::Writer, TypeStrUnsupportedError>;
}
// Not implemented for Vec<u8>, which can become either `|Sn` or `|Vn`.
pub trait AutoSerialize: Serialize {
fn default_type_str() -> TypeStr;
}
// basically Fn(&[u8]) -> T with some overloadable methods for efficiency
pub trait TypeRead {
...
}
// sort of like: for<W: io::Write> Fn(W, &T) -> io::Result<()>,
// if that type was expressible in rust
pub trait TypeWrite {
...
}
Some things that the new design can do:
- Serializing i32 to an endianness of choice
- Serializing
&[u8]
to|S
or|V
, and reading them back asVec<u8>
. - Reading/writing
np.datetime64
asu64
andnp.timedelta64
asi64
.
(it also used to support e.g. reading |u4
as u64
or even i64
(types that impl From<u32>
) but I couldn't find any way to make it easy to maintain, so I ripped it out)
It's reaching the point where I need to start figuring out how to fit it into the existing design, supporting the current usage of arrays and #[derive]
.
If the performance impact ends up being significant, then under this design it should be possible to introduce wrapper types like Le<i32>
that only support a single endianness, and then add field annotations like #[npy(le)]
to let them be used invisibly by the derive
s:
#[derive(npy::Serialize, npy::Deserialize)]
struct MyStruct {
#[npy(le)] foo: i32,
#[npy(le)] bar: i64,
}
...though I'm hoping it doesn't come to that.
Any thoughts, suggestions, or critique is welcome.