stdlib
[RFC]: a simple flat array format for ndarrays
Description
This RFC proposes introducing a simple flat array format for ndarrays and is inspired by work involving the integration of stdlib in Google Sheets. The motivation for this RFC is to provide a human-readable, non-binary, JSON-compatible format for serializing and deserializing ndarrays.
At a high level, the format is comprised of a version, header, and list of data buffer elements.
<version> | <header> | <data>
The version component would be comprised of two elements:
[ 'version', '<semver>', ... ]
The first element is the string literal 'version' and is followed by a version string in semver format. It is not anticipated that the patch field of the version string will be used. Only major (breaking changes) and minor (new features/header fields) version fields should update over time.
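For illustration, a parser might read the version component along the following lines. This is only a sketch: `readVersion` is a hypothetical helper (not an existing stdlib API), and it assumes plain `MAJOR.MINOR.PATCH` strings without pre-release or build tags.

```javascript
// Hypothetical helper (not part of stdlib): reads the version component
// from the front of a serialized array.
function readVersion( arr ) {
    var m;
    if ( arr[ 0 ] !== 'version' || typeof arr[ 1 ] !== 'string' ) {
        throw new TypeError( 'invalid version component' );
    }
    // Assumption: only plain MAJOR.MINOR.PATCH strings are supported.
    m = /^(\d+)\.(\d+)\.(\d+)$/.exec( arr[ 1 ] );
    if ( m === null ) {
        throw new TypeError( 'invalid version string: ' + arr[ 1 ] );
    }
    return {
        'major': parseInt( m[ 1 ], 10 ),
        'minor': parseInt( m[ 2 ], 10 ),
        'patch': parseInt( m[ 3 ], 10 ),
        'length': 2 // number of serialized elements occupied by the version component
    };
}
```

A consumer could then reject an unknown `major` value and tolerate a newer `minor` value, in keeping with the intent described above.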
The header component would be structured as follows:
'ndarray' | shape | strides | offset | order | dtype | length | capacity | 'data'
and as part of the serialized array
[ ..., 'ndarray', 'shape', ...shape, 'strides', ...strides, 'offset', offset, 'order', order, 'dtype', dtype, 'length', length, 'capacity', capacity, 'data', ... ]
where
- 'ndarray' is the string literal 'ndarray'.
- 'shape' is the string literal 'shape'.
- ...shape is 0 or more dimension sizes. For a zero-dimensional array, no dimension sizes should be present.
- 'strides' is the string literal 'strides'.
- ...strides is 1 or more dimension strides. For a zero-dimensional array, one stride should be present, which should be equal to 0.
- 'offset' is the string literal 'offset'.
- offset is a nonnegative integer indicating the index offset in the data buffer marking the first indexed element. The offset of the first indexed element in the serialized format would be version_length + header_length + offset, where one must take into account the version and header lengths.
- 'order' is the string literal 'order'.
- order is either 'row-major' or 'column-major'.
- 'dtype' is the string literal 'dtype'.
- dtype is the ndarray data type string (e.g., 'float64', 'complex128', 'int32', etc).
- 'length' is the string literal 'length'.
- length is a nonnegative integer indicating how many elements are indexed by the ndarray. For a zero-dimensional array, this should equal 1. For non-zero-dimensional arrays, this should be equal to the product of dimension sizes, as listed in shape.
- 'capacity' is the string literal 'capacity'.
- capacity is a nonnegative integer indicating how many elements are in the data buffer. This value should be compatible with the specified ndarray meta data (i.e., shape, strides, offset). For zero-dimensional arrays, this should be greater than or equal to 1.
- 'data' is the string literal 'data' and should be followed by data buffer elements.
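The constraints on length and capacity listed above can be checked mechanically. The following is a minimal sketch, assuming the header has already been parsed into a plain object; `checkMetaData` and the object layout are hypothetical and used only for illustration.

```javascript
// Hypothetical helper: checks that `length` and `capacity` are consistent
// with `shape`, `strides`, and `offset`.
function checkMetaData( meta ) {
    var len = 1;
    var min = meta.offset;
    var max = meta.offset;
    var ok;
    var s;
    var i;
    // For a zero-dimensional array, `shape` is empty, so `len` stays 1
    // and the only indexed element is located at `offset`.
    for ( i = 0; i < meta.shape.length; i++ ) {
        len *= meta.shape[ i ];
        // Farthest displacement reachable along this dimension:
        s = meta.strides[ i ] * ( meta.shape[ i ] - 1 );
        if ( s > 0 ) {
            max += s;
        } else {
            min += s;
        }
    }
    ok = ( meta.length === len );
    if ( len > 0 ) {
        // Every indexed element must reside within the data buffer:
        ok = ok && min >= 0 && max < meta.capacity;
    }
    return {
        'length': len,
        'minIndex': min,
        'maxIndex': max,
        'valid': ok
    };
}
```

Note that, for a zero-dimensional array with offset 0, the check reduces to requiring a capacity of at least 1, which matches the rule stated in the list.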
The 'ndarray' string literal is required to be the first header element. The 'data' string literal is required to be the last header element. For the other header elements, each string literal and associated value pair can be arranged in any order. E.g.,
[ ..., 'ndarray', 'capacity', capacity, 'length', length, 'dtype', dtype, 'order', order, 'offset', offset, 'strides', ...strides, 'shape', ...shape, 'data', ... ]
would be valid. Parsers should not assume any particular string literal and value pair order and should instead identify a sub-header element by the string literal indicating its beginning.
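One way to satisfy the order-independence requirement is to dispatch on each string literal as it is encountered. The sketch below is illustrative only: `parseHeader`, the `KEYS` list, and the returned object layout are assumptions rather than a proposed API, and the version component is assumed to occupy the first two elements.

```javascript
// Sub-header string literals which may appear between 'ndarray' and 'data':
var KEYS = [ 'shape', 'strides', 'offset', 'order', 'dtype', 'length', 'capacity' ];

// Hypothetical parser: reads header fields in whatever order they appear.
function parseHeader( arr ) {
    var meta = {};
    var key;
    var i;

    i = 2; // skip the two version elements
    if ( arr[ i ] !== 'ndarray' ) {
        throw new Error( 'expected the string literal \'ndarray\'' );
    }
    i += 1;
    while ( i < arr.length && arr[ i ] !== 'data' ) {
        key = arr[ i ];
        if ( KEYS.indexOf( key ) === -1 ) {
            throw new Error( 'unexpected header element: ' + key );
        }
        i += 1;
        if ( key === 'shape' || key === 'strides' ) {
            // Variable-length fields: consume numbers until the next string literal.
            meta[ key ] = [];
            while ( typeof arr[ i ] === 'number' ) {
                meta[ key ].push( arr[ i ] );
                i += 1;
            }
        } else {
            // Fixed-length fields: exactly one value follows the literal.
            meta[ key ] = arr[ i ];
            i += 1;
        }
    }
    if ( arr[ i ] !== 'data' ) {
        throw new Error( 'missing the string literal \'data\'' );
    }
    // Index of the first data buffer element (version_length + header_length):
    meta.dataStart = i + 1;
    return meta;
}
```

For a header with a two-dimensional shape, as in the reordered example above, `dataStart` is `20`, so the first indexed element of the view is located at `dataStart + offset` in the serialized array.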
The data component is the linear data buffer atop which the serialized ndarray is a view. This data buffer is allowed to contain elements which are outside the view bounds and are not indexed by the view.
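To make the distinction between the data buffer and the view concrete, the following sketch serializes a 2x2 view atop a six-element buffer and reads back only the indexed elements. The array contents and the `viewToNested` helper are made up for illustration, and the helper is hard-coded to two dimensions to keep it short.

```javascript
// A 2x2 view (offset 1, strides [2,1]) atop a buffer holding 6 elements.
// The buffer elements 0 and 5 are outside the view bounds.
var serialized = [
    'version', '1.0.0',
    'ndarray',
    'shape', 2, 2,
    'strides', 2, 1,
    'offset', 1,
    'order', 'row-major',
    'dtype', 'float64',
    'length', 4,
    'capacity', 6,
    'data', 0, 1, 2, 3, 4, 5
];

// Hypothetical helper: materializes a two-dimensional view as nested arrays.
function viewToNested( arr, shape, strides, offset ) {
    // 'data' is the last header element, so its first occurrence marks
    // the start of the data buffer:
    var start = arr.indexOf( 'data' ) + 1;
    var out = [];
    var row;
    var i;
    var j;
    for ( i = 0; i < shape[ 0 ]; i++ ) {
        row = [];
        for ( j = 0; j < shape[ 1 ]; j++ ) {
            row.push( arr[ start + offset + ( i*strides[ 0 ] ) + ( j*strides[ 1 ] ) ] );
        }
        out.push( row );
    }
    return out;
}

var nested = viewToNested( serialized, [ 2, 2 ], [ 2, 1 ], 1 );
// => [ [ 1, 2 ], [ 3, 4 ] ]
```

Because the full buffer is retained, a different view (e.g., shape `[ 3, 2 ]` with offset `0`) could later be created atop the same six elements.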
Example
The following is an example of a 2x2 ndarray serialized to the proposed linear exchange format:
[
'version',
'1.0.0',
'ndarray',
'shape',
2,
2,
'strides',
2,
1,
'offset',
0,
'order',
'row-major',
'dtype',
'float64',
'length',
4,
'capacity',
4,
'data',
1,
2,
3,
4
]
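As a rough sketch of the serialization direction, the array above could be produced from plain meta data and a data buffer as follows. The `serialize` function and its argument layout are hypothetical and do not reflect the eventual package API; the version string is hard-coded to `'1.0.0'` as an assumption.

```javascript
// Hypothetical serializer: flattens meta data and a data buffer into the
// proposed format, using the canonical header order.
function serialize( meta, buffer ) {
    var out = [ 'version', '1.0.0', 'ndarray', 'shape' ];
    return out.concat(
        meta.shape,
        [ 'strides' ],
        meta.strides,
        [
            'offset', meta.offset,
            'order', meta.order,
            'dtype', meta.dtype,
            'length', meta.length,
            'capacity', buffer.length,
            'data'
        ],
        Array.from( buffer ) // also handles typed arrays, which `concat` would not flatten
    );
}

var arr = serialize({
    'shape': [ 2, 2 ],
    'strides': [ 2, 1 ],
    'offset': 0,
    'order': 'row-major',
    'dtype': 'float64',
    'length': 4
}, new Float64Array( [ 1, 2, 3, 4 ] ) );

// The result contains only strings and numbers and is thus JSON compatible:
var json = JSON.stringify( arr );
```

Here, the capacity is taken from the buffer length rather than supplied separately, which is one possible design choice for keeping the two values consistent.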
Note that this particular linear format is easily extendable to CSV/DSV serialization, where each column could represent a different ndarray.
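A column-oriented layout could be sketched as follows, where each serialized array occupies one column and shorter columns are padded with empty cells. The `toDelimited` helper is hypothetical, and the sketch assumes that no cell contains the delimiter, quotes, or newlines (so no escaping is performed).

```javascript
// Hypothetical helper: writes one serialized ndarray per column.
function toDelimited( columns, delimiter ) {
    var nrows = 0;
    var rows = [];
    var row;
    var i;
    var j;
    // The number of rows is determined by the longest serialized array:
    for ( j = 0; j < columns.length; j++ ) {
        nrows = Math.max( nrows, columns[ j ].length );
    }
    for ( i = 0; i < nrows; i++ ) {
        row = [];
        for ( j = 0; j < columns.length; j++ ) {
            // Pad shorter columns with empty cells:
            row.push( ( i < columns[ j ].length ) ? String( columns[ j ][ i ] ) : '' );
        }
        rows.push( row.join( delimiter ) );
    }
    return rows.join( '\n' );
}
```

Reading such a file back would amount to splitting each row on the delimiter, collecting the non-empty cells of a column, and converting numeric cells back to numbers before parsing the header.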
Proposal
As part of this RFC, the following packages are proposed:
- @stdlib/ndarray/[base/]to-linear-exchange-format: serializes an ndarray to the proposed format.
- @stdlib/ndarray/[base/]from-linear-exchange-format: converts a serialized ndarray to an ndarray instance.
where [base/] indicates both base and non-base package versions.
The format name and associated package names are not set in stone. Any naming suggestions are welcome.
Prior Art
- ndarray objects can already be serialized to JSON, using the ndarray#toJSON method; however, the format is not a linear data structure (nor should it necessarily be) and does not serialize the data buffer outside of the array view. This prevents creating subsequent views of different sizes atop the same data buffer.
- NumPy has an *.npy format; however, this does not include some of the meta data proposed in this RFC and is not human-readable.
- NumPy also has an API, savetxt, for saving an ndarray to text; however, this is primarily oriented to formatting, similar to @stdlib/string/format.
Related Issues
None.
Questions
No.
Other
No.
Checklist
- [X] I have read and understood the Code of Conduct.
- [X] Searched for existing issues and pull requests.
- [X] The issue name begins with RFC:.
cc @Planeshifter
@kgryte From what I understand, we just have to write methods to serialize and deserialize a multi-dimensional ndarray into a linear array that preserves the original metadata right? Is this issue open for contribution? I would like to work on this.
@Snehil-Shah Let me circle back on this, as we may already have an implementation written elsewhere.
Cool, I was also interested in contributing to the Google Sheets integration project. If you don't mind, can you point me to the hows and wheres? Thanks!
@Snehil-Shah For that, see https://github.com/stdlib-js/gsheets.