micropython-ulab
numpy.load() feature upgrade
Hi, can you please add a very useful feature: using memmap to load only a part of a numpy array from a file? For example:
- Let's create a memory-mapped array in write mode:
```python
import numpy as np

nrows, ncols = 1000000, 100
f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))
```
- Let's feed the array with random values, one column at a time because our system's memory is limited!
```python
for i in range(ncols):
    f[:, i] = np.random.rand(nrows)
x = f[:, -1]
del f
```
```python
# READING
f = np.memmap('memmapped.dat', dtype=np.float32,
              shape=(nrows, ncols))
np.array_equal(f[:, -1], x)  # -> True
del f
```
Additional context: numpy.load() can itself use numpy.memmap internally via its mmap_mode argument (https://numpy.org/doc/stable/reference/generated/numpy.load.html).
I believe this is actually much more than just reading part of a file; at least, this is what I understand from https://numpy.org/doc/stable/reference/generated/numpy.memmap.html. Basically, you don't load anything with memmap, you just create a pointer to data on disk. So if you take the sum method as an example, sum has to know how to handle data that are not stored in RAM, and that is highly non-trivial.
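To make the point concrete, here is a minimal sketch (in CPython numpy, not ulab) of what even a plain sum over data that is not in RAM would entail: explicit chunked reads and a running reduction. The file name and chunk size are illustrative.

```python
import os
import numpy as np

fname = 'memmapped.dat'
itemsize = np.dtype(np.float32).itemsize
n_items = os.path.getsize(fname) // itemsize

# Sum the on-disk array chunk by chunk instead of loading it whole.
total = 0.0
chunk_elems = 4096
for start in range(0, n_items, chunk_elems):
    chunk = np.fromfile(fname, dtype=np.float32,
                        count=min(chunk_elems, n_items - start),
                        offset=start * itemsize)
    total += chunk.sum()
```

Every other reduction method (mean, max, and so on) would need similar chunk-aware logic.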
Can you implement a way to save numpy arrays in append mode? Similarly, a way to read a partial subarray with some kind of 'offset' variable?
Can you point to the relevant documentation?
For appending arrays, there is a library; it is not part of the official numpy docs: https://pypi.org/project/npy-append-array/
For reading, I haven't seen any way implemented other than h5py or numpy.memmap: https://numpy.org/doc/stable/reference/generated/numpy.load.html
I feel that we're rapidly going off on a tangent, but still, here are a couple of comments:
- https://github.com/v923z/micropython-ulab/pull/327 implements more or less what you want. As I said, your request is not trivial, and we have to tread carefully here. It's no accident that that hasn't yet been merged, but we could dust it off.
- As you pointed out, npy-append-array is not part of numpy, which leads me to the question of whether what you would like could/should be implemented not at the C level, but in Python. If so, the next question is what you would need for that. Would it help if you had a method that simply lays bare the binary contents of an ndarray's pointer, which you could then write to a file from Python? We could extend the methods of https://github.com/v923z/micropython-ulab/blob/master/code/utils/utils.c, and add one that gets you the ndarray. You would then manipulate the header of your .npy file from Python (a rough sketch of what that could look like follows below).
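Purely hypothetical sketch of the idea above: none of this exists in ulab today, and the accessor name utils.get_buffer() is invented here for illustration only.

```python
from ulab import numpy as np
from ulab import utils

a = np.array([1.0, 2.0, 3.0], dtype=np.float32)

# Append the raw array contents to an existing .npy file from Python.
with open('data.npy', 'ab') as f:
    f.write(utils.get_buffer(a))  # hypothetical accessor, does not exist yet

# Since the .npy header stores the shape as plain ASCII text, the
# header would then be re-opened and patched from Python after each
# append, without any further C-level support.
```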
- I need it for ndarray file reading. It's basically audio data that I'm loading from an SD card into an ESP32. I have implemented file handling in MicroPython, but it's slow compared to normal numpy.load(): numpy.load() usually takes around 4 ms for data of this size, while my implementation takes around 50 ms.
- I have made a rudimentary Python-based implementation that puts a header at the beginning of a binary file. The header needs to be edited in append mode, which slows down the write process.
> Would it help if you had a method that simply lays bare the binary contents of an ndarray's pointer, which you could then write to a file from Python? We could extend the methods of https://github.com/v923z/micropython-ulab/blob/master/code/utils/utils.c, and add one that gets you the ndarray. You would then manipulate the header of your .npy file from Python.
This would help a lot. Anything that lays out pointers to rows is good enough.
It's not quite clear to me what your vision for such a function would be. The way you describe it seems to indicate that you'd need access to data that is not contiguous. Is that the case?
Let's say I have my data stored as a (1000, 7) ndarray in a file, and I want to retrieve only a (10, 7) block without bringing the whole ndarray into memory. The function should allow reading some block of rows. The one I have implemented in Python can only read contiguous rows from the file:
```python
def filereader(rows_to_read=1, offset_index=0):
    ...
```
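A minimal sketch of what such a reader might look like, assuming the 6-byte "BBHH" header described further down in this thread; the filename parameter is an addition, and the layout is only my guess at the author's format:

```python
import ustruct

def filereader(filename, rows_to_read=1, offset_index=0):
    """Read a contiguous block of rows from a headered binary file."""
    with open(filename, 'rb') as f:
        # Header: element byte size, dtype code, row count, column count.
        byte_size, dtype_code, nrows, ncols = ustruct.unpack("BBHH", f.read(6))
        row_bytes = byte_size * ncols
        # Seek past the 6-byte header and the skipped rows, then read the block.
        f.seek(6 + offset_index * row_bytes)
        return f.read(rows_to_read * row_bytes)
```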
OK, so one thing we could do is add the numpy-incompatible keywords offset and count to load, so that you could start from a particular place, and read a given number of values.
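For reference, CPython's numpy already offers this pair of keywords, although on numpy.fromfile rather than numpy.load; it reads a flat array and leaves reshaping to the caller. File name and shapes below are illustrative:

```python
import numpy as np

# Read rows 100..109 of a headerless (nrows, 7) float32 file:
# skip 100 rows' worth of bytes, read 10 * 7 items, then reshape.
block = np.fromfile('memmapped.dat', dtype=np.float32,
                    count=10 * 7, offset=100 * 7 * 4)
block = block.reshape((10, 7))
```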
There might be an issue, and I don't quite know how to handle that: if you want to add offset and count, then you have to know beforehand what the shape in the file is, otherwise, you might request something that's not compatible with the contents of the file.
I have created a header struct in my Python code which keeps track of the array dtype and shape:
```python
import ustruct

header_format = "BBHH"
header_data = ustruct.pack(header_format, byte_size, array.dtype,
                           row_dimension, column_dimension)
```
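For completeness, reading such a header back is the mirror operation: "BBHH" packs into 6 bytes (1 + 1 + 2 + 2), so six bytes recover the four fields. The file name here is illustrative:

```python
import ustruct

with open('data.bin', 'rb') as f:
    byte_size, dtype_code, nrows, ncols = ustruct.unpack("BBHH", f.read(6))
```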
Moreover, it's better if the write operation supports only overwrite mode, so that we don't have to edit the header again and again.
What you're saying here doesn't address the issue I mentioned earlier. If we add a keyword or something like that to load, then we cannot rely on you knowing everything about the file that you're going to read. So, if the file contains data of shape (4, 4, 4), which is 64 entries, but you're trying to read into a shape (2, 5), what should happen?
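For what it's worth, in CPython numpy the equivalent mismatch simply raises, which is one possible answer:

```python
import numpy as np

a = np.arange(64)   # e.g. the flattened (4, 4, 4) data
a.reshape((2, 5))   # ValueError: cannot reshape array of size 64 into shape (2,5)
```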
Also, the title of this thread is "numpy.load() feature upgrade", so we shouldn't talk about write operations here. Even memmap is about reading from a file, not writing to it. I have the feeling that we're dealing with feature creep here. Could you, please, define exactly what this new feature of the load function should do?
We might actually be better off adding the function to utils, if you really need it.
I wrote an implementation of .npy file loading/saving for MicroPython, which also supports streaming reads of the data. The streaming API is different from the numpy.load() one, to allow accessing/validating the metadata/structure information before actually reading the data: https://github.com/jonnor/micropython-npyfile?tab=readme-ov-file#streaming-read