micropython-ulab

numpy.load() feature upgrade

Open · hamza-712 opened this issue 2 years ago • 12 comments

Hi, can you please add a very useful feature: using memmap to load only a part of a numpy array from a file? For example:

  1. Let's create a memory-mapped array in write mode:

import numpy as np
nrows, ncols = 1000000, 100
f = np.memmap('memmapped.dat', dtype=np.float32,
              mode='w+', shape=(nrows, ncols))

  2. Let's feed the array with random values, one column at a time, because our system's memory is limited:

for i in range(ncols):
    f[:, i] = np.random.rand(nrows)
x = f[:, -1]
del f

### READING

f = np.memmap('memmapped.dat', dtype=np.float32,
              shape=(nrows, ncols))
np.array_equal(f[:, -1], x)
True
del f

Additional context: this is what numpy.load() offers through its mmap_mode argument, i.e. using numpy.memmap inside numpy.load() (https://numpy.org/doc/stable/reference/generated/numpy.load.html).
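
For reference, this is roughly what that looks like in CPython: numpy.load() memory-maps the file when mmap_mode is given, and only the sliced block is actually pulled into RAM. A minimal sketch, with a placeholder filename:

import numpy as np

a = np.arange(1000 * 7, dtype=np.float32).reshape(1000, 7)
np.save('example.npy', a)                     # plain .npy file on disc

m = np.load('example.npy', mmap_mode='r')     # memory-mapped, nothing read yet
block = np.array(m[10:20, :])                 # only rows 10..19 are copied into RAM
print(block.shape)                            # (10, 7)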

hamza-712 avatar Aug 09 '23 10:08 hamza-712

I believe this is actually much more than just reading part of the file; at least, that is what I understand from https://numpy.org/doc/stable/reference/generated/numpy.memmap.html. Basically, you don't load anything with memmap, you just create a pointer to data on the disc, so if you take the method sum as an example, sum has to know how to handle data that are not stored in RAM, and that is highly non-trivial.

v923z avatar Aug 09 '23 12:08 v923z

Can you implement a way to save numpy arrays in append mode? Similarly, a way to read a partial subarray of a numpy array with some kind of 'offset' variable?

hamza-712 avatar Aug 11 '23 20:08 hamza-712

Can you point to the relevant documentation?

v923z avatar Aug 13 '23 16:08 v923z

For appending arrays, there is a library, though it is not part of official numpy: https://pypi.org/project/npy-append-array/

For reading, I haven't seen any other way implemented than h5py or numpy.memmap: https://numpy.org/doc/stable/reference/generated/numpy.load.html
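
If I remember the npy-append-array README correctly, its CPython usage is roughly the following (filenames are placeholders):

import numpy as np
from npy_append_array import NpyAppendArray

arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

with NpyAppendArray('out.npy') as npaa:
    npaa.append(arr1)                      # creates out.npy on the first call
    npaa.append(arr2)                      # subsequent calls grow the same file

data = np.load('out.npy', mmap_mode='r')   # the (4, 2) result, read back lazily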

hamza-712 avatar Aug 13 '23 19:08 hamza-712

I feel that we're rapidly going off-tangent, but still, here are a couple of comments:

  1. https://github.com/v923z/micropython-ulab/pull/327 implements more or less what you want. As I said, your request is not trivial, and we have to tread carefully here. It's no accident that that hasn't yet been merged, but we could dust it off.
  2. As you pointed out, npy-append-array is not part of numpy, which leads me to the question of whether what you would like could/should be implemented not at the C level, but in Python. If so, the next question is what you would need for that. Would it help if you had a method that simply lays bare the binary contents of an ndarray's pointer, which you could then write to a file from Python? We could turn to the methods of https://github.com/v923z/micropython-ulab/blob/master/code/utils/utils.c, and add one that gets you the ndarray. You would then manipulate the header of your .npy file from Python.
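
Just to make the idea concrete, here is a very rough sketch: assume a hypothetical accessor, say utils.raw_bytes() (the name is invented, no such function exists yet), that returns the ndarray's underlying buffer. The .npy header could then be assembled in plain Python:

import ustruct
from ulab import utils   # raw_bytes() below is the hypothetical accessor discussed above

def npy_header(shape, descr='<f4'):
    # version 1.0 .npy header: magic, version, 2-byte little-endian header
    # length, then an ASCII dict padded so that the total is a multiple of 64
    d = "{{'descr': '{}', 'fortran_order': False, 'shape': {}, }}".format(descr, tuple(shape))
    pad = (64 - (10 + len(d) + 1) % 64) % 64
    d = d + ' ' * pad + '\n'
    return b'\x93NUMPY\x01\x00' + ustruct.pack('<H', len(d)) + d.encode()

def save_npy(filename, a):
    # a is assumed to be a float32 ndarray here, matching the '<f4' descr
    with open(filename, 'wb') as f:
        f.write(npy_header(a.shape))
        f.write(utils.raw_bytes(a))   # hypothetical: dump the raw array contents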

v923z avatar Aug 14 '23 18:08 v923z

  1. I need it for ndarray file reading. It's basically audio data that I'm loading from an SD card into an ESP32. I have implemented file handling in MicroPython, but it's slow compared to normal numpy.load(): numpy.load() roughly takes around 4 ms for this size of data, and my implementation takes around 50 ms.

  2. I have made a rudimentary Python-based implementation that puts a header at the beginning of a binary file. The header needs to be edited in append mode, which slows down the write process.

> a method that simply lays bare the binary contents of an ndarray's pointer, which you could then write to a file from Python? We could turn to the methods of https://github.com/v923z/micropython-ulab/blob/master/code/utils/utils.c, and add one that gets you the ndarray. You would then manipulate the header of your .npy file from Python.

This will help a lot. Anything that lays out pointers to rows is good enough.

hamza-712 avatar Aug 21 '23 18:08 hamza-712

It's not quite clear to me what your vision for such a function would be. The way you describe it seems to indicate that you'd need access to data that is not contiguous. Is that the case?

v923z avatar Aug 27 '23 08:08 v923z

Let's say I have my data stored as an ndarray of shape (1000, 7) in a file. I want to retrieve only a (10, 7) block from the file without bringing the whole ndarray into memory.

The function should allow reading a block of rows. The function I have implemented in Python only allows reading contiguous rows from the file:

def filereader(rows_to_read=1, offset_index=0):
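
Roughly, it boils down to something like the following sketch, assuming the file holds nothing but raw little-endian float32 values and the caller knows the column count (all names are illustrative):

from ulab import numpy as np

def filereader(filename, ncols, rows_to_read=1, offset_index=0):
    itemsize = 4                                   # float32
    with open(filename, 'rb') as f:
        f.seek(offset_index * ncols * itemsize)    # jump to the first requested row
        buf = f.read(rows_to_read * ncols * itemsize)
    return np.frombuffer(buf, dtype=np.float32).reshape((rows_to_read, ncols))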

hamza-712 avatar Aug 27 '23 16:08 hamza-712

OK, so one thing we could do is add the numpy-incompatible keywords offset and count to load, so that you could start from a particular place, and read a given number of values.

There might be an issue, and I don't quite know how to handle it: if you want to add offset and count, then you have to know beforehand what the shape in the file is; otherwise, you might request something that's not compatible with the contents of the file.
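
Purely as an illustration of the idea (nothing like this is implemented at the moment), usage could look something like this for a stored (1000, 7) float32 array whose shape the caller already knows:

from ulab import numpy as np

# skip the first 10 rows (70 values), then read the next 10 rows (70 values)
block = np.load('data.npy', offset=70, count=70).reshape((10, 7))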

v923z avatar Aug 28 '23 14:08 v923z

I have created a header struct in my Python implementation which keeps track of the array dtype and the array shape.

header_format = "BBHH"
header_data = ustruct.pack(header_format, byte_size, array.dtype, row_dimension, column_dimension)

Moreover, it's better if the write operation supports only overwrite mode, so that we don't have to edit the header again and again.
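
For completeness, the matching read side of that header (the filename is a placeholder):

import ustruct

header_format = "BBHH"
header_size = ustruct.calcsize(header_format)      # 6 bytes

with open('data.bin', 'rb') as f:
    byte_size, dtype, nrows, ncols = ustruct.unpack(header_format, f.read(header_size))
    # row r of the payload starts at header_size + r * ncols * byte_size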

hamza-712 avatar Aug 29 '23 09:08 hamza-712

What you're saying here doesn't address the issue I mentioned earlier. If we add a keyword or something like that to load, then we cannot rely on the fact that you know everything about the file that you're going to read. So, if the file contains data of shape (4, 4, 4), which is 64 entries, but you're trying to read into a shape (2, 5), what should happen?

Also, the title of this thread is "numpy.load() feature upgrade", so we shouldn't talk about write operations here. Even memmap is about reading from a file, not writing to it. I have the feeling that we're dealing with feature creep here. Could you, please, define exactly what this new feature of the load function should do?

We might actually be better off adding the function to utils, if you really need it.

v923z avatar Aug 29 '23 12:08 v923z

I wrote an implementation of .npy file loading/saving for MicroPython, which also supports streaming reading of data. The streaming API is different from the numpy.load() one, to allow accessing/validating the metadata/structure information before actually reading the data. https://github.com/jonnor/micropython-npyfile?tab=readme-ov-file#streaming-read

jonnor avatar Aug 11 '24 19:08 jonnor