
Adding `.view(...)`/reinterpret `dtype` method

Open jakirkham opened this issue 2 years ago • 6 comments

In NumPy (and some other libraries), arrays have a method to view the data as another dtype. This is different from astype: it takes data that may not be typed, like bytes or bytearray, and applies different dtype metadata on top of it. For example, reinterpreting the data in this way is particularly useful in distributed settings, where the data goes through serialization/deserialization steps in which metadata is extracted, sent along, and then reapplied to the data. Though this can come up in other situations as well.

cc @rgommers @kgryte (since we discussed this briefly earlier)

jakirkham avatar Sep 16 '21 17:09 jakirkham

As an example reinterpreting the data in this way can be useful particularly in distributed setting where the data goes through serialization/deserialization steps where metadata is extracted, sent along, and then reapply to the data.

This is different from numpy.ndarray.view right? In the latter case, there is already an array instance which must have a well-defined dtype, it's not just a block of memory. This particular example sounds closest to frombuffer. I'm left wondering a little why the deserialization doesn't use the correct metadata immediately though - can you point to a concrete example?

rgommers avatar Sep 16 '21 19:09 rgommers

One could do np.asarray(memoryview(buf)).view(fmt), for example. Though yes, there are similarities to np.frombuffer.

Because the memory is allocated to receive the message before any of that information arrives (it needs to be written somewhere in memory). Only after the metadata and data are stored can they go through the deserialization process.
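A minimal sketch of that receive pattern, assuming NumPy and a hypothetical metadata dict (`meta`, with made-up `dtype` and `shape` keys) that arrives separately from the payload. The buffer is allocated before anything is known about its contents, and the dtype is only applied afterwards:

```python
import numpy as np

# Hypothetical metadata that travels alongside the payload (names are illustrative).
meta = {"dtype": "<f4", "shape": (3,)}
payload = b"\x00\x00\x00\x00\x00\x00\x80?\x00\x00\x00@"

# Allocate the receive buffer up front; at this point it is just bytes.
buf = bytearray(len(payload))
buf[:] = payload  # stands in for e.g. a socket's recv_into(buf)

# Reinterpret the existing bytes without copying them.
via_view = np.asarray(memoryview(buf)).view(meta["dtype"]).reshape(meta["shape"])
via_frombuffer = np.frombuffer(buf, dtype=meta["dtype"]).reshape(meta["shape"])

print(via_view)                                    # [0. 1. 2.]
print(np.shares_memory(via_view, via_frombuffer))  # True
```

Both spellings end up looking at the same memory here; the difference is that `.view` starts from an existing array, while `np.frombuffer` starts from a buffer-protocol object.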

jakirkham avatar Sep 16 '21 19:09 jakirkham

One could do np.asarray(memoryview(buf)).view(fmt) for example

Equivalent to np.asarray(memoryview(buf), dtype=fmt)?

I think I understand the use case, but there's no way to get an array that's untyped in the API, so the "reinterpret memory" use case seems quite niche. And I expect that there will be libraries that don't allow this kind of thing, because memory layout is an implementation detail not exposed to the user. So I'm leaning towards "out of scope" here.

It seems like serialization falls under I/O, which is out of scope completely.

rgommers avatar Sep 16 '21 20:09 rgommers

One could do np.asarray(memoryview(buf)).view(fmt) for example

Equivalent to np.asarray(memoryview(buf), dtype=fmt)?

Not if dtype=... means .astype(...). I think this gets back into our discussion earlier.

Maybe a short example helps? Imagine b is received over the wire along with relevant metadata. The data is three float32 numbers (IOW Out[3] is what we want).

In [1]: import numpy as np

In [2]: b = b"\x00\x00\x00\x00\x00\x00\x80?\x00\x00\x00@"

In [3]: np.asarray(memoryview(b)).view(np.float32)
Out[3]: array([0., 1., 2.], dtype=float32)

In [4]: np.asarray(memoryview(b), dtype=np.float32)
Out[4]: 
array([  0.,   0.,   0.,   0.,   0.,   0., 128.,  63.,   0.,   0.,   0.,   64.], dtype=float32)

I think I understand the use case, but there's no way to get an array that's untyped in the API, so the "reinterpret memory" use case seems quite niche.

In our usual case it is not so much that the data is untyped, but the type doesn't necessarily match what it should. Taking the example above, we have...

In [6]: np.asarray(memoryview(b)).dtype
Out[6]: dtype('uint8')

IOW we often have something that is uint8 or int8.

And I expect that there will be libraries that don't allow this kind of thing, because memory layout is an implementation detail not exposed to the user. So I'm leaning towards "out of scope" here.

For clarity, am not looking to manipulate the underlying memory in any way and don't really care how it is represented. Am just trying to patch on the correct formatting. Another way to think of this would be altering the dtype DLPack might use. Suppose one could hack around with the DLPack representation before it goes through the protocol, but that feels a bit clumsy.

It seems like serialization falls under I/O, which is out of scope completely.

It is certainly useful in I/O contexts (communication, file I/O, etc.). Though am not really looking for the protocol to handle the I/O portion or even serialization. Just the ability to perform this cast.

jakirkham avatar Sep 16 '21 21:09 jakirkham

Thanks, that is helpful. The "it has the wrong dtype" has come up in at least one other place I think, using DLPack to transfer bool arrays - those weren't supported, so it was done as uint8.
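For illustration, that kind of round trip can be sketched in NumPy as below. This is an assumption about what such a workaround might look like, not the code any DLPack consumer actually used:

```python
import numpy as np

mask = np.array([True, False, True])

# Both dtypes are one byte wide, so the view only changes the dtype metadata.
as_bytes = mask.view(np.uint8)
restored = as_bytes.view(np.bool_)

print(as_bytes)                          # [1 0 1]
print(restored)                          # [ True False  True]
print(np.shares_memory(mask, restored))  # True
```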

I think the next step here is to figure out how other array libraries do this (if they allow it).

rgommers avatar Sep 16 '21 21:09 rgommers

Another use case for reinterpretation is ability to convert to and from the underlying byte representation of floating-point numbers.

This is common in the implementation of transcendental functions, where you want to manipulate the underlying bits of an IEEE 754 floating-point number directly. Go, e.g., provides dedicated APIs for such reinterpretation (Float64bits and Float64frombits, albeit only operating on a single number). JavaScript exposes an ArrayBuffer from which typed array views can be instantiated, allowing floating-point <=> bits reinterpretation.

The ability to reinterpret the underlying memory (i.e., have a data "view") can certainly be useful in certain classes of numerical algorithms and when you want to vectorize operations. The ability to reinterpret without needing to perform a copy would afford performance benefits.
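As a rough NumPy illustration of this kind of bit-level access (a sketch using `ndarray.view`, which is NumPy-specific and not part of the specification), the sign and unbiased exponent of float64 values can be read by reinterpreting them as uint64:

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5], dtype=np.float64)

# Reinterpret the same 8 bytes per element as unsigned integers (no copy).
bits = x.view(np.uint64)

# IEEE 754 binary64 layout: 1 sign bit, 11 exponent bits, 52 fraction bits.
sign = bits >> np.uint64(63)
exponent = ((bits >> np.uint64(52)) & np.uint64(0x7FF)).astype(np.int64) - 1023

print(sign)      # [0 1 0]
print(exponent)  # [ 0  1 -1]

# Going back is the same reinterpretation in the other direction.
assert np.array_equal(bits.view(np.float64), x)
```

This is the vectorized analogue of Go's Float64bits and Float64frombits: the whole array is reinterpreted at once, and `bits` shares memory with `x`.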

Currently, the only way to achieve reinterpretation according to the specification is via either (1) manual iteration and data copy or (2) a combination of __dlpack__ and from_dlpack (see interchange), which may or may not involve data copy.

kgryte avatar Sep 20 '21 09:09 kgryte

cc @seberg (in case you have thoughts on this one :)

jakirkham avatar Oct 05 '22 22:10 jakirkham

For the use-case of reading blobs from the buffer protocol, I prefer the frombuffer API. OTOH, I guess Dask cannot export buffers and it doesn't match well for a "reinterpret cast" of an existing array. So there may be need for view as well (which is a bit more generic I guess?), although it seems less important.

seberg avatar Oct 06 '22 07:10 seberg

Think the main value of view is that it allows reinterpreting an existing array while knowing the resulting array type will be the same (the dtype is ofc changed).

Whereas with frombuffer, asarray, etc., one needs to know the type of the array to call the right function. With a method, this confusion can be avoided.

jakirkham avatar Oct 06 '22 08:10 jakirkham

As this proposal is currently without a champion, I'll go ahead and close.

kgryte avatar Jun 29 '23 08:06 kgryte