
Discussion: DLPack requires 256B alignment of data pointers

Open · leofang opened this issue · 19 comments

Recently we found that DLPack has this requirement noted in the header: https://github.com/dmlc/dlpack/blob/a02bce20e2dfa0044b9b2ef438e8eaa7b0f95e96/include/dlpack/dlpack.h#L139-L141. Would this be an issue for all adopting libraries? As far as I know, CuPy doesn't do any alignment check and takes the pointer (DLTensor.data) as is, and IIRC quite a few other libraries do the same.

cc: @seberg @rgommers @tqchen @kmaehashi @oleksandr-pavlyk @jakirkham @edloper @szha

leofang · Oct 25 '21 14:10

I have a vague memory of this alignment requirement being pointed out before and the answer being that it's unnecessary and can be ignored. Some digging turned up https://github.com/numpy/numpy/issues/19013#issuecomment-841745443, where you (@leofang) said: "please ignore the alignment stuff. If we are talking about host-device copies, it's an implementation detail (of CUDA, HIP, or any sane memory pool implementation sitting on top). CUDA/HIP's copy APIs do not care about alignments. DLPack does not. CUDA Array Interface does not. No one does."

rgommers · Oct 25 '21 14:10

Thanks, Ralf, yes, now I recall; I guess it's a rediscovery then 😅 Maybe I just need to add a note to this page for better visibility? https://data-apis.org/array-api/latest/design_topics/data_interchange.html

leofang · Oct 25 '21 14:10

This is certainly something worth discussing.

To give some background on the original rationale in DLPack:

Such a requirement is not necessary for host/device copies of any kind; as rightly pointed out, copy APIs do not care about alignment. The alignment spec comes in handy when we build kernels that leverage these data structures, as vectorization usually imposes stricter alignment requirements.

Of course there is a need to support sliced arrays, which do not have the proper alignment. A similar problem arises in the context of OpenCL, where the data pointer is opaque (you cannot add an offset directly to that field).

The byte_offset field was meant to enable such offsets in the context of OpenCL, and in other cases where offsetting can produce an unaligned array.
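To make this concrete, here is a minimal sketch of a producer describing a 1-D float32 slice this way (illustrative only; the helper name is invented, and it assumes a dlpack.h recent enough to use the DLDevice naming). The data pointer stays at the aligned allocation base and the slice is carried entirely by byte_offset:

    #include <dlpack/dlpack.h>
    #include <stdint.h>

    /* Hypothetical producer helper: describe x[start : start+len] of a
     * float32 buffer. `data` stays at the (aligned) allocation base and
     * the slice is expressed through byte_offset instead of bumping the
     * pointer, which also works when `data` is an opaque handle. */
    static void fill_sliced_view(DLTensor *t, void *base, int64_t *shape_storage,
                                 int64_t start, int64_t len) {
        shape_storage[0] = len;
        t->data = base;                           /* aligned allocation base */
        t->byte_offset = (uint64_t)start * sizeof(float);
        t->device = (DLDevice){kDLCPU, 0};
        t->ndim = 1;
        t->dtype = (DLDataType){kDLFloat, 32, 1};
        t->shape = shape_storage;
        t->strides = NULL;                        /* compact layout */
    }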

tqchen · Oct 25 '21 15:10

The alignment spec comes in handy when we build kernels that leverage these data structures, as vectorization usually imposes stricter alignment requirements.

Naive question: is there any advantage to enforcing 256-byte alignment plus always checking byte_offset, over not requiring alignment at all, or only requiring alignment on a more "normal" boundary? 256 bytes is pretty unusual, I think; the linked issue, for example, explains that Arrow requires 8-byte and recommends 64-byte alignment. I can't remember seeing anything above 64 bytes elsewhere.

And if there is such an advantage, I guess you're proposing to start checking it in each implementation? This may require a slow path, emitting warnings for at least a year before starting to raise an exception.

rgommers · Oct 25 '21 21:10

Larger alignment generally has the advantage of making vectorization simpler (in the presence of AVX2 or AVX-512) in the case of CPU. Similarly, in the case of GPU, it can enable more aligned loads, e.g. float4. The particular number was picked mainly because it was CUDA's malloc default alignment.
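To illustrate the vectorization point with a CPU sketch (mine, not code from any library discussed here): AVX aligned loads fault on pointers that are not 32-byte aligned, so a kernel that can rely on a generous export alignment can drop the runtime check and the unaligned fallback shown below:

    #include <immintrin.h>
    #include <stdint.h>

    /* Sum n floats (n a multiple of 8) with AVX. _mm256_load_ps requires a
     * 32-byte-aligned pointer, while _mm256_loadu_ps tolerates any
     * alignment but may be slower. A strong alignment guarantee on the
     * exported pointer lets kernels take the aligned path unconditionally. */
    float sum_f32(const float *p, int64_t n) {
        __m256 acc = _mm256_setzero_ps();
        int aligned = ((uintptr_t)p % 32) == 0;
        for (int64_t i = 0; i < n; i += 8)
            acc = _mm256_add_ps(acc, aligned ? _mm256_load_ps(p + i)
                                             : _mm256_loadu_ps(p + i));
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        return lanes[0] + lanes[1] + lanes[2] + lanes[3]
             + lanes[4] + lanes[5] + lanes[6] + lanes[7];
    }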

I agree that starting with a warning would be useful, given that most implementations do not currently make such an assumption.

tqchen · Oct 26 '21 02:10

Of course there is a need to support sliced arrays, which do not have the proper alignment.

Yes, so why even discuss enforcing any alignment beyond the itemsize? Slicing is a central feature, so almost all libraries have to anticipate data that is at most itemsize-aligned. I don't understand how you can possibly do any more than recommend allocating data with a larger alignment (e.g. because certain compute libs may not accept the data otherwise – which is fair).

Calculating alignment, and rejecting data that is not, is not hard! You could also ask the exporter to provide the information (e.g. as a power-of-two alignment – many exporters likely already store it in some form; I would like to for NumPy), or you can expect the importer to calculate it. To be nice: provide a helper for it (which also helps to correctly handle length <= 1 dimensions, as their strides do not matter for alignment/contiguity).
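A sketch of what such a helper could look like (my code; the function names are invented): it computes the largest power-of-two alignment guaranteed for every element address, skipping length <= 1 dimensions since their strides are never multiplied by a nonzero index:

    #include <dlpack/dlpack.h>
    #include <stdint.h>

    /* Largest power of two dividing x (x must be nonzero). */
    static uint64_t pow2_factor(uint64_t x) { return x & (~x + 1); }

    /* Guaranteed alignment, in bytes, of every element address of t.
     * Element addresses are data + byte_offset + sum(index_k * stride_k *
     * itemsize), so the guarantee is the common power-of-two factor of the
     * start address and each byte stride. Dimensions with shape <= 1 are
     * skipped: their stride never multiplies a nonzero index. */
    uint64_t dl_guaranteed_alignment(const DLTensor *t) {
        uint64_t itemsize = ((uint64_t)t->dtype.bits * t->dtype.lanes) / 8;
        uint64_t a = pow2_factor((uint64_t)(uintptr_t)t->data + t->byte_offset);
        if (t->strides == NULL) {
            /* Compact layout: conservatively cap at the itemsize factor,
             * since adjacent elements are itemsize bytes apart. */
            uint64_t f = pow2_factor(itemsize);
            if (f < a) a = f;
        } else {
            for (int i = 0; i < t->ndim; i++) {
                if (t->shape[i] <= 1) continue;
                int64_t s = t->strides[i];
                uint64_t f = pow2_factor((uint64_t)(s < 0 ? -s : s) * itemsize);
                if (f < a) a = f;
            }
        }
        return a;
    }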

seberg · Oct 26 '21 14:10

Thanks @seberg. In the specific case of DLPack, there is also a need to support opaque data pointers (in the case of Vulkan and OpenCL); in those cases byte_offset was used to support slicing natively (while the data pointer itself remained the aligned original allocation).

Just trying to bring up context and background for both cases here, not necessarily arguing for one versus the other.

tqchen · Oct 26 '21 20:10

The use of byte_offset for opaque data pointers is interesting. Do you suggest also using it for non-opaque data pointers, to enforce/allow a bigger allocation-alignment guarantee than data alignment? Is that useful? Scanning the initial link @leofang posted, it sounded like it might be.

How is the current header comment/standard meant to be interpreted? Are we supposed to use byte_offset for non-opaque CPU data pointers or not?

seberg · Oct 26 '21 21:10

Based on a plain reading of the header comment, if the data is not aligned, using byte_offset is certainly encouraged, to unify this with the opaque-pointer situation. Of course, on CPU there is no great advantage, as checking the pointer itself is equally simple.

In reality, I think a lot of the current exchange implementations ignore that constraint, so it becomes an open question whether we encourage such a convention, or allow people to pass in their own alignment preference.

tqchen · Oct 26 '21 22:10

The use of byte_offset for opaque data pointers is interesting. Do you suggest also using it for non-opaque data pointers, to enforce/allow a bigger allocation-alignment guarantee than data alignment? Is that useful? Scanning the initial link @leofang posted, it sounded like it might be.

cc @shwina @jrhemstad

Yes, this would be really helpful. Knowing the base allocation pointer, offset, and alignment often allows for much more efficient code.

kkraus14 · Oct 27 '21 14:10

@kkraus14 OK, that makes sense to me. I am still not sure whether you would expect the importer to calculate both base and non-base pointer alignment. Or should DLPack grow a field to indicate alignment (e.g. for opaque pointers)? If it is padding and alignment, would that mean that the exporter must indicate it anyway (i.e. for padding)?

I suppose alignment could be device-specific; maybe some devices' malloc always provides those guarantees?

Would you suggest changing the dlpack header to explicitly say:

  • All exported data should be at least aligned to (device alignment for the datatype OR itemsize) (LATE EDIT: itemsize alignment is good, but could be slightly "tricky" e.g. for complex longdoubles on CPU)
  • If the data is a view, byte_offset should be used, so that the data pointer represents the full allocation alignment
  • Adding an indicator for the alignment (padding?) of the data pointer could be useful?

seberg · Oct 27 '21 14:10

I am still not sure whether you would expect the importer to calculate both base and non-base pointer alignment.

I would assume that allocations are aligned, which means the base pointer is aligned, but the non-base pointer is not necessarily aligned and is determined by the offset.

Or should DLPack grow a field to indicate alignment (e.g. for opaque pointers)? If it is padding and alignment, would that mean that the exporter must indicate it anyway (i.e. for padding)?

I would love to specify both padding and alignment, but beggars can't be choosers 😄.

I suppose alignment could be device-specific; maybe some devices' malloc always provides those guarantees?

Memory pools that don't necessarily follow the alignment/padding guarantees of the base malloc implementation are super common.

Would you suggest changing the dlpack header to explicitly say:

  • All exported data should be at least aligned to (device alignment for the datatype OR itemsize)

I don't know if we want to go to the level of device alignment given the number of memory pools out there. Requiring the alignment to be a multiple of the itemsize makes sense to me though.

  • If the data is a view, byte_offset should be used, so that the data pointer represents the full allocation alignment

Yes

  • Adding an indicator for the alignment (padding?) of the data pointer could be useful?

Yes, adding both would be ideal and would allow more optimized code

kkraus14 · Oct 27 '21 14:10

Ah, so padding actually means "you are allowed to write over that data" and not just "you can safely read more"? The only thing that makes me a bit unsure about itemsize alignment is that NumPy, I realized, would currently have some trouble guaranteeing it for complex longdouble.

seberg · Oct 27 '21 14:10

Ah, so padding actually means "you are allowed to write over that data" and not just "you can safely read more"?

Ideally, yes. I imagine the read side is more important, but being able to leverage vector loads/stores is typically important for optimizations.

The only thing that makes me a bit unsure about itemsize alignment is that NumPy, I realized, would currently have some trouble guaranteeing it for complex longdouble.

Maybe there's an upper limit beyond which alignment relative to the itemsize doesn't matter anymore? I.e., anything above 64 or 128 bits only needs 64-bit or 128-bit alignment?

kkraus14 · Oct 27 '21 15:10

At this point, all I care about is having a definite answer as to which exports NumPy is supposed to reject due to insufficient alignment (or read-only data, for that matter), and that this answer is added to dlpack.h.

PyTorch has this:

  atDLMTensor->tensor.dl_tensor.byte_offset = 0;

And it supports slicing, right? So, torch can't possibly have export guarantees beyond the itemsize. It seems safe to assume that nobody else cared about it enough to provide alignment beyond what their library guarantees (otherwise, this question would have come up more often).

Now, I don't know what their allocation strategy is, so I don't know whether they even guarantee itemsize-alignment for complex128 on 32-bit platforms. GCC has this to say:

The address of a block returned by malloc or realloc in GNU systems is always a multiple of eight (or sixteen on 64-bit systems). (link)

which guarantees 8 bytes, while complex double needs 16. So even itemsize alignment is probably broken (albeit only for complex doubles).

As for their fromDLPack function just below: it doesn't even check if byte_offset != 0, so maybe we should just always keep it 0 and not think about it for CPU data... (unless you want to open a bug report at pytorch after clarifying this.)
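For reference, the consumer-side handling being discussed amounts to very little code (a sketch, assuming a C-level importer; the function name is invented): the first element lives at data + byte_offset, not at data:

    #include <dlpack/dlpack.h>

    /* A consumer must not assume byte_offset == 0: the array's first
     * element is at data + byte_offset. An importer that ignores the
     * offset silently reads the wrong memory for any producer that
     * actually uses it. */
    static void *dl_data_start(const DLTensor *t) {
        return (char *)t->data + t->byte_offset;
    }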


As far as I am aware there is no versioning for __dlpack__ or __dlpack_device__ and no way to move API/ABI forward without a dance using some kind of try/except. So, while I find it interesting, I will give up discussing what may be nice to have until that is clarified. (Yes, I am cross. I had explicitly warned about this.)

To be painfully clear: I very much still think that DLPack in its current form needs both extension and better extensibility to be the good implicit protocol that data-apis should, in my opinion, aim for. (I really don't hate DLPack, but I don't see that it was ever tuned or vetted for this task.)

With that: sorry, I am going to try and leave you all in peace now... If you want to consider extending DLPack liberally and with versioning, I will be interested (but also happy to stay away to leave you in peace – seriously, just tell me). Until then, I will simply try to stop caring about it.

seberg · Oct 27 '21 20:10

As far as I am aware there is no versioning for __dlpack__ or __dlpack_device__ and no way to move API/ABI forward without a dance using some kind of try/except.

Actually, this is not the first time I have heard this very comment. I do believe we should get this and other DLPack issues (tracked in the meta-issue https://github.com/dmlc/dlpack/issues/74) addressed for the v2021 milestone.

leofang · Oct 27 '21 20:10

Now, I don't know what their allocation strategy is, so I don't know whether they even guarantee itemsize-alignment for complex128 on 32-bit platforms.

PyTorch doesn't support 32-bit platforms AFAIK. It's not impossible that there's a custom armv7 mobile build or some such thing floating around, but certainly there are no wheels, conda packages, or CI for 32-bit platforms.

EDIT: confirmed, trying to build PyTorch for a 32-bit platform will raise an exception immediately. Only libtorch (C++) has a 32-bit build.

As for their fromDLPack function just below: it doesn't even check if byte_offset != 0, so maybe we should just always keep it 0 and not think about it for CPU data...

This may be true for other libraries as well. It seems clear from the discussion above that there's a desire to have alignment better specified and, in the future, enforced. It looks to me like this will need a gradual evolution:

  1. Document better in dlpack.h how to deal with alignment. I opened a PR for that: https://github.com/dmlc/dlpack/pull/83
  2. Fix the consumer functions (from_dlpack) in all known implementations to correctly handle byte offsets, and give warnings for non-aligned input (a sketch of this check follows the list).
  3. Wait at least one release before adjusting the producer functions (__dlpack__ / to_dlpack) to have an aligned base pointer and nonzero byte_offset.
  4. Wait another period of time, then give hard errors in the consumer function.
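A minimal sketch of the warning path in step 2 (my code; the 256-byte constant and the message text are illustrative, not from any implementation):

    #include <dlpack/dlpack.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Step-2-style consumer check: accept unaligned input for now, but
     * warn, so that producers get a deprecation window before step 4
     * turns this into a hard error. */
    static void warn_if_unaligned(const DLTensor *t) {
        if (((uintptr_t)t->data % 256) != 0)
            fprintf(stderr, "DLPack: data pointer is not 256-byte aligned; "
                            "this will become an error in a future release\n");
    }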

rgommers · Nov 09 '21 12:11

The use of byte_offset for opaque data pointers is interesting. Do you suggest also using it for non-opaque data pointers, to enforce/allow a bigger allocation-alignment guarantee than data alignment? Is that useful? Scanning the initial link @leofang posted, it sounded like it might be.

Yes, this would be really helpful. Knowing the base allocation pointer, offset, and alignment often allows for much more efficient code.

It looks like the answer changed here (or at least my reading of it): (xref https://github.com/dmlc/dlpack/pull/83#issuecomment-964382536). Let me try to summarize:

  • Not all libraries have aligned allocators, certainly not to 256 bytes.
  • There is no place currently for the producer to put information about the alignment of the data in the DLPack capsule it is exporting. It cannot be added until there is a version attribute and a way to extend DLPack (a separate issue that needs tackling; as of today, that doesn't exist).
  • The consumer cannot know whether it is safe to read memory between data and data + byte_offset. Hence aligned loads are not possible today unless data + byte_offset happens to be aligned already.
  • And aligned loads for opaque data pointers may not be possible at all, unless there's a device/language guarantee about this (quoting @seberg: "Is a 256 byte alignment guaranteed for any possible opaque pointer?! Because if it is not, I don't see any point for this guarantee"). It is not documented for which devices this is the case.

These are the options:

  1. A1: required alignment. Require the data pointer to always be aligned (using nonzero byte_offset), and do the gradual evolution plan in my comment above.
  2. A2: no alignment. Remove the alignment requirement completely from dlpack.h. No library needs to make any changes (except where current handling of byte_offset is buggy, as @seberg pointed out for PyTorch). NumPy and other new implementers then just always use byte_offset=0 (easiest), and we're done.
  3. A3: optional alignment. Do not require alignment, but add a way to communicate from the producer to the consumer what the alignment of the data is.

The current status is that the fine print in dlpack.h requires alignment (option A1), but no one adheres to it or enforces it. This state is not very useful: it requires a >1 year evolution plan, and apparently there's no gain because of the third bullet above. So it looks like the best choices are either A2 or A3. A3 seems strictly better than A2, and most of the work it requires (versioning/extensibility) is work we wanted to do for other reasons already.

So here's a new proposal:

  • Decide that the long-term desired state is A3: optional alignment
  • NumPy and other new implementers to do whatever is simplest, i.e. to use byte_offset = 0 and data pointing to the first element in memory.
  • Update the comment in dlpack.h about this topic to reflect: current state, desired future state, and a link to a new issue on the DLPack repo with more info (outcome of this discussion to be summarized on that issue).

rgommers · Nov 09 '21 18:11

If we're making longer-term changes around optional alignment in A3, it would be nice to have optional padding as well, so that we know how many bytes past data + byte_offset + size can be safely read, allowing further optimization.
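Purely to illustrate what optional alignment plus optional padding could look like once versioning exists (a hypothetical struct of mine, not an actual DLPack proposal):

    #include <dlpack/dlpack.h>
    #include <stdint.h>

    /* Hypothetical versioned carrier for option A3; not part of dlpack.h.
     * Both hints are optional guarantees from the producer: 0 means
     * "unknown", which consumers must treat like today's no-guarantee
     * situation. */
    typedef struct {
        uint32_t version;   /* would gate the presence of the new fields */
        DLTensor tensor;
        uint64_t alignment; /* power-of-two guarantee for tensor.data, or 0 */
        uint64_t padding;   /* bytes safely readable past the last element, or 0 */
    } DLTensorWithHints;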

kkraus14 · Nov 09 '21 21:11

@leofang Does it make sense to keep this issue open? I'm not sure of the current status, or whether things/perspectives have changed since 2021.

kgryte · Mar 21 '24 07:03