[RFC]: DLPack C Function for Speed Exchange
This is a cross-reference RFC on DLPack-based exchange. As of now, DLPack exchange relies on Python functions such as tensor.__dlpack__(). While they work well for common cases, the overhead of such an exchange is around 0.2-0.3 us for a very well optimized version, and can go up to 0.4-1 us for less optimized implementations.
For a function that takes three arguments, f(a, b, c), assuming we run a DLPack exchange for each argument, the total conversion overhead usually comes to around 0.7-3 us.
While such overhead can be acceptable in many settings, in GPU applications the extra 1-3 us can still be significant. For a kernel that takes 2 us to finish, 0.7 us means roughly 35% additional overhead in execution.
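To make the arithmetic behind these figures explicit, here is a small back-of-the-envelope sketch. It only multiplies the per-argument costs quoted above by the number of arguments and compares the result to the kernel time; the helper names are made up for illustration and nothing here measures a real library.

```python
def total_overhead_us(per_arg_us: float, num_args: int) -> float:
    """Total exchange cost when every argument goes through one DLPack exchange."""
    return per_arg_us * num_args


def relative_overhead(overhead_us: float, kernel_us: float) -> float:
    """Exchange cost expressed as a fraction of the kernel's own run time."""
    return overhead_us / kernel_us


# Per-argument costs quoted in this RFC (microseconds).
best_case = total_overhead_us(0.2, 3)    # well optimized implementation
worst_case = total_overhead_us(1.0, 3)   # less optimized implementation

# A 2 us kernel paying 0.7 us of exchange overhead.
share = relative_overhead(0.7, 2.0)

print(f"three-argument overhead: {best_case:.1f} us to {worst_case:.1f} us")
print(f"overhead relative to a 2 us kernel: {share:.0%}")
```

The point is simply that the cost scales linearly with the number of exchanged arguments, so short kernels with several tensor arguments are hit hardest.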
Recently, we proposed developing a set of specific C functions to help DLPack-based exchange for array libraries that work through C extensions. Please see more context here:
https://github.com/dmlc/dlpack/issues/175
In the context of array-api, it would be useful to help standardize the specific field for such speed exchange:
mypackage.Tensor.__dlpack_c_exchange_api__
Note that the proposed speed exchange function can be used in conjunction with the current DLPack exchange, to gracefully handle fallback cases.
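As a rough illustration of that fallback, the sketch below shows only the dispatch logic a consumer might follow. It assumes the attribute lives on the tensor type under the name proposed above and treats its value as an opaque handle, since the actual contents are still being discussed in the DLPack issue. FastTensor, SlowTensor and pick_exchange_path are hypothetical names, and a real consumer would do this lookup from C rather than Python.

```python
class FastTensor:
    """Hypothetical tensor type that advertises the proposed C exchange field."""
    # Opaque placeholder; the real value is defined by the DLPack proposal, not here.
    __dlpack_c_exchange_api__ = object()

    def __dlpack__(self):
        return "capsule-from-FastTensor"  # stand-in for a real DLPack capsule


class SlowTensor:
    """Hypothetical tensor type that only supports the existing Python protocol."""

    def __dlpack__(self):
        return "capsule-from-SlowTensor"  # stand-in for a real DLPack capsule


def pick_exchange_path(obj):
    """Prefer the C exchange field when the type has it, else fall back to __dlpack__."""
    api = getattr(type(obj), "__dlpack_c_exchange_api__", None)
    if api is not None:
        return "c_api", api
    if hasattr(obj, "__dlpack__"):
        return "python", obj.__dlpack__()
    raise TypeError(f"{type(obj).__name__} does not support DLPack exchange")


print(pick_exchange_path(FastTensor())[0])
print(pick_exchange_path(SlowTensor())[0])
```

Looking the attribute up on the type rather than the instance means a consumer could cache the result per type, so the check itself stays cheap on the hot path.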
This makes sense to me in principle. I believe we always had this in mind: get adoption first, and think about a C API if and when the performance of the Python dunder methods becomes limiting.
When https://github.com/dmlc/dlpack/issues/175 lands in a new DLPack version, I think we can simply reference that in the from_dlpack, __dlpack__ and design_topics/data_interchange docs as a recommendation to implement the C protocol as well.
A single new __c_dlpack_xxx method will probably be preferable over multiple methods, but that's already under discussion in https://github.com/dmlc/dlpack/issues/175 as well.
PR is up in https://github.com/data-apis/array-api/pull/984