ndarray.set() is dramatically slower when the input is a non-contiguous array
Description
.set() is much slower when the input is a slice of a multidimensional array. I think the offending line is https://github.com/cupy/cupy/blob/f0d0e7b2675a232c12e59badd5a6c42c5815ef1c/cupy/_core/core.pyx#L1779
I think calling this function hurts performance because the data has to be reshuffled into a contiguous copy first. Furthermore, if the original array is in pinned memory, this logic doesn't preserve the pinned-memory benefit.
To Reproduce
# Write the code here
Installation
No response
Environment
# Paste the output here
Additional Information
No response
Thanks for the feedback! When copying CPU memory to the GPU, the memory needs to be contiguous, so this is a restriction that is not easy to relax.
cf. #6785
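For readers unfamiliar with the contiguity requirement, here is a minimal host-side sketch using only NumPy (no GPU involved). It shows why a slice of a multidimensional array is not contiguous and that packing it produces a separate buffer, i.e. an extra host copy. The array shapes and variable names are illustrative assumptions, not taken from the original report.

```python
import numpy as np

# A C-ordered 2-D host array: each row is one contiguous run of memory.
full = np.arange(12, dtype=np.float32).reshape(3, 4)

# Slicing columns keeps the parent's row stride, so the view has gaps
# between the end of one row and the start of the next.
view = full[:, :2]
print(view.flags.c_contiguous)    # False
print(view.strides)               # (16, 4)

# ascontiguousarray packs the view into a fresh buffer (an extra host copy).
packed = np.ascontiguousarray(view)
print(packed.flags.c_contiguous)  # True
print(packed.strides)             # (8, 4)
print(np.shares_memory(packed, full))  # False
```

The packed buffer is an ordinary NumPy allocation, so it does not share memory with the source array.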
I wonder if we could write our own version of `ascontiguousarray` that reshuffles an array slice directly onto the GPU by calling `cudaMemcpy` on each contiguous section. Or, at least, a version of `ascontiguousarray` that is backed by pinned memory.
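To make the "copy the contiguous sections" idea concrete, here is a rough host-only sketch. It is not CuPy code: `copy_contiguous_rows` is a hypothetical helper, the destination NumPy buffer merely stands in for device memory, and each loop iteration stands in for one `cudaMemcpy` call. It assumes a 2-D view whose rows are individually contiguous.

```python
import numpy as np


def copy_contiguous_rows(dst, src):
    """Copy a 2-D view into a C-contiguous buffer one row at a time.

    Hypothetical helper for illustration only. Each loop iteration
    stands in for one cudaMemcpy of a contiguous section, and ``dst``
    stands in for device memory. Returns the number of copies issued.
    """
    if src.ndim != 2 or dst.shape != src.shape or dst.dtype != src.dtype:
        raise ValueError("dst and src must be 2-D with equal shape and dtype")
    if not dst.flags.c_contiguous:
        raise ValueError("dst must be C-contiguous")
    if src.strides[1] != src.itemsize:
        raise ValueError("each row of src must be contiguous")

    n_copies = 0
    for i in range(src.shape[0]):
        # One contiguous run of src.shape[1] elements per row.
        dst[i, :] = src[i, :]
        n_copies += 1
    return n_copies


host = np.arange(20, dtype=np.float64).reshape(4, 5)
view = host[:, 1:4]                      # non-contiguous slice
device_stand_in = np.empty(view.shape, dtype=view.dtype)
n = copy_contiguous_rows(device_stand_in, view)
print(n)
```

This avoids the intermediate packed copy on the host, at the cost of one transfer per row. Whether that trade-off pays off on a real device would depend on the row length and per-call overhead, which this sketch does not measure.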
> Or if we can at least write a version of ascontiguous which is backed by pinned memory.
Yes, I agree pinned memory should be used here. We'll work on this in #6785.