cupy ndarray.set() is dramatically slower when the input is a non-contiguous array

Description

.set() is much slower when the input is a slice of a multidimensional array. I think the offending line is https://github.com/cupy/cupy/blob/f0d0e7b2675a232c12e59badd5a6c42c5815ef1c/cupy/_core/core.pyx#L1779

I think using this function hurts performance because it has to reshuffle the data into a new contiguous buffer first. Furthermore, if the original array is in pinned memory, the reshuffled copy is not, so the pinned-memory benefit is lost.
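To illustrate the reshuffle described above, here is a minimal NumPy-only sketch. It assumes the copy in question behaves like `np.ascontiguousarray`; the array shape and variable names are made up for the example and are not from the original report.

```python
import numpy as np

# Hypothetical host array; the shape is chosen only for illustration.
host = np.arange(12, dtype=np.float32).reshape(3, 4)

# Slicing columns yields a view whose rows are no longer adjacent in memory.
view = host[:, :2]
print(view.flags.c_contiguous)  # False

# Making it contiguous allocates a fresh buffer and copies every element.
packed = np.ascontiguousarray(view)
print(packed.flags.c_contiguous)       # True
print(np.shares_memory(packed, host))  # False
```

The new buffer comes from an ordinary host allocation, which is why any pinned-memory property of the original array would not carry over to it.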

Jun 04 '22 00:06 kkotyk

Thanks for the feedback! When copying CPU memory to the GPU, the source memory needs to be contiguous, so this restriction is not easy to relax.

cf. #6785

Jun 14 '22 09:06 kmaehashi

I just wonder if we can write our own version of ascontiguous that copies an array slice directly onto the GPU by calling cudaMemcpy on each contiguous section, without building an intermediate contiguous host buffer. Or if we can at least write a version of ascontiguous which is backed by pinned memory.
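A minimal sketch of the "copy the contiguous sections" idea, under stated assumptions: it handles only a 2-D view whose rows are each contiguous, and a plain NumPy array stands in for the device buffer so the logic can be checked without a GPU. The helper name `copy_by_rows` is hypothetical. On a real device each row assignment would correspond to one contiguous transfer (for example a cudaMemcpy per row, or a single cudaMemcpy2D for the whole pitched region).

```python
import numpy as np

def copy_by_rows(dst, src):
    """Copy a 2-D view row by row; each row is one contiguous transfer."""
    if src.ndim != 2 or src.strides[1] != src.itemsize:
        raise ValueError("expected a 2-D view with contiguous rows")
    if dst.shape != src.shape:
        raise ValueError("destination and source shapes differ")
    for i in range(src.shape[0]):
        # The row is contiguous, so no intermediate packed buffer is needed.
        dst[i, :] = src[i, :]
    return dst

host = np.arange(12, dtype=np.float32).reshape(3, 4)
view = host[:, :2]  # non-contiguous as a whole, contiguous per row

# Stand-in for GPU memory (assumption made for this illustration only).
device_standin = np.empty(view.shape, dtype=view.dtype)
copy_by_rows(device_standin, view)
```

Whether this beats one packed copy depends on the row length: many tiny per-row transfers can cost more in launch overhead than a single large one, so this is a trade-off rather than a guaranteed win.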

Jun 14 '22 23:06 kkotyk

Or if we can at least write a version of ascontiguous which is backed by pinned memory.

Yes, I agree pinned memory should be used here. We'll work on this in #6785.

Jun 15 '22 05:06 kmaehashi