micro benchmarking assign expressions on all platforms
There are some backend / hardware combinations where an explicitly launched kernel doing a simple 1D operation, like `a[i] = 2 * b[i]`, is faster than the equivalent gtensor expression, `a = 2 * b`. This needs to be explored further and optimization techniques considered, and possibly reproducers sent to GPU vendors (ideally ported to the underlying GPU vendor programming model rather than going through all of gtensor, when possible).
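For reference, a rough sketch of what the comparison could look like (this is not the benchmark code from #248): the `to_kernel()` / `gt::launch` / `GT_LAMBDA` usage follows gtensor's README, but the allocation, copy, and synchronization helpers used here (`gt::empty_device`, `gt::copy`, `gt::synchronize`) and the timing loop are assumptions for illustration.

```cpp
#include <chrono>
#include <cstdio>

#include <gtensor/gtensor.h>

// Time an operation by running it `reps` times between device barriers;
// the name of the synchronize call is an assumption here.
template <typename F>
double time_ms(F&& f, int reps = 100)
{
  f(); // warm-up launch, not timed
  gt::synchronize();
  auto start = std::chrono::high_resolution_clock::now();
  for (int r = 0; r < reps; r++) {
    f();
  }
  gt::synchronize();
  auto end = std::chrono::high_resolution_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count() / reps;
}

int main()
{
  const int n = 1 << 24;

  // fill a host array and copy it to the device operand
  gt::gtensor<double, 1> h_b(gt::shape(n));
  for (int i = 0; i < n; i++) {
    h_b(i) = static_cast<double>(i);
  }

  auto a = gt::empty_device<double>(gt::shape(n));
  auto b = gt::empty_device<double>(gt::shape(n));
  gt::copy(h_b, b);

  // (1) expression assignment: gtensor generates the kernel from the expression
  double t_expr = time_ms([&]() { a = 2.0 * b; });

  // (2) explicit 1d launch with the equivalent hand-written body
  auto k_a = a.to_kernel();
  auto k_b = b.to_kernel();
  double t_launch = time_ms([&]() {
    gt::launch<1>(a.shape(), GT_LAMBDA(int i) { k_a(i) = 2.0 * k_b(i); });
  });

  printf("n=%d  expression: %.4f ms  explicit launch: %.4f ms\n", n, t_expr,
         t_launch);
  return 0;
}
```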
There are also potential issues when the sizes of the array dimensions are not multiples of the warp size of the underlying architecture; see the size sweep sketched below.
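One way to exercise that case (illustrative only) is to sweep over sizes that are and are not multiples of the warp/wavefront width (32 on NVIDIA, 64 on AMD), so that the last warp of the launch is only partially filled:

```cpp
// Illustrative size sweep: the first of each pair is a multiple of both 32
// and 64; the second leaves a partially filled last warp/wavefront.
for (int n : {1 << 20, (1 << 20) + 1, 192 * 1024, 192 * 1024 + 17}) {
  // allocate a and b at size n and time both variants as above ...
}
```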
See #248, which adds benchmarks for exploring this.