
Add support for strided tensors

Open protonu opened this issue 7 years ago • 11 comments

This commit starts support for strided tensors. I made changes to percolate a vector in TensorInfo down to emitCudaKernel so that codegen can cast strided tensors. This required updating a unit test to expect the correct cast.
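To make the intent concrete, here is a rough illustration (not the exact code emitCudaKernel emits; the extents below stand in for JIT-specialized constants) of how the cast differs between a contiguous and a strided 2-D tensor:

```cpp
// Illustrative sketch only; not the code emitCudaKernel actually produces.
void illustrateCasts(float* pA, float* pB) {
  // Contiguous 2-D tensor, sizes [M, N] = [128, 64]:
  // A[i][j] addresses pA[i * 64 + j].
  float(*A)[64] = reinterpret_cast<float(*)[64]>(pA);

  // Strided tensor with the same sizes but strides [80, 1] (e.g. a sub-view
  // of a wider allocation): the inner extent of the cast must be the stride,
  // not the size, so that B[i][j] addresses pB[i * 80 + j].
  float(*B)[80] = reinterpret_cast<float(*)[80]>(pB);

  (void)A;
  (void)B;
}
```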

protonu avatar Jun 07 '18 22:06 protonu

After looking a bit deeper at the impl, I am not a fan of having 2 ways of specifying the strides during codegen (both Halide and TensorInfo), so we should choose one. Worse, the mechanisms themselves may be incorrect: the name of the generated kernel will not vary with strides, but the content of the kernel will be hardcoded for them.

So I think I would go the following route, but this probably requires a second set of eyes cc @abadams @ftynse:

  1. since Halide can represent strides, let's use Halide
  2. add new parameters for each stride in translateParam and translateOutput (in tc2halide.cc) and make those the strides of the Halide ImageParam and OutputImageParam (see the sketch after this list)
  3. update computeParamValueMap in halide_utils.cc to tie the JIT stride information to the Halide inputs
  4. update codegen to take input strides into account
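A minimal sketch of step 2, assuming we expose one scalar Halide parameter per dimension (the helper and parameter names are illustrative, not the actual tc2halide.cc code):

```cpp
#include <string>
#include "Halide.h"

// Hypothetical helper: declare an input whose stride in every dimension is a
// symbolic parameter rather than a hardcoded constant.
Halide::ImageParam makeStridedInput(const std::string& name, int ndim) {
  Halide::ImageParam img(Halide::Float(32), ndim, name);
  for (int d = 0; d < ndim; ++d) {
    // One new scalar parameter per dimension; it would be bound at JIT time
    // (cf. computeParamValueMap) and passed with each kernel call.
    Halide::Param<int> stride(name + "_stride_" + std::to_string(d));
    img.dim(d).set_stride(stride);
  }
  return img;
}
```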

This will increase the dimension of the parameter space for ISL; it will also increase the specializedName and the list of values we send to each kernel call.

On the other hand, this will allow us to go to a parametric stride version since strides will be passed to the kernel.

So this is more complex than I originally anticipated for a bootcamp task, sorry about that.

nicolasvasilache avatar Jun 08 '18 17:06 nicolasvasilache

Because Scop already owns the description of in/out tensors (through Halide in/out parameters), it makes sense for it to also own the stride information. That is more consistent than passing this information to codegen separately, which leaves room for mismatched size or stride errors.

This will increase the dimension of the parameter space for ISL; it will also increase the specializedName and the list of values we send to each kernel call.

And we probably need to introduce new parameters due to polynomial strides. That is, for A[N][M][K], the strides are M*K, K, 1 and you cannot express M*K as an affine function. So you need another parameter _M_times_K and some external knowledge that _M_times_K === M * K.
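To make that concrete, a hypothetical runtime-side sketch (bindStrideParams and the parameter names are purely for exposition): the model only sees the opaque parameter, and its value is bound when preparing the kernel arguments.

```cpp
#include <map>
#include <string>

// Hypothetical binding of parameter values at kernel-launch time.
std::map<std::string, long> bindStrideParams(long M, long K) {
  std::map<std::string, long> params = {{"M", M}, {"K", K}};
  // M * K is not an affine function of M and K, so it travels as an
  // independent parameter; its value is computed outside the model.
  params["_M_times_K"] = M * K;
  return params;
}
```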

ftynse avatar Jun 09 '18 10:06 ftynse

On Fri, Jun 08, 2018 at 10:03:18AM -0700, Nicolas Vasilache wrote:

So I think I would go the following route, but this probably requires a second set of eyes cc @abadams @ftynse:

  1. since Halide can represent strides, let's use Halide

I don't agree with this reasoning. Just because we can do something doesn't mean we need to do it.

This will increase the dimension of the parameter space for ISL; it will also increase the specializedName and the list of values we send to each kernel call.

Hold on! Why does this need to be encoded as isl parameters?

skimo

skimo-openhub avatar Jun 12 '18 12:06 skimo-openhub

I don't agree with this reasoning. Just because we can do something doesn't mean we need to do it.

Yeah, that's essentially my motto. Who said we need to do it this way? I am putting forward a strawman proposal; feel free to propose something better. The reasoning behind using Halide for strides is that we are already using Halide at this level and that Halide already supports strides; we have just not used them. What's your alternative?

Hold on! Why does this need to be encoded as isl parameters?

They definitely don't need to; it is a proposal that followed the code flow as I was reading it and that would work with minimal changes. But if having these extra parameters passed through ISL is a problem, then by all means let's have separate storage for them.

nicolasvasilache avatar Jun 12 '18 13:06 nicolasvasilache

At some point I will also reconsider pulling in this piece of code we had been using in ancient days. There is no particular reason to limit ourselves to subarray semantics; we just need the no-alias property.

nicolasvasilache avatar Jun 12 '18 13:06 nicolasvasilache

Also, note that there is a tradeoff:

  1. hardcoding the strides in the kernel, which forces us to recompile each time the strides change;
  2. using parametric strides; but I think CUDA still does not support C99-style variable length arrays, so that would force us to pull in the DeviceTensor abstraction, which OTOH buys us more flexible striding support (see the sketch below)
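A minimal sketch of what a DeviceTensor-style view could look like (the real abstraction is richer, and in CUDA the accessor would carry __host__ __device__ qualifiers); it avoids C99 variable length arrays by carrying sizes and strides explicitly:

```cpp
// Illustrative only; not the actual DeviceTensor implementation.
template <typename T, int Dim>
struct StridedView {
  T* data;
  int sizes[Dim];    // extents of the view
  int strides[Dim];  // element spacing per dimension

  // Element (idx[0], ..., idx[Dim-1]) lives at offset sum(idx[d] * strides[d]).
  T& operator()(const int (&idx)[Dim]) const {
    int offset = 0;
    for (int d = 0; d < Dim; ++d) {
      offset += idx[d] * strides[d];
    }
    return data[offset];
  }
};

// Usage: StridedView<float, 2> A{ptr, {128, 64}, {80, 1}};
//        A({i, j}) addresses ptr[i * 80 + j].
```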

nicolasvasilache avatar Jun 12 '18 13:06 nicolasvasilache

On Tue, Jun 12, 2018 at 06:51:13AM -0700, Nicolas Vasilache wrote:

I don't agree with this reasoning. Just because we can do something doesn't mean we need to do it.

Yeah, that's essentially my motto. Who said we need to do it this way? I am putting forward a strawman proposal; feel free to propose something better. The reasoning behind using Halide for strides is that we are already using Halide at this level and that Halide already supports strides; we have just not used them. What's your alternative?

I haven't looked at it in detail, so I can't propose anything for now. Note that I don't necessarily disagree with the conclusion. I was only commenting on the reasoning.

Hold on! Why does this need to be encoded as isl parameters?

They definitely don't need to; it is a proposal that followed the code flow as I was reading it and that would work with minimal changes. But if having these extra parameters passed through ISL is a problem, then by all means let's have separate storage for them.

I don't see any reason at this point why the strides should be made available as parameters, but a proper motivation might be able to convince me.

skimo

skimo-openhub avatar Jun 12 '18 14:06 skimo-openhub

On Tue, Jun 12, 2018 at 06:57:31AM -0700, Nicolas Vasilache wrote:

Also, note that there is a tradeoff:

  1. hardcoding the strides in the kernel, which forces us to recompile each time the strides change;

Doesn't changing the strides usually change the sizes too? That is, if you only look at the even elements, then from the point of view of the polyhedral model the array is only half as big, and this may make a difference for the promotion decisions.

skimo

skimo-openhub avatar Jun 12 '18 14:06 skimo-openhub

Doesn't changing the strides usually change the sizes too?

That's because "strides" is an overloaded term; in this instance polyhedral folks have been using it improperly IMO. Sizes and strides are totally unrelated; you can mix and match them arbitrarily. In ML, a tensor is almost always defined as a "view" over a contiguous block of memory.

A Torch/DLPack/Numpy/other tensor is a view with an initial offset and a list of sizes and strides (of the same length). The sizes (sometimes also called shape) determine the number of elements in the view; the strides determine the spacing between consecutive elements in the view.

Here is an example: T: offset=3, sizes=[2, 2], strides=[3, 7] refers to element (i, j) at offset 3 + i * 3 + j * 7 (notice the potential for aliasing if the tensor had > 21 elements):

(0, 0) -> @(3)
(0, 1) -> @(10)
(1, 0) -> @(6)
(1, 1) -> @(13)

Strides corresponding to a C99 variable length array are a special case: int foo[M][N][P] has a canonical view representation of offset=0, sizes=[M, N, P], strides=[N * P, P, 1] in an M*N*P allocation.
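A sketch of that canonical "contiguous" stride computation (illustrative helper, not a TC function): strides[d] is the product of all sizes to the right of dimension d.

```cpp
#include <vector>

// Compute the contiguous (C99-style) strides for a given list of sizes.
std::vector<long> contiguousStrides(const std::vector<long>& sizes) {
  std::vector<long> strides(sizes.size(), 1);
  for (int d = static_cast<int>(sizes.size()) - 2; d >= 0; --d) {
    strides[d] = strides[d + 1] * sizes[d + 1];
  }
  return strides;
}
// Example: contiguousStrides({M, N, P}) yields {N * P, P, 1}.
```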

Such a view of a memory region has many properties that people like to use. For instance, a transpose just consists of permuting two strides (and the corresponding sizes); it is a pure metadata transformation and requires no memory copy (contiguity and performance considerations aside, of course).
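For instance, a minimal 2-D version of "transpose as a metadata transformation" (View2D is an illustrative struct, not a TC type):

```cpp
#include <utility>

struct View2D {
  long offset;
  long sizes[2];
  long strides[2];
};

// Transpose: swap the two sizes and the two strides; no data is touched.
View2D transpose(View2D v) {
  std::swap(v.sizes[0], v.sizes[1]);
  std::swap(v.strides[0], v.strides[1]);
  return v;
}
```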

People also like to "re-view" tensors (traditionally called the reshape operation); for instance, a 4-D output of a batched 2-D convolution can be reshaped (stride contiguity permitting) into a 2-D matrix that can be passed to a GEMM operation. This happens in many places in neural networks.

Of all these properties, we only care about non-aliasing, for obvious correctness reasons. Like many interesting abstractions, people can abuse them: many reshape/expand/narrow/slice/transpose operations that you may find in neural network implementations are unnecessary; sometimes they are only there to make a computation fit a library call (look at the BLAS GEMM API and you'll see similar stride concepts: lda, ldb, ldc IIRC).

Lastly, note that the type of compact representation polyhedral folks are accustomed to is called a "contiguous" tensor (i.e. sizes=[M, N, P], strides=[N * P, P, 1] in an allocation of size >= M*N*P), and such tensors can safely be cast to a C99 array.

So the classification is C99 array == contiguous ML tensor <= non-aliasing ML tensor <= ML tensor

Makes sense?

nicolasvasilache avatar Jun 12 '18 14:06 nicolasvasilache

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours has expired.

Before we can review or merge your code, we need you to email [email protected] with your details so we can update your status.

facebook-github-bot avatar Jul 25 '18 21:07 facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

facebook-github-bot avatar Oct 07 '20 15:10 facebook-github-bot