llvm icon indicating copy to clipboard operation
llvm copied to clipboard

Enable `vec` class for `bfloat16`

Open densamoilov opened this issue 2 years ago • 5 comments

Currently, sycl::vec class doesn't work with bfloat16 data type. This seems to be caused by missing converters. Is there a plan to fix it?

Reproducer:

int main () {
    sycl::vec<sycl::ext::oneapi::experimental::bfloat16, 8> v(10.0f);
    return 0;
}

Error:

/compiler/latest/linux/bin-llvm/../include/sycl/CL/sycl/types.hpp:564:24: error: implicit instantiation of undefined template 'sycl::detail::BaseCLTypeConverter<sycl::ext::oneapi::experimental::bfloat16, 8>'
      typename detail::BaseCLTypeConverter<DataT, NumElements>::DataType;
                       ^
v.cpp:4:61: note: in instantiation of template class 'sycl::vec<sycl::ext::oneapi::experimental::bfloat16, 8>' requested here
    sycl::vec<sycl::ext::oneapi::experimental::bfloat16, 8> v(10.0f);
                                                            ^
/compiler/latest/linux/bin-llvm/../include/sycl/CL/sycl/types.hpp:120:36: note: template is declared here
template <typename T, int N> class BaseCLTypeConverter;
                                   ^
1 error generated.

densamoilov avatar Jul 13 '22 20:07 densamoilov

Does your use case require vec or can you use marray? If so why?

rolandschulz avatar Jul 14 '22 15:07 rolandschulz

I use vec because it provides interfaces to load/store/convert data. Also, SYCL specification says that vec is lowered to the backend specific vector types, which I believe should have positive performance impact.

densamoilov avatar Jul 15 '22 02:07 densamoilov

I use vec because it provides interfaces to load/store/convert data.

We probably should add a way to convert data (e.g. explicit cast marray of different types). Why use load/store rather then just dereference?

SYCL specification says that vec is lowered to the backend specific vector types, which I believe should have positive performance impact.

It says "where possible". For our implementation if sub-group size is >1 we vectorize across work-items and you don't normally benefit from vector types. If you actual see any benefit I would be very interested in that data. And if you use sub-group size =1 you currently get best performance with ESIMD type not the vec type.

rolandschulz avatar Jul 15 '22 16:07 rolandschulz

I'm mostly interested in writing SYCL kernels for NVIDIA (CUDA backend) and AMD (HIP backend).

It says "where possible"

Right, and CUDA and HIP backends have vector types so I expect vec to be mapped to them. Also the specification doesn't say that about marray so I assume marray may not be as efficient as vec.

Why use load/store rather then just dereference?

I'm not sure what you mean by "just dereference". If I use vec then I expect load/store operation to be implemented more efficiently then just filling vec with [] operator. Since I'm interested in implementing memory intensive operations I need load/store operations to be as efficient as possible.

For our implementation if sub-group size is >1 we vectorize across work-items and you don't normally benefit from vector types.

Is this true for CUDA and HIP backends?

If you actual see any benefit I would be very interested in that data.

I remember I had performance issues when I didn't use vec for CUDA backend. But I don't have data now. If you think vec is redundant in such cases then I can make some experiments.

densamoilov avatar Jul 15 '22 18:07 densamoilov

I've looked into similar applications and I think there is an inherent problem with the way marray is implemented; on the contrary to built-in vector types the alignment of marray follows the same rules as std::array, which makes it impossible to vectorise loads and stores to/from it. PTX ISA spec mandates that:

By default, vector variables are aligned to a multiple of their overall size (vector length times base-type size), to enable vector load and store instructions which require addresses aligned to a multiple of the access size.

As a result load store vectorizer sees those accesses as misaligned and bails out; the compiler then outputs scalar loads and stores that @densamoilov observed.

jchlanda avatar Jul 18 '22 13:07 jchlanda

Initial support (means no support for math built-ins with vec<bfloat16>) was added in #12261.

@densamoilov, I will close this feature request as complete, but please report any issues you have found. @cperkinsintel: FYI

AlexeySachkov avatar Mar 18 '24 16:03 AlexeySachkov