Enable `vec` class for `bfloat16`
Currently, the `sycl::vec` class doesn't work with the `bfloat16` data type. This seems to be caused by missing converters. Is there a plan to fix it?
Reproducer:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::vec<sycl::ext::oneapi::experimental::bfloat16, 8> v(10.0f);
  return 0;
}
```
Error:

```
/compiler/latest/linux/bin-llvm/../include/sycl/CL/sycl/types.hpp:564:24: error: implicit instantiation of undefined template 'sycl::detail::BaseCLTypeConverter<sycl::ext::oneapi::experimental::bfloat16, 8>'
    typename detail::BaseCLTypeConverter<DataT, NumElements>::DataType;
                       ^
v.cpp:4:61: note: in instantiation of template class 'sycl::vec<sycl::ext::oneapi::experimental::bfloat16, 8>' requested here
  sycl::vec<sycl::ext::oneapi::experimental::bfloat16, 8> v(10.0f);
                                                            ^
/compiler/latest/linux/bin-llvm/../include/sycl/CL/sycl/types.hpp:120:36: note: template is declared here
template <typename T, int N> class BaseCLTypeConverter;
                                   ^
1 error generated.
```
Does your use case require `vec`, or can you use `marray`? If so, why?
I use `vec` because it provides interfaces to load/store/convert data. Also, the SYCL specification says that `vec` is lowered to backend-specific vector types, which I believe should have a positive performance impact.
> I use `vec` because it provides interfaces to load/store/convert data.
We should probably add a way to convert data (e.g. an explicit cast between `marray`s of different element types). Why use load/store rather than just dereference?
> The SYCL specification says that `vec` is lowered to backend-specific vector types, which I believe should have a positive performance impact.
It says "where possible". In our implementation, if the sub-group size is >1 we vectorize across work-items, and you don't normally benefit from vector types. If you actually see any benefit, I would be very interested in that data. And if you use sub-group size = 1, you currently get the best performance with the ESIMD types, not the `vec` type.
I'm mostly interested in writing SYCL kernels for NVIDIA (CUDA backend) and AMD (HIP backend).
> It says "where possible".
Right, and the CUDA and HIP backends have vector types, so I expect `vec` to be mapped to them. Also, the specification doesn't say that about `marray`, so I assume `marray` may not be as efficient as `vec`.
> Why use load/store rather than just dereference?
I'm not sure what you mean by "just dereference". If I use `vec`, then I expect the load/store operations to be implemented more efficiently than just filling the `vec` with the `[]` operator. Since I'm interested in implementing memory-intensive operations, I need loads and stores to be as efficient as possible.
> In our implementation, if the sub-group size is >1 we vectorize across work-items, and you don't normally benefit from vector types.
Is this true for CUDA and HIP backends?
> If you actually see any benefit, I would be very interested in that data.
I remember I had performance issues when I didn't use `vec` with the CUDA backend, but I don't have the data now. If you think `vec` is redundant in such cases, then I can run some experiments.
I've looked into similar applications, and I think there is an inherent problem with the way `marray` is implemented: in contrast to built-in vector types, the alignment of `marray` follows the same rules as `std::array`, which makes it impossible to vectorize loads and stores to/from it. The PTX ISA spec mandates that:

> By default, vector variables are aligned to a multiple of their overall size (vector length times base-type size), to enable vector load and store instructions which require addresses aligned to a multiple of the access size.

As a result, the load-store vectorizer sees those accesses as misaligned and bails out; the compiler then emits the scalar loads and stores that @densamoilov observed.
Initial support (meaning no support for math built-ins with `vec<bfloat16>`) was added in #12261.
@densamoilov, I will close this feature request as complete, but please report any issues you have found. @cperkinsintel: FYI