xtensor-blas
Tensordot, views, performance
Hello,
I'm just starting to explore the possibilities of the xtensor stack. It was a bit rough at first, as I had not done proper C++ in a while, but I found my way around it after some time. However, I feel I am probably missing some things.
My current setup is based on:
- `xtensor=0.21.4` (conda-forge)
- `xsimd=7.4.7` (conda-forge)
- `xtensor-blas` from source
- `MKL` as BLAS
- `target_link_libraries(<my-module> PRIVATE xtensor xtensor::optimize xtensor::use_xsimd)` in the CMake
- `gcc=7.3.0`
I am doing a quick comparison with numpy (linked with MKL as well); generally I find I'm about 1.5x slower, which is a bit surprising as most of what I do is large matrix computations.
For instance, I was trying to do a tensordot on the last dimension of two 3-d tensors, using two methods:
```cpp
template <class G>
auto tensordot(const xt::xexpression<G> &e1, const xt::xexpression<G> &e2) {
    const G &m1 = e1.derived_cast();
    const G &m2 = e2.derived_cast();
    // Contract the last axis of both 3-d tensors -> 4-d result
    return xt::eval(xt::linalg::tensordot(m1, m2, {2}, {2}));
}
```
```cpp
template <class G>
auto tensordot_manual(const xt::xexpression<G> &e1, const xt::xexpression<G> &e2) {
    const G &m1 = e1.derived_cast();
    const G &m2 = e2.derived_cast();
    // Flatten the first two axes so the contraction becomes a plain matmul
    auto mm1 = xt::reshape_view(m1, {m1.shape(0) * m1.shape(1), m1.shape(2)});
    auto mm2 = xt::reshape_view(m2, {m2.shape(0) * m2.shape(1), m2.shape(2)});
    // Note: only returning a matrix here, not the 4-d tensor like above
    return xt::eval(xt::linalg::dot(mm1, xt::transpose(mm2)));
}
```
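(For completeness, folding the matrix back into the 4-d layout that `xt::linalg::tensordot` returns should just be one more reshape. A minimal sketch, where the inputs `a` and `b` and their shapes are made up for illustration:)

```cpp
// Sketch only: `a` and `b` are hypothetical 3-d inputs; fold the
// (s0*s1, t0*t1) matrix back into the (s0, s1, t0, t1) layout.
xt::xtensor<float, 3> a = xt::ones<float>({2, 3, 4});
xt::xtensor<float, 3> b = xt::ones<float>({5, 6, 4});
auto res = tensordot_manual<xt::xtensor<float, 3>>(a, b);
auto res4d = xt::reshape_view(res, {a.shape(0), a.shape(1), b.shape(0), b.shape(1)});
```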
Registered with pybind11 as:
```cpp
m.def("tensordot",
      [](const xt::pytensor<float, 3>& m1, const xt::pytensor<float, 3>& m2) {
          return tensordot<xt::pytensor<float, 3>>(m1, m2);
      }, "M"_a, "N"_a);
m.def("tensordot_manual",
      [](const xt::pytensor<float, 3>& m1, const xt::pytensor<float, 3>& m2) {
          return tensordot_manual<xt::pytensor<float, 3>>(m1, m2);
      }, "M"_a, "N"_a);
```
Now, trying a simple timing in a Jupyter notebook: the direct `tensordot` is roughly 1.5x slower, which is unfortunately what I seem to get often. But I am more confused by the manually reshaped and transposed version (`tensordot_manual`), which is even faster in numpy but much, MUCH slower with my code.
Any thoughts on what is happening here? Having a look at `xt::linalg::dot`, it seems everything should be mapped to a single BLAS call, as the reshaping and the transposition should just be views of the same data.
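(If it matters, the check I was about to add inside `tensordot_manual` is below; my possibly wrong assumption is that the BLAS dispatch needs contiguous operands to issue a single gemm:)

```cpp
#include <iostream>

// Guesswork only: mm1/mm2 are the reshape views from the snippet above.
// If they no longer advertise a contiguous layout, I assume the dot
// call cannot be forwarded to one gemm and falls back to a slower path.
std::cout << std::boolalpha
          << (mm1.layout() == xt::layout_type::row_major) << ' '
          << (mm2.layout() == xt::layout_type::row_major) << std::endl;
```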
Well, as usual when I have been blocked on something for hours, I figure out why as soon as I post about it (well, partially here).
Replacing:
```cpp
auto mm1 = xt::reshape_view(m1, {m1.shape(0) * m1.shape(1), m1.shape(2)});
auto mm2 = xt::reshape_view(m2, {m2.shape(0) * m2.shape(1), m2.shape(2)});
```
with:
```cpp
xt::xtensor<float, 2> mm1 = xt::reshape_view(m1, {m1.shape(0) * m1.shape(1), m1.shape(2)});
xt::xtensor<float, 2> mm2 = xt::reshape_view(m2, {m2.shape(0) * m2.shape(1), m2.shape(2)});
```
seems to solve the main difference. So I then have two questions:
- Why do I seem to be 1.5x slower on a matrix multiplication, technically without doing any copies and with the same BLAS?
- How should `reshape_view` be called so that it has static rank information, which seemed to be the missing bit here?
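For the second question, the closest I have gotten is to pass a fixed-size shape container instead of an initializer list; whether `reshape_view` actually adopts the container's static size as the view's rank is an assumption on my part:

```cpp
// Assumption: handing reshape_view a std::array keeps the rank static,
// so downstream code (like the dot dispatch) sees a fixed-rank view.
auto mm1 = xt::reshape_view(
    m1, std::array<std::size_t, 2>{m1.shape(0) * m1.shape(1), m1.shape(2)});
auto mm2 = xt::reshape_view(
    m2, std::array<std::size_t, 2>{m2.shape(0) * m2.shape(1), m2.shape(2)});
```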