xtensor
xtensor slower than numpy
I'm trying to replicate this numpy function in xtensor:
def upscale(arr):
    start = time.time()
    w, h = arr.shape
    arr = np.pad(arr, 1, mode="edge") / 16
    lu, ru, ld, rd = arr[:-2, :-2], arr[:-2, 2:], arr[2:, :-2], arr[2:, 2:]
    lu, ru, ld, rd = lu*9+ru*3+ld*3+rd, lu*3+ru*9+ld+rd*3, lu*3+ru+ld*9+rd*3, lu+ru*3+ld*3+rd*9
    ret = np.stack([lu, ru, ld, rd]).reshape(2, 2, w, h).transpose(2, 0, 3, 1).reshape(w*2, h*2)
    print(f"upscale took {(time.time()-start)*1000} ms")
    return ret
My attempt:
template <class E>
inline xt::xarray<float> upscale(E&& e) noexcept
{
    ta::Timeit time("upscale");
    auto step1 = xt::pad(e, {{1, 1}, {1, 1}, {0, 0}}, xt::pad_mode::symmetric);
    auto lu = xt::view(step1, xt::range(_, -2), xt::range(_, -2));
    auto ru = xt::view(step1, xt::range(2, _), xt::range(_, -2));
    auto ld = xt::view(step1, xt::range(_, -2), xt::range(2, _));
    auto rd = xt::view(step1, xt::range(2, _), xt::range(2, _));
    auto lu2 = lu*9 + ru*3 + ld*3 + rd;
    auto ru2 = lu*3 + ru*9 + ld + rd*3;
    auto ld2 = lu*3 + ru + ld*9 + rd*3;
    auto rd2 = lu + ru*3 + ld*3 + rd*9;
    auto step3 = xt::eval(xt::reshape_view(xt::stack(
        xtuple(
            xt::stack(xtuple(lu2, ru2), 1),
            xt::stack(xtuple(ld2, rd2), 1)
        ), 3), {e.shape(0)*2, e.shape(1)*2}) * (1.f/16.f));
    return step3;
}
The Python version takes around 35 ms to evaluate, while the C++ version runs in around 300 ms (both on a 2048x2048 input). This is already after quite a lot of optimization attempts on my side.
Why is the C++ version slower, and how do I bring it up to speed? The relevant C++ compilation command:
/usr/bin/c++
-DXTENSOR_USE_XSIMD
-I/home/quinor/kody/tsparter/tsparter/.
-I/home/quinor/kody/tsparter/tsparter/include
-I/home/quinor/kody/tsparter/build/_deps/stb-src
-I/home/quinor/kody/tsparter/build/_deps/xtensor-src/include
-I/home/quinor/kody/tsparter/build/_deps/xtl-src/include
-I/home/quinor/kody/tsparter/build/_deps/xsimd-src/include
-O3
-DNDEBUG
-std=gnu++17
-march=native
-MD
-MT tsparter/CMakeFiles/tsparter.dir/image_filters.cc.o
-MF CMakeFiles/tsparter.dir/image_filters.cc.o.d
-o CMakeFiles/tsparter.dir/image_filters.cc.o
-c /home/quinor/kody/tsparter/tsparter/image_filters.cc
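As a point of comparison (my addition, not part of the original post): the same 9/3/3/1 kernel written as a plain hand-rolled loop, with the edge padding folded into a clamped index. A baseline like this helps separate the cost of the algorithm itself from xtensor's expression overhead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hand-rolled sketch of the numpy upscale() above: 2x upscale where each
// output pixel is a 9/3/3/1 blend of the four diagonal neighbours of its
// source pixel, with edge padding emulated by index clamping.
std::vector<float> upscale2x(const std::vector<float>& a, int w, int h)
{
    auto at = [&](int i, int j) {
        // edge padding: clamp indices into [0, w-1] x [0, h-1]
        if (i < 0) i = 0;
        if (i >= w) i = w - 1;
        if (j < 0) j = 0;
        if (j >= h) j = h - 1;
        return a[static_cast<std::size_t>(i) * h + j];
    };
    std::vector<float> out(static_cast<std::size_t>(2 * w) * (2 * h));
    for (int i = 0; i < w; ++i)
        for (int j = 0; j < h; ++j) {
            // diagonal neighbours of the centre pixel (lu/ru/ld/rd in the post)
            float lu = at(i - 1, j - 1), ru = at(i - 1, j + 1);
            float ld = at(i + 1, j - 1), rd = at(i + 1, j + 1);
            // each source pixel expands to a 2x2 output block
            float* o = &out[static_cast<std::size_t>(2 * i) * (2 * h) + 2 * j];
            o[0]         = (9*lu + 3*ru + 3*ld +   rd) / 16.f;
            o[1]         = (3*lu + 9*ru +   ld + 3*rd) / 16.f;
            o[2 * h]     = (3*lu +   ru + 9*ld + 3*rd) / 16.f;
            o[2 * h + 1] = (  lu + 3*ru + 3*ld + 9*rd) / 16.f;
        }
    return out;
}
```

A single pass like this touches each input element a bounded number of times and writes the output contiguously per 2x2 block, which is roughly what the numpy version achieves via stack/reshape/transpose.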
We have a performance issue with the current implementation of the views, which produces bad assembly code. This issue is under investigation.
That's unfortunate :/ view operations are the majority of what I'm currently doing...
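One possible mitigation while that issue is open (my suggestion, not from the maintainers): materialize shared view subexpressions, e.g. with `xt::eval` or by assigning to an `xt::xarray`, before combining them, so the padded input is not re-traversed once per output term. The effect can be sketched in plain C++ with an instrumented lazy value:

```cpp
#include <cassert>
#include <functional>

// Toy stand-in for a lazy view: every read re-runs the producing expression.
// `evals` counts how often the underlying work actually happens.
struct Lazy {
    std::function<float()> f;
    int* evals;
    float get() const { ++*evals; return f(); }
};

// Four weighted terms reading one shared lazy input, like lu feeding
// lu2/ru2/ld2/rd2 in the post: the expression is evaluated four times.
float sum_lazy(const Lazy& x) {
    return 9 * x.get() + 3 * x.get() + 3 * x.get() + 1 * x.get();
}

// Same arithmetic after materializing the input once ("eval" semantics):
// one evaluation, then the cached value is reused.
float sum_materialized(const Lazy& x) {
    float v = x.get();
    return 9 * v + 3 * v + 3 * v + 1 * v;
}
```

The trade-off is an extra allocation and memory pass per materialized term, which can still be far cheaper than repeatedly evaluating a strided view expression with poor codegen.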
Hi, I don't know if this is relevant to this issue, but a simple transpose of a { 37, 8400 } array took 10 ms:
int boxRows = 37;
int boxCols = 8400;
std::vector<int> boxOutputArrShape = { boxRows, boxCols };
xt::xarray<float> boxOutputArr = xt::adapt(boxOutputFloatBuffer,
                                           boxRows * boxCols,
                                           xt::no_ownership(),
                                           boxOutputArrShape);
// This took a good 10 ms running on an iPhone 12 Pro, while the equivalent in OpenCV took less than 1 ms
xt::xarray<float> predictionsArr = xt::transpose(boxOutputArr);
Is there anything wrong with the above snippet? Or, if this is an xt::view
issue, is there a version I can downgrade to that is faster?
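For comparison (an addition, not from the thread): a cache-blocked out-of-place transpose in plain C++ avoids the lazy-view machinery entirely, and something like it could serve as a fallback for a { rows, cols } float buffer while the view performance issue is open. The block size of 32 is an arbitrary assumption to tune per target.

```cpp
#include <algorithm>
#include <cstddef>

// Cache-blocked transpose of a row-major rows x cols matrix into a
// cols x rows output. Blocking keeps both the reads and the strided
// writes within cache-sized tiles.
void transpose_blocked(const float* in, float* out, int rows, int cols)
{
    const int B = 32;  // tile size: an assumption, not a tuned value
    for (int i0 = 0; i0 < rows; i0 += B)
        for (int j0 = 0; j0 < cols; j0 += B)
            for (int i = i0; i < std::min(i0 + B, rows); ++i)
                for (int j = j0; j < std::min(j0 + B, cols); ++j)
                    out[static_cast<std::size_t>(j) * rows + i] =
                        in[static_cast<std::size_t>(i) * cols + j];
}
```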