
How do I get the most out of xtensor for immediate evaluations on large arrays? bonus: xtensor_fixed bug.

Open masteranza opened this issue 3 years ago • 8 comments

Hi! I've been here before complaining about the lack of recent benchmarks. I hoped someone was going to do them, but since nobody showed up and I find many of the xtensor features attractive for scientific computing, I've decided to try it myself in a very limited sense.

What I'm mostly interested in are immediate, repeated complex multiplications (think evolution) on large tensors (not necessarily high-dimensional) of complex numbers (think wavefunction), together with some reductions (think calculating weighted averages over tensors).

Since I'm also a little time-constrained at the moment, I wanted to limit myself to the most performant scenario for xtensor, namely xtensor_fixed with SIMD enabled and exceptions disabled, compared with good old C arrays. Unfortunately, it turned out that I could only go up to size 700x700, due to a bug (more about this at the end). For that size I got the following results for 5000 runs (all except init, which is done once, as I don't care that much about it):

  init time:     15 ms or  0 s
 xinit time:     12 ms or  0 s
  test time:   1716 ms or  1 s
 xtest time:   1689 ms or  1 s
  norm time:   2328 ms or  2 s
 xnorm time:   2346 ms or  2 s

A negligible difference between the two, I'd say. Next I tested larger arrays (8192x8192) with 500 runs, but with xarray<cplxd, xt::layout_type::row_major>, obtaining:

  init time:    742 ms or  0 s
 xinit time:   1007 ms or  1 s
  test time:  23492 ms or 23 s
 xtest time:  23655 ms or 23 s
  norm time:  32419 ms or 32 s
 xnorm time:  32236 ms or 32 s

Going a little overboard (and into swap), I also tested a 16384x16384 array 5 times:

  init time:   4305 ms or  4 s
 xinit time:   4650 ms or  4 s
  test time:  31953 ms or 31 s
 xtest time:  24254 ms or 24 s
  norm time:  13527 ms or 13 s
 xnorm time:  14031 ms or 14 s

I'm pasting the code I'm using for benchmarking (I'm not an expert on this stuff; please also excuse the number of unnecessary includes), together with the compilation command:

g++-11 -march=native -O3 -xc++ -std=c++17 -lstdc++ -shared-libgcc -I/opt/homebrew/include -L/opt/homebrew/lib xtest.cpp -o xtest

#include <stdio.h>
#include <stdlib.h>
#include <cmath>      // sin, cos
#include <algorithm>  // std::transform
#include <chrono>
#include <thread>
#include <complex>
#include <vector>
#include <string_view>
#define XTENSOR_USE_XSIMD
#define XTENSOR_DISABLE_EXCEPTIONS
//#define XTENSOR_USE_OPENMP
#include <xtensor/xarray.hpp>
#include <xtensor/xio.hpp>
#include <xtensor/xview.hpp>
#include <xtensor/xfixed.hpp>
#include <xtensor/xreducer.hpp>
#include <xtensor/xeval.hpp>
#include <xtensor/xmath.hpp>
#include <xtensor/xrandom.hpp>
#include <xtensor/xsort.hpp>
#include <xtensor/xnorm.hpp>
#include <xtl/xtype_traits.hpp>

using TimeVar = std::chrono::high_resolution_clock::time_point;
using cplxd = std::complex<double>;

TimeVar time_start;
TimeVar time_stop;
TimeVar time_calc_start;
TimeVar time_calc_stop;
TimeVar time_now;
//TOTAL
void startTimer(TimeVar& start)
{
	start = std::chrono::high_resolution_clock::now();
}

void stopTimer(TimeVar& start, TimeVar& stop, std::string_view name)
{
	using std::chrono::duration;
	using std::chrono::milliseconds;
	using std::chrono::duration_cast;
	stop = std::chrono::high_resolution_clock::now();
	size_t ms = duration_cast<milliseconds>(stop - start).count();

	printf("%10s time: %6zu ms or %2zu s\n", name.data(), ms, ms / 1000);
}
//Per MODE
void startCalcTimer()
{
	startTimer(time_calc_start);
}
void stopCalcTimer(std::string_view name)
{
	stopTimer(time_calc_start, time_calc_stop, name);
}


constexpr cplxd c = { 0.70710678, 0.70710678 };
constexpr cplxd one = { 1.0,1.0 };
constexpr size_t n = 10000;
constexpr int runs = 50;
cplxd* a = new cplxd[n * n];
// xt::xtensor_fixed<cplxd, xt::xshape<n, n>> b{};
std::vector<size_t> shape = { n,n };
xt::xarray<cplxd, xt::layout_type::row_major> b(shape);
void init() noexcept
{
	for (int i = 0; i < n; i++)
		for (int j = 0; j < n; j++)
			a[i * n + j] = { sin(i + j), cos(i + j) };
}
void test() noexcept
{
	for (int i = 0; i < n; i++)
		for (int j = 0; j < n; j++)
			a[i * n + j] *= c;
}
void xinit() noexcept
{
	for (int i = 0; i < n; i++)
		for (int j = 0; j < n; j++)
			b.at(i, j) = { sin(i + j), cos(i + j) };
}
void xtest() noexcept
{
	// for (int i = 0; i < n; i++)
		// for (int j = 0; j < n; j++)
			// b.at(i, j) *= c;
	std::transform(b.cbegin(), b.cend(), b.begin(), [](auto&& v) { return v * c; });
	// b *= c;
}
double nrm() noexcept
{
	double res = 0.0;
	for (int i = 0; i < n; i++)
		for (int j = 0; j < n; j++)
			res += std::norm(a[i * n + j]);
	return res;
	// printf(" nrm()= %g %g\n", res.real(), res.imag());
}
double xnrm() noexcept
{
	return xt::norm_sq(b, xt::evaluation_strategy::immediate)();
	// std::cout << "xnorm()= " << res << std::endl;

}

int main(int argc, char* argv[])
{
	startCalcTimer();
	init();
	stopCalcTimer("init");

	startCalcTimer();
	xinit();
	stopCalcTimer("xinit");

	startCalcTimer();
	for (int i = 0;i < runs;i++)
		test();
	stopCalcTimer("test");

	startCalcTimer();
	for (int i = 0;i < runs;i++)
		xtest();
	stopCalcTimer("xtest");

	double all = 0;
	startCalcTimer();
	for (int i = 0;i < runs;i++) all += nrm();
	stopCalcTimer("norm");
	printf("%10e\n", all);

	double allx = 0;
	startCalcTimer();
	for (int i = 0;i < runs;i++) allx += xnrm();
	stopCalcTimer("xnorm");
	printf("%10e\n", allx);
}

My final questions are:

  1. How can I make xtensor_fixed work on big arrays? Currently, when I try to run it on ARM64 (Apple M1) with n=7000, I get the following errors (both with SIMD/noexcept and without):

/var/folders/61/dznsn81x52s7bh8l3445xj8h0000gn/T//ccRuFHBq.s:1001:2: error: addend too big for relocation
        adrp x20, _b@PAGE+784000032
/var/folders/61/dznsn81x52s7bh8l3445xj8h0000gn/T//ccRuFHBq.s:1003:2: error: addend too big for relocation
        add x20, x20, _b@PAGEOFF+784000032;momd
/var/folders/61/dznsn81x52s7bh8l3445xj8h0000gn/T//ccRuFHBq.s:1003:2: error: fixup value out of range
        add x20, x20, _b@PAGEOFF+784000032;momd
/var/folders/61/dznsn81x52s7bh8l3445xj8h0000gn/T//ccRuFHBq.s:1620:2: error: addend too big for relocation
        adrp x2, _b@PAGE+784000032
/var/folders/61/dznsn81x52s7bh8l3445xj8h0000gn/T//ccRuFHBq.s:1621:2: error: addend too big for relocation
        add x2, x2, _b@PAGEOFF+784000032;momd
/var/folders/61/dznsn81x52s7bh8l3445xj8h0000gn/T//ccRuFHBq.s:1621:2: error: fixup value out of range
        add x2, x2, _b@PAGEOFF+784000032;momd

  2. What else can I do to make xtensor perform faster on the xtest() and xnrm() tests I've described?
  3. I know that I could probably have done xinit() in a simpler way, but I'm not sure how.
  4. I'd be happy to see others test this and post results on other (say, x86) machines.
  5. Do the results mean that the SIMD optimizations hold little to no benefit in this particular case over the compiler's built-in optimizations?

EDIT: formatting code

masteranza avatar Jul 07 '21 23:07 masteranza

Hello,

I hoped someone is going to do it, but since nobody showed up and I find many of the xtensor features attractive for scientific computing, I've decided to try myself in some very limited sense.

Thanks for these benchmarks. We previously used google benchmark along with google test, but since we are moving away from google test, we should probably look for a new benchmark library. We have very limited bandwidth at the moment, which is why nobody hopped on this. We should be more responsive in the near future.

Regarding your last questions:

  1. It is not possible with the current implementation of xtensor_fixed, because under the hood it allocates the data buffer on the stack. Therefore you are limited by the size of the stack of the process. A fix would be to allow dynamic allocation of the data buffer in xfixed_container.
  2. For xtest(), it looks like it is always equivalent to or faster than the C loop, so I'm not sure you can get better performance. For xnrm(), this would require investigating and profiling the code of xreducer. But the difference is not that large anyway.
  3. I don't think we provide fancy enough generators to simplify that.
  4. Will give it a try when I can.
  5. Compilers are able to vectorize simple loops like those (using intrinsics in the generated assembly), making the generated assembly equivalent to that with xsimd. You should notice differences if you use mathematical functions for which there is no intrinsics available (cos, exp, log).

JohanMabille avatar Jul 08 '21 10:07 JohanMabille

@JohanMabille Thanks for a rapid and complete answer! So regarding 1., would it be possible to do what you've described while keeping the fixed (compile-time) strides etc.? If so, can one simply use a custom container, or does one need to modify xfixed_container? 2. Well, I'm not yet sure it is really faster - I'll need to re-run the tests. The one instance where it was significantly faster involved heavy use of swap memory, so fancy things might have been going on. Anyway, it's quite reassuring to see that xtensor performs at least as well as bare C loops. 3. Sure, I'll try to learn the library better. I understand that you're all busy developing it. 4. Thanks! 5. That sounds really promising! Thank you!

Btw. your medium blog documenting xtensor development is fantastic! Looking forward to see more!

masteranza avatar Jul 08 '21 10:07 masteranza

Btw. your medium blog documenting xtensor development is fantastic! Looking forward to see more!

Many thanks! I need to find the time to complete part 1 (only one article missing, and I already wrote half of it).

Regarding point 1, that would require some changes in xfixed_container, but it should not be too much work. Basically, adding a template parameter to the class (with a default argument for backward compatibility).

Regarding point 3, I think my previous answer (which I have edited) was not clear. What I meant is that I don't think we can simplify your code with the generators we provide for now, but I'm not 100% sure. I should use my own stuff more often ;)

JohanMabille avatar Jul 08 '21 13:07 JohanMabille

Thanks a lot for this discussion. I agree that we are currently missing good benchmarks, and that this is something we should do in order to stay on top of potential gains.

I just want to jump in on point 1. I thought that we already have what we want:

  • xtensor: (fixed) compile-time allocated strides and shape.
  • xtensor_fixed: (fixed) compile-time allocated data on the stack.

(and of course xarray, which has everything dynamic). So I'm not entirely sure which case is missing between the two?

tdegeus avatar Jul 09 '21 08:07 tdegeus

The computation of the shape and the strides of xtensor_fixed is done at compile time, while that of xtensor is done at runtime.

JohanMabille avatar Jul 09 '21 09:07 JohanMabille

Ah right, you had the computation in mind, not the allocation.

tdegeus avatar Jul 09 '21 09:07 tdegeus

@JohanMabille, @tdegeus will you find some time to implement dynamic storage for xfixed_container anytime soon? Besides my needing it, it would also be very interesting to see the updated benchmarks.

masteranza avatar Jul 16 '21 15:07 masteranza

I would happily review a PR. However, I don't have time to do it myself soon, sorry.

tdegeus avatar Jul 28 '21 10:07 tdegeus