tpp-mlir

Make 2D parallelization a run time choice

Open rengolin opened this issue 11 months ago • 0 comments

Currently, we select the blocking on the command line; the default, {2,8}, is optimal for 16 threads.

In our benchmarks, we pick the best one for each number of threads, but the compiler can't do that, as OpenMP's OMP_NUM_THREADS can change at run time.

We need to lower code that reads that environment variable (via the OpenMP dialect) and creates a dynamic loop blocking based on run time values, so that we only need to generate the code once and it can run on any number of threads.
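To make the idea concrete, here is a rough sketch in Python rather than in MLIR. It assumes that a pair of factors such as {2,8} means "split the two outer loop dimensions into 2 and 8 chunks"; that reading is an assumption, as are all the function names (`runtime_num_threads`, `split`, `task_grid`). The generated code would do the equivalent with run time values instead of compile-time constants.

```python
import os


def runtime_num_threads(env=os.environ):
    """Thread count taken from OMP_NUM_THREADS, else the CPU count."""
    raw = env.get("OMP_NUM_THREADS", "")
    # The variable may hold a comma-separated list (nested parallelism);
    # this sketch only looks at the first level.
    first = raw.split(",")[0].strip()
    if first.isdigit() and int(first) > 0:
        return int(first)
    return os.cpu_count() or 1


def split(extent, parts):
    """Split range(extent) into `parts` contiguous, near-equal chunks."""
    base, extra = divmod(extent, parts)
    bounds, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def task_grid(rows, cols, factors):
    """One (row_range, col_range) task per cell of the 2D grid."""
    return [(r, c)
            for r in split(rows, factors[0])
            for c in split(cols, factors[1])]
```

For example, `task_grid(64, 64, (2, 8))` yields 16 tasks, one per thread in the 16-thread case. Since the factors are ordinary arguments, the same code works for any thread count once the factors are known.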

We also need to know which factors are best for each number of threads (a cost model, per architecture) and have a generated dispatch table so that we can choose them at run time.
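A minimal sketch of what such a dispatch table lookup could look like, again in Python for brevity. Only the 16-thread entry, {2,8}, comes from this issue; the "generic" architecture key, the table layout, and the fallback rule are hypothetical placeholders, not the output of any cost model.

```python
import math

# Per-architecture table: thread count -> 2D blocking factors.
# Only the 16-thread entry is taken from this issue; a real table would
# be generated from a cost model. "generic" is a made-up key.
BEST_FACTORS = {
    "generic": {16: (2, 8)},
}


def fallback_factors(num_threads):
    """Placeholder rule: the factor pair (a, b) with a <= b,
    a * b == num_threads, and a as large as possible."""
    if num_threads < 1:
        raise ValueError("num_threads must be positive")
    a = math.isqrt(num_threads)
    while num_threads % a:
        a -= 1
    return (a, num_threads // a)


def select_factors(num_threads, arch="generic"):
    """Look up the table first, then fall back to the placeholder rule."""
    table = BEST_FACTORS.get(arch, {})
    return table.get(num_threads) or fallback_factors(num_threads)
```

With this table, `select_factors(16)` returns `(2, 8)`, while a thread count without an entry, such as 12, falls through to the placeholder rule and gets `(3, 4)`. Whether a fallback like that is acceptable, or whether the table should cover every thread count, is an open design question.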

rengolin, Feb 29 '24 23:02