tpp-mlir
Make 2D parallelization a run time choice
Currently, we select the blocking factors on the command line. The default, {2,8}, is optimal for 16 threads.
In our benchmarks, we pick the best one for each number of threads, but the compiler can't do that, because OpenMP's OMP_NUM_THREADS can change at run time.
We need to lower code that can interpret that environment variable (via OpenMP dialect) and create a dynamic loop blocking based on run time values, so that we only need to generate the code once and it can run on any number of threads.
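To make the idea of run-time loop blocking concrete, here is a minimal sketch in Python rather than MLIR. It assumes the blocking pair is read as a rows x cols thread grid over a 2D iteration space of size m x n. The names `tile_bounds` and `tiles_2d` are hypothetical and are not part of tpp-mlir. The point is that the tile bounds are computed from values known only at run time, so the same code serves any grid.

```python
def tile_bounds(extent, parts, index):
    """Half-open [start, stop) range of chunk `index` when `extent`
    iterations are split into `parts` near-equal chunks."""
    base, extra = divmod(extent, parts)
    # The first `extra` chunks each take one additional iteration.
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def tiles_2d(m, n, grid):
    """Row and column ranges owned by each thread of a rows x cols grid.

    Assumption: thread ids are laid out row-major over the grid.
    """
    rows, cols = grid
    return [
        (tile_bounds(m, rows, tid // cols), tile_bounds(n, cols, tid % cols))
        for tid in range(rows * cols)
    ]


if __name__ == "__main__":
    # The grid would come from run-time values, not a compile-time flag.
    for tid, (row_range, col_range) in enumerate(tiles_2d(4, 16, (2, 8))):
        print(tid, row_range, col_range)
```

In generated code, the same arithmetic would be emitted as index computations feeding the loop bounds, with the grid loaded at run time instead of being baked in as constants.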
We also need to know which factors are best for each number of threads (cost model, per arch) and have a generated dispatch table so that we can choose them at run time.
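A dispatch table of this kind could look roughly like the sketch below, again in Python for illustration only. The only entry taken from this issue is 16 threads mapping to {2,8}. Everything else is an assumption: the names `DISPATCH`, `fallback_grid`, `num_threads` and `select_grid` are hypothetical, and the near-square fallback is a placeholder heuristic, not a measured optimum. Reading the environment variable directly stands in for whatever the OpenMP dialect lowering would actually query.

```python
import math
import os

# Only the 16-thread entry comes from the issue text. A real table would be
# generated per architecture from a cost model.
DISPATCH = {16: (2, 8)}


def fallback_grid(threads):
    """Placeholder heuristic: the factor pair of `threads` closest to square."""
    rows = math.isqrt(threads)
    while threads % rows != 0:
        rows -= 1
    return rows, threads // rows


def num_threads(env=os.environ, default=1):
    """Thread count from OMP_NUM_THREADS, or `default` if unset or invalid.

    OMP_NUM_THREADS may be a comma-separated list for nested parallelism,
    so only the first entry is used here.
    """
    first = env.get("OMP_NUM_THREADS", "").split(",")[0].strip()
    return int(first) if first.isdigit() and int(first) > 0 else default


def select_grid(threads, table=DISPATCH):
    """Look up the blocking factors, falling back to the heuristic."""
    return table.get(threads) or fallback_grid(threads)


if __name__ == "__main__":
    threads = num_threads()
    print(threads, select_grid(threads))
```

Keeping the fallback separate from the table means that thread counts the cost model never covered still get a valid factorization, even if it is not the fastest one.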