Revisit BLAS tuning
The cutoffs for small matrices are not necessarily accurate, depending on the machine, especially when BLAS is using threads.
Directly related to this (not sure it's worth opening a new issue): this would also require some support for controlling BLAS threads, at least for a few of the most popular BLAS libraries.
For example, some profiling files that were just changed (https://github.com/flintlib/flint/pull/2019) assumed that if Flint uses BLAS, then it can call openblas_set_num_threads. I'm not sure to what extent it is feasible, e.g., to ask at configure time for the name of a similar function to use with the chosen BLAS library. (I see that BLIS, for example, has bli_thread_set_num_threads, but I'm not sure all of the most popular BLAS libraries have an equivalent.)
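
To make the configure-time idea a bit more concrete, here is a minimal sketch of a wrapper, not a proposal for the actual implementation. The macros FLINT_BLAS_THREADS_HEADER and FLINT_BLAS_SET_NUM_THREADS and the function blas_try_set_num_threads are hypothetical names that do not exist in Flint today. The assumption is that configure would define both macros when the user names a thread-setting function (e.g. `-DFLINT_BLAS_THREADS_HEADER='<cblas.h>' -DFLINT_BLAS_SET_NUM_THREADS=openblas_set_num_threads` for OpenBLAS), and leave them undefined otherwise.

```c
#include <assert.h>

/* Hypothetical configure-provided macros (assumptions, not existing Flint
   options):
     FLINT_BLAS_THREADS_HEADER   header declaring the function, e.g. <cblas.h>
     FLINT_BLAS_SET_NUM_THREADS  function name, e.g. openblas_set_num_threads
   Including the library's own header avoids guessing the argument type,
   which differs between libraries. */
#if defined(FLINT_BLAS_SET_NUM_THREADS) && defined(FLINT_BLAS_THREADS_HEADER)
#include FLINT_BLAS_THREADS_HEADER
#define FLINT_BLAS_HAS_THREAD_CONTROL 1
#else
#define FLINT_BLAS_HAS_THREAD_CONTROL 0
#endif

/* Ask the BLAS library to use n threads.
   Returns 1 if the request was forwarded to BLAS, and 0 if no
   thread-setting function was configured (the call is then a no-op). */
static int blas_try_set_num_threads(int n)
{
    assert(n >= 1);
#if FLINT_BLAS_HAS_THREAD_CONTROL
    FLINT_BLAS_SET_NUM_THREADS(n);
    return 1;
#else
    return 0;
#endif
}
```

Profiling code could then call blas_try_set_num_threads(1) and, when it returns 0, report that the BLAS thread count is not under its control instead of failing to compile or link against a non-OpenBLAS library.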