`CsrParAssembler::assemble_pattern` significantly slower than `CsrAssembler::assemble_pattern` on a single thread
Benchmarks showed ~30-70% overhead for the parallel variant with `RAYON_NUM_THREADS=1`. The discrepancy seems to be primarily related to rayon: preliminary investigation showed that replacing e.g. `into_par_iter` with `into_iter` removes most of the overhead. Further overhead could be removed by using atomic locks (though this requires more thought to handle the multi-threaded case efficiently).
Update: Wrapping all the code in a `rayon::scope(|_| { ... })` closure seems to remove a significant part of the overhead (but not all). This suggests that the switch between the main thread and the rayon thread for the iterator might be part of the culprit, perhaps because the cache of the rayon thread is "cold" compared to staying on the main thread all the way.
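To make the handoff hypothesis easier to test in isolation, here is a minimal std-only sketch. It does not use rayon and is not how rayon is implemented: `OneThreadPool`, `install`, `small_task`, `handoff_per_task` and `single_handoff` are hypothetical names made up for this illustration. It models a one-thread pool as a single worker draining a channel and compares handing each small task over from the main thread against handing over one job that does all the work on the worker, which is roughly what wrapping the code in a single scope would achieve if the hypothesis holds.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Instant;

// A job is a boxed closure that the worker runs to completion.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical stand-in for a one-thread pool: a single worker draining a channel.
struct OneThreadPool {
    sender: Option<mpsc::Sender<Job>>,
    handle: Option<thread::JoinHandle<()>>,
}

impl OneThreadPool {
    fn new() -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let handle = thread::spawn(move || {
            // Runs until every sender has been dropped.
            for job in receiver {
                job();
            }
        });
        Self { sender: Some(sender), handle: Some(handle) }
    }

    /// Run `f` on the worker and block the calling thread until the result is back.
    fn install<R, F>(&self, f: F) -> R
    where
        F: FnOnce() -> R + Send + 'static,
        R: Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            let _ = tx.send(f());
        });
        self.sender.as_ref().unwrap().send(job).unwrap();
        rx.recv().unwrap()
    }
}

impl Drop for OneThreadPool {
    fn drop(&mut self) {
        // Closing the channel ends the worker loop, then we wait for the worker.
        drop(self.sender.take());
        if let Some(handle) = self.handle.take() {
            let _ = handle.join();
        }
    }
}

/// Small unit of work, standing in for one short parallel-iterator call.
fn small_task(i: u64) -> u64 {
    (0..64).fold(i, |acc, x| acc.wrapping_mul(31).wrapping_add(x))
}

/// One main-thread-to-worker handoff per task.
fn handoff_per_task(pool: &OneThreadPool, n: u64) -> u64 {
    (0..n)
        .map(|i| pool.install(move || small_task(i)))
        .fold(0u64, |a, b| a.wrapping_add(b))
}

/// A single handoff; all tasks then run in place on the worker.
fn single_handoff(pool: &OneThreadPool, n: u64) -> u64 {
    pool.install(move || (0..n).map(small_task).fold(0u64, |a, b| a.wrapping_add(b)))
}

fn main() {
    let pool = OneThreadPool::new();
    let n = 10_000;

    let start = Instant::now();
    let a = handoff_per_task(&pool, n);
    let per_task = start.elapsed();

    let start = Instant::now();
    let b = single_handoff(&pool, n);
    let single = start.elapsed();

    // Both strategies compute the same value; only the number of handoffs differs.
    assert_eq!(a, b);
    println!("handoff per task: {:?}, single handoff: {:?}", per_task, single);
}
```

Timings from this sketch only indicate how expensive a blocking cross-thread handoff is relative to a small task on a given machine. They say nothing definitive about rayon itself, and they do not separate the wake-up cost from the cold-cache effect suggested above.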