add multithreading for large inputs?
Would be cool to have built-in support for something like tbb's parallel_for for large inputs, similar to how we use scalar vs SIMD depending on size.
@ronag How large are you thinking about?
It takes thousands of nanoseconds to start a thread. Up to, say, 200,000 ns on some systems (it varies greatly). And you haven't done anything yet, you have just started the thread. And there is still overhead to join the thread once it is done.
If you have, say, a gigabyte of data, then you can go faster... if you have kilobytes, that's doubtful in my opinion. In the megabyte range, it is an open question (and depends on the host system).
Note that I am not dismissing the issue nor being argumentative.
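For anyone who wants to check the thread start and join cost on their own machine, here is a minimal sketch using only the standard library. The function name `average_thread_roundtrip_ns` and the `rounds` parameter are made up for this example, and the number it returns will vary a lot between systems.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: average wall-clock cost, in nanoseconds, of creating
// a std::thread that does no work and then joining it.
inline std::int64_t average_thread_roundtrip_ns(int rounds) {
  using clock = std::chrono::steady_clock;
  const auto start = clock::now();
  for (int i = 0; i < rounds; i++) {
    std::thread t([] {});  // start a thread with an empty body
    t.join();              // wait for it; the join cost is included
  }
  const auto stop = clock::now();
  const auto total =
      std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
  return total.count() / rounds;
}
```

Calling `average_thread_roundtrip_ns(1000)` gives a rough per-thread figure that can be compared against the time the single-threaded routine needs for a given input size.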
I think you are missing the point. Frameworks like tbb implement an efficient thread pool and remove much of the overhead of context switching, thread creation, joining, etc. Would need some tests ofc to see at which sizes it makes sense. But I've even used it from memcpy many years ago in latency sensitive applications. I'm thinking this might make sense if the overhead is negligible at sizes of ~128k.
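To make the idea concrete, here is a rough sketch of size-based dispatch: small inputs take the single-threaded path, large inputs are split into chunks. Everything in it is an assumption for illustration. The kernel `count_high_bit_serial` is a stand-in, `kParallelThreshold` just reuses the ~128k figure and is untested, and it uses plain `std::thread` so that it is self-contained. That means it still pays the thread creation cost that a pool would avoid; with tbb the chunk loop would instead be a `tbb::parallel_for` over a `tbb::blocked_range`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical cutoff based on the ~128k figure above; needs benchmarking.
constexpr std::size_t kParallelThreshold = 128 * 1024;

// Stand-in single-threaded kernel: counts bytes with the high bit set.
inline std::size_t count_high_bit_serial(const std::uint8_t* data,
                                         std::size_t len) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < len; i++) count += data[i] >> 7;
  return count;
}

// Dispatches on input size: serial below the threshold, chunked above it.
inline std::size_t count_high_bit(const std::uint8_t* data, std::size_t len) {
  if (len < kParallelThreshold) return count_high_bit_serial(data, len);

  const unsigned hw = std::thread::hardware_concurrency();  // may be 0
  std::size_t workers = hw == 0 ? 2 : hw;
  // Keep every chunk at least as large as the threshold.
  workers = std::min<std::size_t>(workers, len / kParallelThreshold);
  if (workers <= 1) return count_high_bit_serial(data, len);

  const std::size_t chunk = (len + workers - 1) / workers;
  std::vector<std::size_t> partial(workers, 0);
  std::vector<std::thread> threads;
  threads.reserve(workers);
  for (std::size_t w = 0; w < workers; w++) {
    const std::size_t begin = std::min(len, w * chunk);
    const std::size_t end = std::min(len, begin + chunk);
    // Each thread writes only to its own slot in `partial`.
    threads.emplace_back([&partial, data, begin, end, w] {
      partial[w] = count_high_bit_serial(data + begin, end - begin);
    });
  }
  std::size_t total = 0;
  for (std::size_t w = 0; w < workers; w++) {
    threads[w].join();
    total += partial[w];
  }
  return total;
}
```

Whether the chunked path wins at 128k, or only at much larger sizes, is exactly what the experiments would have to show.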
@ronag
> I'm thinking this might make sense if the overhead is negligible at sizes of ~128k.
I agree and it answers my question (How large are you thinking about?). It would be interesting to run experiments.