simd icon indicating copy to clipboard operation
simd copied to clipboard

Parallel API

Open Nugine opened this issue 2 years ago • 6 comments

Improve throughput by using multi-threading (rayon).

In most cases, the simd functions are fast enough that we may not benefit from multi-threading. And some functions are not divisible.

Crates with parallel api:

  • https://github.com/uhmarcel/rbase64

Nugine avatar Nov 27 '22 02:11 Nugine

Parallel threshold

$$ \frac{n}{pv}+c<\frac{n}{v} $$

$$ n > \frac{p}{p-1}cv $$

$n$: input size (B) $p$: num threads $v$: throughput (B/ns) $c$: threading overhead (ns)

Parallel time rate (smaller is better)

$$ \frac{n/(pv)+c}{n/v} < 1 $$

$$ \frac{1}{p}+\frac{cv}{n} < 1 $$

When handling large input, the throughput may be limited by cache miss and page faults. In other words, the throughput hits memory bound. So the theoretical parallel time rate is smaller than the practial one.

Nugine avatar Nov 27 '22 05:11 Nugine

just bench static-experimental --bench base64 --plotting-backend disabled -- 'base64-encode/base64-simd'

base64-encode (GiB/s)

16 32 64 256 1024 4096 65536 262144 524288 1048576
base64-simd/auto 1.827 1.977 3.503 8.008 11.826 12.980 13.494 12.852 12.751 12.723
base64-simd/parallel 1.174 0.004 0.008 0.029 0.114 0.433 5.623 16.196 24.001 31.267

Nugine avatar Nov 29 '22 07:11 Nugine

Would it be useful to have option to parse multiple items, which the cache and instruction level parallelism can probably help here.

Use case will be something like being able to use it in polars to read a column of uuid.

pickfire avatar Jun 27 '24 11:06 pickfire

Would it be useful to have option to parse multiple items, which the cache and instruction level parallelism can probably help here.

Use case will be something like being able to use it in polars to read a column of uuid.

Could you explain this use case in more detail? How can we accelerate it by multithreading?

Nugine avatar Jun 29 '24 09:06 Nugine

Not really multi-threading but aligning on cache lines and being able to process multiple items at once maybe can make it faster? When we need to process batches of it, maybe there are better way to process items in bulk compared to processing each item one by one?

pickfire avatar Jul 25 '24 02:07 pickfire

Not really multi-threading but aligning on cache lines and being able to process multiple items at once maybe can make it faster? When we need to process batches of it, maybe there are better way to process items in bulk compared to processing each item one by one?

Sounds interesting. We can discuss it in another issue #45

Nugine avatar Jul 25 '24 04:07 Nugine