Question: What are the minimum and maximum numbers of cores for which fst's performance is optimized?

Open strazto opened this issue 5 years ago • 1 comments

Hi all - Thank you for developing this very cool package!

I run my processing / analysis workflow on my work's HPC cluster. Usually I distribute the work by simply dividing a given dataset into d partitions, and then giving those to n nodes.

Typically I don't give my worker nodes very many cores, though I'm flexible on how many I request for my master node.

In your docs, you state that the performance gains are largely due to parallelization. My questions:

  1. How many cores do I need to start seeing significant performance improvements when using fst?
  2. How does the single-core performance of fst compare to other serialization mechanisms?
  3. When does fst start to see diminishing returns from parallelization, and what is the maximum number of threads/processes fst uses?

Thank you!

strazto avatar Apr 30 '20 09:04 strazto

Hi @mstr3336, thanks for your interesting questions!

The (de)serialization performance of fst on your system (or node) depends on multiple factors:

  1. Firstly, the maximum speed of the storage device (SSD). fst can't push more bytes through your system than this maximum, so that's a hard limit.
  2. But to increase throughput, the data stream can be compressed. That will increase single-core performance when compressing the data stream takes less time than writing the bytes that compression saves to disk.
  3. When using multiple cores, the actual writing to disk is done with a single thread, and compression can be done on (multiple) background threads. So in that case, speed will be highest when all background threads are 100% utilized compressing the data stream. So the compression factor has to be chosen such that compression takes the same amount of time as writing the compressed bytes to disk.
  4. So if you have many cores, a higher compression factor can be selected to increase the write speed. With fst, decompression will also take more CPU for higher compression settings, but this is a much smaller effect due to the characteristics of the compression algorithms used (LZ4 and ZSTD).
  5. There is a small overhead when using many threads (due to OpenMP). So at some point, adding more threads will actually lower performance. But in most cases, adding threads up to the number of physical cores in your system will only increase performance (for higher compression settings).
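
The trade-off in points 1 to 4 can be sketched as a toy model. This is not how fst is implemented. It only assumes that write throughput is capped by whichever stage is slower: the disk (which sees compressed bytes only) or the pool of compression threads. The function name and all numbers below are made up for illustration, and the OpenMP overhead from point 5 is ignored.

```python
def effective_write_speed(disk_mb_s, ratio, compress_mb_s_per_thread, threads):
    """Uncompressed MB/s written under a simple two-stage pipeline model.

    disk_mb_s: raw write bandwidth of the storage device
    ratio: compression ratio (uncompressed size / compressed size)
    compress_mb_s_per_thread: uncompressed MB/s one thread can compress
    threads: number of background compression threads
    """
    disk_limit = disk_mb_s * ratio  # the disk only has to absorb compressed bytes
    cpu_limit = compress_mb_s_per_thread * threads  # compression runs in parallel
    return min(disk_limit, cpu_limit)


# Hypothetical hardware and compression settings, not measurements of fst.
DISK_MB_S = 500.0
SETTINGS = {
    "light": (2.0, 400.0),  # (ratio, MB/s per thread): fast but modest ratio
    "heavy": (4.0, 100.0),  # slower per thread, but halves the bytes again
}

for threads in (1, 2, 4, 8, 16, 32):
    speeds = {
        name: effective_write_speed(DISK_MB_S, ratio, per_thread, threads)
        for name, (ratio, per_thread) in SETTINGS.items()
    }
    best = max(speeds, key=speeds.get)
    print(threads, speeds, "best:", best)
```

With these invented numbers, the light setting is faster in this sweep up to 8 threads, and the heavy setting is faster at 16 and 32 threads. In the model, each setting stops improving once the compression threads keep up with the disk, which is the balance described in point 3.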

So the short story is that it depends a lot on your system :-). It's probably best to start with the default compression settings and all available cores, and then increase the compression stepwise to see where that gets you!
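
A stepwise sweep like that could look roughly like the sketch below. It is only a stand-in: it uses Python's single-threaded `zlib` instead of fst's LZ4/ZSTD, so the numbers say nothing about fst itself. The helper names `sweep_levels` and `make_zlib_writer` are hypothetical. With fst, the equivalent would be to time `write_fst(x, path, compress = level)` in R for a few values of `compress` on your own data.

```python
import os
import tempfile
import time
import zlib


def sweep_levels(write_at_level, levels):
    """Call write_at_level(level) once per level; return [(level, seconds)]."""
    timings = []
    for level in levels:
        start = time.perf_counter()
        write_at_level(level)
        timings.append((level, time.perf_counter() - start))
    return timings


def make_zlib_writer(payload, path):
    """Stand-in writer: compress with zlib (not fst) and write the result to path."""
    def write_at_level(level):
        with open(path, "wb") as handle:
            handle.write(zlib.compress(payload, level))
            handle.flush()
            os.fsync(handle.fileno())  # include the actual disk write in the timing
    return write_at_level


payload = bytes(range(256)) * 40_000  # about 10 MB of repetitive sample data
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    results = sweep_levels(make_zlib_writer(payload, path), levels=range(0, 10, 3))

best_level, best_seconds = min(results, key=lambda item: item[1])
for level, seconds in results:
    print(f"level {level}: {len(payload) / seconds / 1e6:.0f} MB/s")
print("fastest level in this run:", best_level)
```

A single run per level is noisy, so in practice it helps to repeat each measurement and to use data that resembles the real workload.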

(this could be done automatically, but as fst files can be read by a completely different system at a later time, we can't know the optimal settings at write time...)

thanks

MarcusKlik avatar May 01 '20 22:05 MarcusKlik