fst
Question: What are the minimum and maximum numbers of cores for which fst's performance is optimized?
Hi all - Thank you for developing this very cool package!
I run my processing / analysis workflow on my work's HPC cluster.
Usually I distribute the work by simply dividing a given dataset into d partitions, and then giving that to n nodes.
Typically I don't give my worker nodes very many cores, though I'm flexible on how many I request for my master node.
In your docs, you state that the performance gains are largely due to parallelization. My questions:
- How many cores do I need to start seeing significant performance improvements when using `fst`?
- How does the single-core performance of `fst` compare to other serialization mechanisms?
- When does `fst` start to see diminishing returns in parallelization / What is the max number of threads/processes `fst` uses?
Thank you!
Hi @mstr3336, thanks for your interesting questions!
The (de)serialization performance of fst on your system (or node) depends on multiple factors:
- Firstly, the maximum speed of the storage device (SSD). `fst` can't push more bytes through your system than this maximum, so that's a hard limit.
- But to increase throughput, the data stream can be compressed. That will increase single-core performance when compressing the data takes less time than writing the bytes that compression removes would have taken.
- When using multiple cores, the actual writing to disk is done with a single thread, and compression can be done on (multiple) background threads. So in that case, speed will be highest when all background threads are 100% utilized to compress the data stream. So the compression factor has to be chosen such that compression takes exactly the same amount of time as writing these bytes to disk.
- So if you have many cores, a higher compression factor can be selected to increase the write speed. With `fst`, decompression will also take more CPU for higher compression settings, but this is a much smaller effect due to the characteristics of the compression algorithms used (`LZ4` and `ZSTD`).
- There is a small overhead when using many threads (due to OpenMP). So at some point, adding more threads will actually lower performance. But in most cases, adding threads up to the number of physical cores in your system will only increase performance (for higher compression settings).
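As a rough illustration of the balance described in the list above, here is a toy model in Python. It is a sketch under simplifying assumptions, not fst's actual implementation: one writer thread, compression threads that scale linearly, no OpenMP overhead, and made-up speeds. The function names and all numbers are hypothetical.

```python
import math


def effective_write_speed(disk_mb_s, compress_mb_s_per_thread, ratio, threads):
    """Estimate uncompressed MB/s for one writer thread fed by `threads` compressors.

    Toy model only: assumes linear scaling and ignores threading overhead.
    """
    if disk_mb_s <= 0 or compress_mb_s_per_thread <= 0:
        raise ValueError("speeds must be positive")
    if ratio < 1 or threads < 1:
        raise ValueError("ratio and threads must be at least 1")
    # Uncompressed data that all compression threads together can process per second.
    compress_rate = compress_mb_s_per_thread * threads
    # The disk receives compressed bytes, so measured in uncompressed data
    # it appears `ratio` times faster than its raw speed.
    disk_rate = disk_mb_s * ratio
    # The slower of the two stages limits the whole pipeline.
    return min(compress_rate, disk_rate)


def threads_to_saturate_disk(disk_mb_s, compress_mb_s_per_thread, ratio):
    """Smallest number of compression threads that keeps the writer thread busy."""
    return math.ceil(disk_mb_s * ratio / compress_mb_s_per_thread)


# Made-up numbers: 500 MB/s disk, 400 MB/s per compression thread, 2x compression.
for n in range(1, 5):
    print(n, effective_write_speed(500, 400, 2.0, n))
print("threads needed:", threads_to_saturate_disk(500, 400, 2.0))
```

With these made-up numbers the estimate rises from 400 to 800 MB/s and then flattens at 1000 MB/s from three threads onward, which is the point where extra cores only help if you also raise the compression setting.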
So the short story is that it depends a lot on your system :-). It's probably best to start with the default compression settings and all available cores, then increase the compression stepwise to see where that gets you!
(this could be done automatically, but as fst files can be read by a completely different system at a later time, we can't know the optimal settings at write time...)
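The stepwise approach can be sketched as a small measurement loop. The example below is in Python and uses the standard-library `zlib` as a stand-in compressor, so it does not measure fst, `LZ4` or `ZSTD`. It only shows the shape of the experiment: raise the level step by step and record the compression ratio and speed. The helper name, the levels and the sample data are all hypothetical.

```python
import time
import zlib


def sweep_levels(data, levels=(1, 3, 6, 9)):
    """Compress `data` at each zlib level and record ratio and throughput."""
    results = []
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        results.append({
            "level": level,
            # How many times smaller the compressed output is.
            "ratio": len(data) / len(compressed),
            # Uncompressed megabytes processed per second on one thread.
            "mb_per_s": len(data) / 1e6 / elapsed if elapsed > 0 else float("inf"),
        })
    return results


# Hypothetical sample: repetitive CSV-like text; real data will behave differently.
sample = b"id,value,label\n" + b"12345,3.14159,category_a\n" * 200_000

for row in sweep_levels(sample):
    print(f"level={row['level']} ratio={row['ratio']:.1f} speed={row['mb_per_s']:.0f} MB/s")
```

On a real system you would run the same kind of sweep with fst itself, on your own data and storage, and stop raising the compression setting once the measured write speed no longer improves.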
thanks