
[WIP] openPMD plugin: Flush data to disk within a step

Open franzpoeschel opened this pull request • 5 comments

The upcoming BP5 engine in ADIOS2 has some features for saving memory compared to BP4.

BP5 will not replace BP4 because these memory optimizations come at a runtime cost; instead, users will be able to choose between runtime efficiency and memory efficiency.

One feature that we asked for and that is now implemented is the ability to flush data to disk within a single IO step. I'm currently working on exposing this functionality in openPMD. Together with that openPMD-api PR, this PR makes the feature available as a preview in PIConGPU.
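
For context, here is a minimal sketch of what using this could look like from the openPMD-api side. It assumes the Series::flush(backendConfig) overload and the "preferred_flush_target" key from the linked openPMD-api PR; both may still change (see the TODO list below), so treat this as an illustration rather than the final interface.

    // Minimal sketch (not PIConGPU's plugin code): flushing data to disk
    // while an IO step is still open. Assumes the Series::flush(backendConfig)
    // overload and the "preferred_flush_target" key from the openPMD-api PR.
    #include <openPMD/openPMD.hpp>

    void writePartOfStep(openPMD::Series &series, openPMD::Iteration &iteration)
    {
        // ... storeChunk() calls for some record components go here ...

        // Ask the BP5 engine to move the buffered data to disk now instead of
        // keeping it in host memory until the step is closed.
        series.flush(
            R"({"adios2": {"engine": {"preferred_flush_target": "disk"}}})");

        // ... further storeChunk() calls, then eventually:
        iteration.close(); // closing the step flushes whatever is still buffered
    }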

Pinging @psychocoderHPC because he asked for this feature

TODO:

  • [x] Merge https://github.com/openPMD/openPMD-api/pull/1207 in openPMD-api
  • [x] Resolve https://github.com/openPMD/openPMD-api/issues/1205 in openPMD-api before starting to use BP5 in production workflows; otherwise there will be too much confusion
  • [x] Maybe wait for the ADIOS 2.8.0 release, which will contain the BP5 engine for the first time
  • [x] There might still be API changes in the openPMD-api PR; adapt to them
  • [ ] Parallel testing

First results

I ran four tests, each one writing three IO steps at a bit more than 15 GB per step (see the configuration sketch after the list):

  1. BP4 engine without InitialBufferSize
  2. BP4 engine with correctly specified InitialBufferSize
  3. BP5 engine without this PR
  4. BP5 engine with this PR, aggressively writing to disk as often as possible
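
For reference, a hedged sketch of how setups (2) and (3) can be expressed through the openPMD-api JSON options (in PIConGPU this JSON would go into the openPMD plugin's JSON parameter). The engine parameter names are taken from the ADIOS2 documentation; file name and buffer size are illustrative only.

    // Hedged configuration sketch for setups (2) and (3).
    #include <openPMD/openPMD.hpp>
    #include <string>

    // Setup (2): BP4 with a write buffer pre-allocated for a full step.
    openPMD::Series openBP4(std::string const &file)
    {
        return openPMD::Series(
            file,
            openPMD::Access::CREATE,
            R"({"adios2": {"engine": {
                    "type": "bp4",
                    "parameters": {"InitialBufferSize": "16Gb"}}}})");
    }

    // Setup (3): BP5, buffers are allocated chunk by chunk as needed.
    openPMD::Series openBP5(std::string const &file)
    {
        return openPMD::Series(
            file,
            openPMD::Access::CREATE,
            R"({"adios2": {"engine": {"type": "bp5"}}})");
    }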

The memory profiles of the four runs are shown row by row in the following screenshot; note the different y-axis scales:

( 1 | 2 )
( 3 | 4 )

[Screenshot from 2022-02-28 15:30:54: memory profiles of runs 1–4]

Further details: [Screenshot from 2022-02-28 15:30:35]

Interpretation:

  1. Known pathological behavior when InitialBufferSize is not specified; don't do this.
  2. Best speed, but high memory usage and the need to specify InitialBufferSize beforehand.
  3. Buffers are allocated as needed; memory usage is equivalent to run 2 when InitialBufferSize there is specified with exactly the right amount.
  4. Lowest memory usage, but a long runtime due to many small write operations. Compared to the current ADIOS2 output, this saves ~15 GB of peak memory usage.

As it stands, the runtime of the BP5-based approaches is very long in these benchmarks. The parameters of the BP5 engine are not yet documented, so I have not really had a chance to tune this yet.

franzpoeschel avatar Feb 28 '22 14:02 franzpoeschel

@pnorbert suggested specifying BufferChunkSize as 2 Gb; with this setting I got performance close to BP4 for setup (3), and a bit slower than that for setup (4):

[Screenshots from 2022-03-01 15:22:38 and 15:22:53]
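
For reference, a sketch of where this parameter goes in the JSON configuration used above; whether ADIOS2 accepts the size with a unit suffix ("2Gb") or only as a byte count should be checked against the ADIOS2 documentation.

    // Hedged sketch: BP5 with the suggested buffer chunk size.
    #include <openPMD/openPMD.hpp>
    #include <string>

    openPMD::Series openBP5WithChunkSize(std::string const &file)
    {
        return openPMD::Series(
            file,
            openPMD::Access::CREATE,
            R"({"adios2": {"engine": {
                    "type": "bp5",
                    "parameters": {"BufferChunkSize": "2Gb"}}}})");
    }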

franzpoeschel avatar Mar 01 '22 14:03 franzpoeschel

Note that in the above image the first graph reaches a surprisingly large peak memory consumption of 110 GB. This is virtual memory only. Essentially, specifying BufferChunkSize=2GB requests ADIOS2 to over-allocate memory and use only very little of each chunk. The BP5 engine uses malloc for allocation, which is what Heaptrack tracks. Only a small percentage of the malloc'ed memory is actually backed by physical memory. I confirmed this by monitoring PIConGPU with top while it was running:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                
4025970 franzpo+  20   0  135.1g  46.1g 136268 R  98.7  36.8   1:31.97 picongpu 

Here, the virtual memory usage is even higher than what Heaptrack reports, but the physical memory (RES) peaks at 55 GB.

That being said, I don't know whether Slurm or other batch systems understand this, i.e. whether they go by physical or virtual memory when monitoring the memory usage of jobs.
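
As a side note, the effect can be reproduced outside of PIConGPU with a small Linux-only sketch (sizes are illustrative): a large malloc immediately shows up in the virtual memory size (VmSize), while the resident set size (VmRSS) only grows for the pages that are actually written to.

    // Illustration of virtual vs. physical memory after a large malloc (Linux).
    #include <cstdlib>
    #include <cstring>
    #include <fstream>
    #include <iostream>
    #include <string>

    static void printMemoryStatus(char const *label)
    {
        std::ifstream status("/proc/self/status");
        for (std::string line; std::getline(status, line);)
            if (line.rfind("VmSize", 0) == 0 || line.rfind("VmRSS", 0) == 0)
                std::cout << label << ": " << line << '\n';
    }

    int main()
    {
        printMemoryStatus("before malloc");
        std::size_t const chunk = std::size_t(2) * 1024 * 1024 * 1024; // 2 GiB
        char *buffer = static_cast<char *>(std::malloc(chunk));
        printMemoryStatus("after malloc (mostly virtual)");
        if (buffer != nullptr)
            std::memset(buffer, 0, chunk / 16); // touch only 1/16 of the chunk
        printMemoryStatus("after touching 1/16 of it (now resident)");
        std::free(buffer);
    }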

franzpoeschel avatar Mar 01 '22 18:03 franzpoeschel

The high virtual memory usage is now fixed in ADIOS2; see the top row in the screenshot: [Screenshot from 2022-03-03 15:13:33]

Also, Norbert told me that there are needless copies in the setup I use, so I activated the span-based API for BP5 in openPMD; the result is the bottom row, which is actually faster than BP4.

I assume that this is because BP4 initializes 20 GB of memory with zeroes, so the advantage probably will not translate to runs at scale (initialization happens only once, the difference is exaggerated by running under Heaptrack, and IO efficiency will dominate over serialization efficiency at scale).

franzpoeschel avatar Mar 03 '22 14:03 franzpoeschel

In combination with the mapped-memory data preparation strategy: who needs host memory? [Screenshot from 2022-03-03 15:57:28]

Given that we will probably add a third data preparation strategy for Frontier, ending up with memory profiles like this one might not be out of the question.

franzpoeschel avatar Mar 03 '22 14:03 franzpoeschel

This PR now contains a working suggestion for handling different flush targets via JSON configuration.
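
Purely as a hypothetical illustration of the kind of configuration this is about (the schema actually proposed in this PR is not reproduced here): a default flush target could be set in the Series-level ADIOS2 options, while individual flush() calls override it as in the first sketch above.

    // Hypothetical illustration, not the schema proposed in this PR: a default
    // flush target for the whole Series; flush() calls may still override it.
    #include <openPMD/openPMD.hpp>
    #include <string>

    openPMD::Series openWithDefaultFlushTarget(std::string const &file)
    {
        return openPMD::Series(
            file,
            openPMD::Access::CREATE,
            R"({"adios2": {"engine": {
                    "type": "bp5",
                    "preferred_flush_target": "disk"}}})");
    }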

franzpoeschel avatar May 09 '22 12:05 franzpoeschel

Thanks for working on this feature; this change is required to reduce the memory footprint of IO on ORNL Crusher/Frontier and other systems with little host memory compared to the GPU memory.

Sorry, I was not aware that you had pushed new changes to this PR. Please ping me next time. I will review this PR next week.

psychocoderHPC avatar Nov 04 '22 15:11 psychocoderHPC