Deferred span-based Engine::Put()
Observed during the implementation of the span-based `storeChunk()` API in openPMD; copied over here so that this behavior is also documented in the ADIOS2 issues.
(This feature request is not a high priority for us, but I still wanted it reported.)
The span-based `Engine::Put()` API (`template <class T> typename Variable<T>::Span Put(Variable<T> variable)`) works in `Mode::Sync` mode only, with no way to switch to some notion of `Mode::Deferred`.
At first glance, this looks like the only sensible decision: the idea of `Span` is to wrap a non-owning pointer in a C++ class, so the operation seemingly must run in sync mode, since otherwise it could not return a pointer.
But `adios2::Variable<T>::Span` does not actually work exactly that way. Reallocations of the ADIOS2 buffer may happen between two calls to `span.data()` on the same `span` object, and `adios2::…::Span` is aware of that.
Example of an application where this feature would allow writing programs with vastly better performance (by default):
```cpp
#include <adios2.h>

#include <algorithm> // std::fill
#include <iostream>
#include <vector>

#define VERBOSE 1

int main( int argsc, char ** argsv )
{
    std::string engine_type = "bp4";
    adios2::ADIOS adios;
    adios2::IO IO = adios.DeclareIO( "IO" );
    IO.SetEngine( engine_type );
    adios2::Engine engine = IO.Open( "span.bp", adios2::Mode::Write );

    // write 6GB of data in 1GB pieces
    using datatype = char;
    size_t extent = 1024 * 1024 * 1024;
    std::vector< adios2::detail::Span< datatype > > spans;
    spans.reserve( 6 );

    // first phase: get spans, but don't write anything yet
    for( size_t i = 0; i < 6; ++i )
    {
        std::string variableName = "variable" + std::to_string( i );
        auto variable = IO.DefineVariable< datatype >(
            variableName, { extent }, { 0 }, { extent } );
        // this call doesn't technically need to allocate memory yet,
        // but it does
        spans.emplace_back( engine.Put( variable ) );
#if VERBOSE
        std::cout << "[Phase 1, Span " << i
                  << "]: " << ( void * )spans[ i ].data() << std::endl;
#endif
    }

    // second phase: fill spans with user data
    for( size_t i = 0; i < 6; ++i )
    {
        auto & span = spans[ i ];
#if VERBOSE
        std::cout << "[Phase 2, Span " << i
                  << "]: " << ( void * )spans[ i ].data() << std::endl;
#endif
        std::fill( span.begin(), span.end(), static_cast< datatype >( i ) );
    }

    // third phase: let the engine read what we wrote
    engine.Close();
}
```
With `VERBOSE=1`, the output demonstrates that the pointers of the same `Span`s can differ over time:
```
[Phase 1, Span 0]: 0x7f06c5cce0c5
[Phase 1, Span 1]: 0x7f0682bac10c
[Phase 1, Span 2]: 0x7f0601141153
[Phase 1, Span 3]: 0x7f053d91419a
[Phase 1, Span 4]: 0x7f06bcca81e1
[Phase 1, Span 5]: 0x7f056a346228
[Phase 2, Span 0]: 0x7f042a3460c5
[Phase 2, Span 1]: 0x7f046a34610c
[Phase 2, Span 2]: 0x7f04aa346153
[Phase 2, Span 3]: 0x7f04ea34619a
[Phase 2, Span 4]: 0x7f052a3461e1
[Phase 2, Span 5]: 0x7f056a346228
```
But even with `VERBOSE=0` (i.e. no one actually calls `span.data()` in the first loop), the memory for the spans is allocated eagerly, leading to many reallocations, as KDE Heaptrack demonstrates.
TL;DR: Due to reallocations, users already must call `span.data()` right before writing to a span, otherwise they might hold a stale pointer. This means that the underlying memory need not be allocated before the first call to `span.data()` (or `span.operator[]()` or similar), which would allow building a `Mode::Deferred` Span API.
**Why is this feature important?**
It allows combining the benefits of the Span-based API (lower memory usage) with the benefits of the Deferred API (fewer reallocations).
**What is the potential impact of this feature in the community?**
Since the BP5 serializer might eliminate these problems in a more elegant way, this is debatable. Until BP5 is usable: better performance by default for users of the Span API.
**Is your feature request related to a problem? Please describe.**
Observed while working on this PR. I switched a Python script in the openPMD tooling (`openpmd-pipe`, used to copy a dataset from one backend to another) to the Span API. Effect: worse performance by default. Effect when specifying `InitialBufferSize` correctly: memory usage cut in half.
Previously, `InitialBufferSize` did not need to be specified, since we used the deferred API, which makes the first allocation happen only once all data is known.
**Describe the solution you'd like and potential required effort**
If this is considered an important feature: a deferred version of the span-based `Engine::Put()`. Effort depends on the serializer internals.
**Describe alternatives you've considered and potential required effort**
Short-term: the Deferred API and the span-based API remain mutually exclusive.
Long-term: the BP5 serializer.
@franzpoeschel thanks for sharing your thoughts. Yes, span is a special case. The trade-off of move semantics (to return a Span object) is that it is not compatible with deferred mode. Adding an extra `Put` signature that takes a span would do it, but as you mentioned, setting the initial buffer size would address the issue of frequent reallocation for this particular use case (one big step). In the case of streaming, that cost is amortized if writing the same (or less) amount of data.
If the goal is to get maximum performance from defaults, that depends highly on the case at hand. It would be interesting to find out at which amount of memory on a certain platform (Summit) it would make sense to set an initial buffer. Thanks!
Yes, the Span is returned by value. My point was that the `Span` class does not actually store a concrete pointer, merely a payload position. So a pointer is not returned by value, but computed dynamically (deferred, if you will). The real memory behind the span need not be allocated at that point yet.
So I think it should even be possible to implement this while keeping the exact same API (that was the motivating point for this feature request). The question is whether it would be worth the effort.
@franzpoeschel yes, that's a great idea. There are a few signatures for data access as well. Also, would the first call to data access allocate all the expected memory (from other spans) or just for that span?
> Also, would the first call to data access allocate all the expected memory (from other spans) or just for that span?

The three-phase approach shown in my code example above would work efficiently with both options, so I suppose either would be fine. Given the ADIOS2 internals, I think allocating all memory at once would be easier? Also, I don't think there is a real benefit to allocating memory only for the span that triggered the allocation.