Deferred span-based Engine::Put()

Open franzpoeschel opened this issue 3 years ago • 4 comments

Observed during the implementation of the span-based storeChunk()-API in openPMD, copied over here to have this behavior documented in the ADIOS2 issues, too. (This feature request has no high priority for us, but I still wanted to have this reported.)

The span-based Engine::Put() API (template <class T> typename Variable<T>::Span Put(Variable<T> variable)) always operates in Mode::Sync, with no way to switch to some notion of Mode::Deferred. At first glance, this looks like the only sensible decision: the idea of Span is to wrap a non-owning pointer in a C++ class, so this operation seemingly has to run in sync mode, otherwise it couldn't return a pointer.

But adios2::Variable<T>::Span doesn't actually work exactly that way. Reallocations of the ADIOS2 buffer may happen in between two calls to span.data() on the same span object and adios2::…::Span is aware of that.
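To illustrate the point, here is a minimal sketch of a span that stores a buffer position instead of a pointer. ToyBuffer, ToySpan and span_survives_growth are hypothetical names made up for this example; this is not the ADIOS2 implementation, only a model of the behavior described above, assuming the internal buffer behaves like a growable std::vector<char>.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the engine's internal buffer (not ADIOS2 code).
struct ToyBuffer
{
    std::vector<char> bytes;

    // Append `count` bytes at the end and return the offset where they start.
    std::size_t append(std::size_t count)
    {
        const std::size_t offset = bytes.size();
        bytes.resize(bytes.size() + count); // may reallocate and move the data
        return offset;
    }
};

// Hypothetical span: remembers a position in the buffer, never a raw pointer.
class ToySpan
{
public:
    ToySpan(ToyBuffer &buffer, std::size_t offset, std::size_t size)
        : m_buffer(&buffer), m_offset(offset), m_size(size)
    {
        assert(offset + size <= buffer.bytes.size());
    }

    // The pointer is recomputed from the buffer's current base on every call,
    // so it stays valid even if the buffer was reallocated in the meantime.
    char *data() const { return m_buffer->bytes.data() + m_offset; }
    std::size_t size() const { return m_size; }

private:
    ToyBuffer *m_buffer;
    std::size_t m_offset;
    std::size_t m_size;
};

// Returns true if data written through a span is still reachable through the
// same span after the buffer has grown (and possibly moved).
bool span_survives_growth()
{
    ToyBuffer buffer;
    ToySpan first(buffer, buffer.append(16), 16);
    first.data()[0] = 'x';

    // Growing the buffer may move its storage. A pointer cached from an
    // earlier first.data() call would be stale after this line.
    buffer.append(1 << 20);

    return first.data()[0] == 'x';
}
```

The design choice to note is that only data() touches the real memory. Code that caches the result of data() across a buffer growth holds a stale pointer, while code that calls data() again right before writing does not.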

Example of an application where this feature would make it possible to write programs with vastly better performance by default:

#include <adios2.h>
#include <algorithm> // std::fill
#include <iostream>
#include <string> // std::string, std::to_string
#include <vector>

#define VERBOSE 1

int main( int argsc, char ** argsv )
{
    std::string engine_type = "bp4";

    adios2::ADIOS adios;
    adios2::IO IO = adios.DeclareIO( "IO" );
    IO.SetEngine( engine_type );
    adios2::Engine engine = IO.Open( "span.bp", adios2::Mode::Write );

    // write 6GB of data in 1GB pieces
    using datatype = char;
    size_t extent = 1024 * 1024 * 1024;

    std::vector< adios2::detail::Span< datatype > > spans;
    spans.reserve( 6 );

    // first phase: get spans, but don't write anything yet
    for( size_t i = 0; i < 6; ++i )
    {
        std::string variableName = "variable" + std::to_string( i );
        auto variable = IO.DefineVariable< datatype >(
            variableName, { extent }, { 0 }, { extent } );
        // this call doesn't technically need to allocate memory yet
        // but it does
        spans.emplace_back( engine.Put( variable ) );
#if VERBOSE
        std::cout << "[Phase 1, Span " << i
                  << "]: " << ( void * )spans[ i ].data() << std::endl;
#endif
    }

    // second phase: fill spans with user data
    for( size_t i = 0; i < 6; ++i )
    {
        auto & span = spans[ i ];
#if VERBOSE
        std::cout << "[Phase 2, Span " << i
                  << "]: " << ( void * )spans[ i ].data() << std::endl;
#endif
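        // explicit cast: i is a size_t, the span holds elements of type datatype
        std::fill( span.begin(), span.end(), static_cast< datatype >( i ) );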
    }

    // third phase: let the engine read what we wrote
    engine.Close();
}

If VERBOSE=1, the output demonstrates that the pointers of the same Spans can be different over time:

[Phase 1, Span 0]: 0x7f06c5cce0c5
[Phase 1, Span 1]: 0x7f0682bac10c
[Phase 1, Span 2]: 0x7f0601141153
[Phase 1, Span 3]: 0x7f053d91419a
[Phase 1, Span 4]: 0x7f06bcca81e1
[Phase 1, Span 5]: 0x7f056a346228
[Phase 2, Span 0]: 0x7f042a3460c5
[Phase 2, Span 1]: 0x7f046a34610c
[Phase 2, Span 2]: 0x7f04aa346153
[Phase 2, Span 3]: 0x7f04ea34619a
[Phase 2, Span 4]: 0x7f052a3461e1
[Phase 2, Span 5]: 0x7f056a346228

But even if VERBOSE=0 (i.e. no one actually calls span.data() in the first loop), the memory for the spans is allocated eagerly, leading to many reallocations, as KDE Heaptrack demonstrates: [Screenshot from 2021-03-15 14-57-49]

TLDR: Due to reallocations, users already must call span.data() right before writing to a span, otherwise they might hold a stale pointer. This means that the underlying memory need not be allocated before the first call to span.data() (or span.operator[]() or similar), which would make it possible to build a Mode::Deferred Span API.

Why is this feature important?
It would combine the benefits of the Span-based API (lower memory usage) with those of the deferred API (fewer reallocations).

What is the potential impact of this feature in the community?
Since the BP5 serializer might eliminate those problems in a more elegant way, this is debatable. Until that is usable: better performance by default for users of the Span API.

Is your feature request related to a problem? Please describe.
Observed while working on this PR. I switched a Python script in the openPMD tooling (openpmd-pipe, used to copy a dataset from one backend to another) to the Span API. Effect: worse performance by default. Effect when InitialBufferSize is specified correctly: memory usage is cut in half. Previously, InitialBufferSize did not need to be specified since we used the deferred API, so the first allocation happened only once all data was known.

Describe the solution you'd like and potential required effort
If this is considered an important feature: a deferred version of the span-based Engine::Put(). The effort depends on the serializer internals.

Describe alternatives you've considered and potential required effort
Short-term: the deferred API and the span-based API remain mutually exclusive. Long-term: the BP5 serializer.

franzpoeschel • Apr 01 '21 13:04

@franzpoeschel thanks for sharing your thoughts. Yes, span is a special case. The trade-off of move semantics (returning a Span object) is that it is not compatible with deferred mode. Adding an extra Put signature that takes a span would do it, but, as you mentioned, setting the initial buffer size would address the frequent reallocations for this particular use case (one big step). In the case of streaming, that cost is amortized when each step writes the same (or a smaller) amount of data.

If the goal is to get maximum performance from the defaults, that would depend highly on the specific case. It would be interesting to find out at which amount of memory on a given platform (Summit) it makes sense to set an initial buffer size. Thanks!
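The effect of a well-chosen initial buffer size can be sketched with a plain std::vector<char> standing in for the engine buffer. This is an assumption made for illustration: the growth policy of the real ADIOS2 buffer may differ, and count_reallocations is a hypothetical helper, not an ADIOS2 function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Append `pieces` chunks of `chunk` bytes each to a growable buffer and count
// how often its storage had to be reallocated (observed as a capacity change).
// `initial_buffer_size` models a preallocated buffer; 0 means no preallocation.
std::size_t count_reallocations(std::size_t pieces, std::size_t chunk,
                                std::size_t initial_buffer_size)
{
    assert(chunk > 0);

    std::vector<char> buffer;
    buffer.reserve(initial_buffer_size);

    std::size_t reallocations = 0;
    for (std::size_t i = 0; i < pieces; ++i)
    {
        const std::size_t before = buffer.capacity();
        buffer.resize(buffer.size() + chunk); // reallocates if capacity is exceeded
        if (buffer.capacity() != before)
        {
            ++reallocations;
        }
    }
    return reallocations;
}
```

Calling count_reallocations(6, 1024, 6 * 1024) reports no reallocations, since everything fits into the reserved storage. With an initial size of 0, the buffer has to grow at least once, and how often exactly depends on the growth policy of the standard library in use.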

williamfgc • Apr 01 '21 14:04

Yes, the Span is returned by value. My point was that the Span class does not actually store a concrete pointer, merely a payload position. So, a pointer is not returned by value, but computed dynamically (deferred, if you will). The real memory behind the span need not be allocated at that point yet. So, I think that it should even be possible to implement this while keeping the exact same API (that was the motivating point for this feature request). The question is whether that would be worth the effort.

franzpoeschel • Apr 01 '21 16:04

@franzpoeschel yes, that's a great idea. There are a few signatures for data access as well. Also, would the first call to data access allocate all the expected memory (from other spans) or just for that span?

williamfgc • Apr 06 '21 00:04

"Also, would the first call to data access allocate all the expected memory (from other spans) or just for that span?"

The three-phase approach shown in my code example above would work efficiently for both options, so I suppose either option would be fine. Given the ADIOS2 internals, I think allocating all memory at once would be easier? Also, I don't think that there is a real benefit to allocating memory just for the span that triggered the allocation.
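A minimal sketch of the "allocate everything at once" option, under the assumption that requesting a span only needs to record its size. LazyBuffer, LazySpan and three_phase_allocations are hypothetical names for this illustration and do not reflect the ADIOS2 serializer internals.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical buffer that only records sizes when a span is requested and
// allocates memory for all requested spans on the first data access.
class LazyBuffer
{
public:
    // Register `count` bytes and return their offset. No memory is allocated.
    std::size_t put(std::size_t count)
    {
        const std::size_t offset = m_requested;
        m_requested += count;
        return offset;
    }

    // Pointer to the byte at `offset`. Grows the storage to cover everything
    // requested so far if that has not happened yet.
    char *data(std::size_t offset)
    {
        assert(offset < m_requested);
        if (m_bytes.size() < m_requested)
        {
            m_bytes.resize(m_requested);
            ++m_allocations;
        }
        return m_bytes.data() + offset;
    }

    std::size_t allocations() const { return m_allocations; }
    std::size_t allocated_bytes() const { return m_bytes.size(); }

private:
    std::vector<char> m_bytes;
    std::size_t m_requested = 0;   // total bytes registered via put()
    std::size_t m_allocations = 0; // how often the storage was (re)sized
};

// Hypothetical span on top of LazyBuffer: stores a position, not a pointer.
class LazySpan
{
public:
    LazySpan(LazyBuffer &buffer, std::size_t size)
        : m_buffer(&buffer), m_offset(buffer.put(size)), m_size(size) {}

    char *data() const { return m_buffer->data(m_offset); }
    std::size_t size() const { return m_size; }

private:
    LazyBuffer *m_buffer;
    std::size_t m_offset;
    std::size_t m_size;
};

// Runs the three-phase pattern and returns the number of allocations.
std::size_t three_phase_allocations(std::size_t count, std::size_t extent)
{
    LazyBuffer buffer;
    std::vector<LazySpan> spans;
    spans.reserve(count);

    // Phase 1: request all spans, nothing is allocated yet.
    for (std::size_t i = 0; i < count; ++i)
    {
        spans.emplace_back(buffer, extent);
    }
    // Phase 2: fill the spans, the first data() call allocates for all of them.
    for (std::size_t i = 0; i < count; ++i)
    {
        std::fill_n(spans[i].data(), spans[i].size(), static_cast<char>(i));
    }
    return buffer.allocations();
}
```

In this model, three_phase_allocations(6, 1024) performs a single allocation. If spans were requested again after the first data access, the buffer would grow once more at the next access, which is the same stale-pointer situation users of the current API already have to handle.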

franzpoeschel • Apr 06 '21 10:04