openPMD-api
Mapping between ADIOS steps and openPMD iterations
Background: So far, our streaming API assumes that each ADIOS step corresponds to exactly one openPMD iteration and that those iterations are found in ascending order. Once we expose the ADIOS2 Append access mode, this will no longer necessarily hold true, so this PR explores more flexible alternatives.
Scenario: Run a simulation with data output every 50 steps and checkpoints every 500 steps, using the step-based iteration layout (or the group-based iteration layout with ADIOS steps activated). The simulation crashes at step 750 and restarts from 500. Data output then needs to be appended to the (single file!) `output.bp`. From the first run, we have the following:
0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750
When appending, we cannot remove any old steps, just append new ones. So, our file will look like:
0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 500 550 600 650 700 …
Goal: Be able to read that.
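To make the scenario concrete, here is a minimal sketch in plain Python that reproduces the step sequence shown above and lists the iterations that end up in the file twice. It does not touch ADIOS2 or openPMD-api; the helper name `output_iterations` is made up for illustration, and the restarted run is cut off at 700 to match the part of the sequence shown above.

```python
def output_iterations(start, stop, every=50):
    """Iterations written by one run, from start to stop (inclusive)."""
    return list(range(start, stop + 1, every))


# First run: output every 50 steps until the crash at step 750.
first_run = output_iterations(0, 750)

# Restarted run: resumes from the checkpoint at 500 and appends its output.
restarted_run = output_iterations(500, 700)

# Append mode cannot remove old steps, so the file holds both sequences.
steps_in_file = first_run + restarted_run

# Iterations that are now defined by more than one ADIOS step.
duplicated = sorted({it for it in steps_in_file if steps_in_file.count(it) > 1})

print(steps_in_file)
print(duplicated)
```

The "overlap zone" is exactly the `duplicated` list: every iteration between the checkpoint and the crash that the restarted run writes again.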
First step (useful independent of this issue): Annotate each ADIOS step with the openPMD iteration(s) that it defines.
My current approach is to use the ~~`/data/__step__`~~ `/data/snapshot` attribute introduced by #855 to store the openPMD iteration(s) contained in the current ADIOS step. The reading procedures can then query that attribute to see which iteration they should return to the user. If the attribute isn't found, they fall back to the old solution.
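The lookup with fallback could be modelled roughly as follows. This is a sketch in plain Python, not the openPMD-api implementation: a step's attributes are represented as a dictionary, and the function name `iterations_in_step` and its parameters are hypothetical.

```python
SNAPSHOT_ATTRIBUTE = "/data/snapshot"


def iterations_in_step(step_attributes, step_index, ascending_iterations):
    """Return the openPMD iteration(s) defined by one ADIOS step.

    step_attributes: attributes visible in the current step (a dict here).
    step_index: zero-based index of the current step.
    ascending_iterations: all known iterations, sorted in ascending order.
    """
    snapshot = step_attributes.get(SNAPSHOT_ATTRIBUTE)
    if snapshot is not None:
        # New behaviour: the step tells us which iteration(s) it contains.
        return list(snapshot)
    # Old behaviour: assume one iteration per step, in ascending order.
    return [ascending_iterations[step_index]]


# Step 16 of the appended file, annotated by the writer.
annotated = iterations_in_step({SNAPSHOT_ATTRIBUTE: [500]}, 16, [])

# A file written without the attribute: the third step is assumed to be
# the third-smallest iteration.
legacy = iterations_in_step({}, 2, [0, 50, 100])

print(annotated, legacy)
```

Storing a list rather than a single number leaves room for a step that defines several iterations at once.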
TODO:
- [ ] reading: as long as we have no ADIOS append-and-truncate for writes, we need to pick the "last" key in `/data/snapshot` when reading `series.iteration[key]`; fixing this is for a follow-up PR
- [x] Testing, documentation, cleanup, edge cases
- [x] Merge #1007 first
- [x] Add some further iterations to appending test
- [x] Fix todos (see in-line comments)
- [x] Check: snapshot attribute in RW mode (don't set for existing iterations)
- [ ] update standard: WIP in https://github.com/ax3l/openPMD-standard/pull/1, https://github.com/openPMD/openPMD-standard/pull/250:
- [x] See latest bugfix in topic-read-leniently
- [x] merge #1218 first
- [x] merge #1302 first
- [x] see https://github.com/openPMD/openPMD-standard/pull/250/files#r949996254, https://github.com/openPMD/openPMD-standard/pull/250/files#r949999743
For the group/variable based files, Is there an option to not write an iteration if it already exists and is valid?
Do you mean when appending to an existing Series? If yes, that's a bit challenging, as ADIOS Append mode does not give any read access and openPMD has no handling for redundantly defined iterations yet. So short answer: No, but I want to look at that specific situation again once #1007 and this PR are merged
Oh, I was referring to your example, restarting at checkpoint 500, which is a few steps before the latest iteration 750. I guess it is a bit of work to read contents from the existing file first and then ask ADIOS to append at the right place.
Maybe in the future an alternative is to consider having one file per checkpoint. This way there is no need to append. Always start a new file at a checkpoint.
> Oh, I was referring to your example, restarting at checkpoint 500, which is a few steps before the latest iteration 750. I guess it is a bit of work to read contents from the existing file first and then ask ADIOS to append at the right place.

In that case, it would be better to overwrite the old data with the new data. For that, we will either have to truncate the old file at write time or skip redundant iterations at read time. Neither option is entirely trivial, and both might require further support from ADIOS.
> Maybe in the future an alternative is to consider having one file per checkpoint. This way there is no need to append. Always start a new file at a checkpoint.

That would still give you redundantly defined iterations which need to be handled at read time somehow, while adding the additional complexity of needing to handle several files. I'm not sure there would be any benefit to that approach?
If every checkpoint has its own file, no append is needed. A restart always overwrites, so we should not see redundant iterations, e.g. in your example: file0_500.bp, file550_1000.bp, etc. If the crash happened at step 750, restart at step 500 and rewrite file550_1000.bp.
Yes, this would need new support for reading this set of files.
ADIOS is not likely to support remove/update functions as far as I can see. Just my two cents to work around it.
> If every checkpoint has its own file, no append is needed. A restart always overwrites, so we should not see redundant iterations, e.g. in your example: file0_500.bp, file550_1000.bp, etc. If the crash happened at step 750, restart at step 500 and rewrite file550_1000.bp.

This is not about checkpoints. It's about what happens to regular data output when restarting from a checkpoint. Checkpoints already usually work the way you describe and there's no reason to change that.
But when restarting from a checkpoint, you get an "overlap zone" where output steps are written a second time. This is tricky to handle in group/variable-based iteration encodings. This PR is a first step toward solving that, though it is not yet a solution.
> Yes, this would need new support for reading this set of files.
> ADIOS is not likely to support remove/update functions as far as I can see. Just my two cents to work around it.

Norbert once suggested a truncate option for appending. Alternatively, we can eliminate duplicate iterations on our own at read time.
Note: ~~Please don't review this just yet, I'm currently rebasing this upon #1007 because this needs additional logic to deal with files created via Append mode.~~
Ok, things are fixed now, the other PR should still go first anyway
Note: BP5 now has a feature for truncation upon appending.
It should already be possible to use this manually, e.g. by specifying `{"adios2": {"engine": {"parameters": {"AppendAfterSteps": "-3"}}}}`.
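As a small illustration, the configuration string can be assembled with Python's standard `json` module instead of being typed by hand. The helper name `append_after_steps_config` is hypothetical, and the top-level key is assumed to be `adios2`, as in openPMD-api's documented JSON configuration. The value `-3` is simply the one from the comment above; its exact meaning is defined by the ADIOS2 BP5 engine documentation.

```python
import json


def append_after_steps_config(value):
    """Build the JSON options string that sets BP5's AppendAfterSteps."""
    # ADIOS2 engine parameters are passed as strings.
    options = {
        "adios2": {"engine": {"parameters": {"AppendAfterSteps": str(value)}}}
    }
    return json.dumps(options)


config = append_after_steps_config(-3)
print(config)

# Assumption, not executed here: with the openPMD-api Python bindings the
# string would be handed over as the options argument when opening the
# Series in Append mode, along the lines of
#   series = openpmd_api.Series("output.bp", openpmd_api.Access.append, config)
```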
Possible improvement for a follow-up PR: Handle datasets with duplicate iterations more gracefully, e.g. by skipping any but the last instance of the iteration (currently: any but the first is skipped)
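The difference between the two policies can be sketched in a few lines of plain Python. The list `snapshots` stands for the iteration announced by each ADIOS step, in file order; the function `select_steps` is hypothetical and only models the choice of which step to read for each iteration.

```python
def select_steps(snapshots, keep="last"):
    """Map each iteration to the step index that should be read for it.

    keep="first": later duplicates are skipped (the current behaviour
                  described above).
    keep="last":  the most recently appended instance wins.
    """
    chosen = {}
    for step, iteration in enumerate(snapshots):
        if keep == "last" or iteration not in chosen:
            chosen[iteration] = step
    return dict(sorted(chosen.items()))


# Crash at 750, restart from the checkpoint at 500, output every 50 steps.
snapshots = list(range(0, 751, 50)) + list(range(500, 701, 50))

first = select_steps(snapshots, keep="first")
last = select_steps(snapshots, keep="last")

print(first[500], last[500])
```

With `keep="last"`, iterations in the overlap zone are served from the restarted run, which is usually what a user expects after a crash.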
Note: This PR handles two kinds of dataset layout:
- Datasets in which iterations are written in non-chronological order (e.g. iterations are written in order 1,2,4,3,5)
- With the Append mode, it's possible in ADIOS2 to create datasets that contain the same iteration twice in different steps.
In ADIOS2, such a dataset can only be read successfully if the `snapshot` attribute is defined:
- If the snapshot attribute is defined: openPMD knows which iteration is contained in the current step and can give it to the user.
- If the snapshot attribute is not defined:
  - Backends other than ADIOS2 allow random access to their data, so we just iterate through the iterations in rising order.
  - In ADIOS2, the dataset is read step by step, but openPMD does not know which step has which iteration. As a fallback, we assume that the iterations are in rising order, creating a mismatch in such datasets.
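The mismatch caused by the fallback can be illustrated with the write order 1,2,4,3,5 from the list above. This is a plain-Python model under the assumption that the reader knows the set of iterations but not their order in the file; the function names are made up for this sketch.

```python
def fallback_labels(written_order):
    """Fallback assumption: step i holds the i-th smallest iteration."""
    return sorted(written_order)


def mismatched_steps(written_order):
    """Indices of steps whose assumed iteration differs from the real one."""
    assumed = fallback_labels(written_order)
    return [
        step
        for step, (actual, guess) in enumerate(zip(written_order, assumed))
        if actual != guess
    ]


written = [1, 2, 4, 3, 5]
print(fallback_labels(written))
print(mismatched_steps(written))
```

For a dataset written in ascending order, `mismatched_steps` returns an empty list, which is why the fallback was sufficient before Append mode and out-of-order writes came into play.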
Opening such a Series anyway might look somewhat like this (using the error-recovery abilities of #1237):
openpmd-ls ../samples/append_groupbased.bp
openPMD series: append_groupbased
openPMD standard: 1.1.0
openPMD extensions: 0
data author: unknown
data created: 2022-05-17 12:20:15 +0000
data backend: ADIOS2
generating machine: unknown
generating software: openPMD-api (version: 0.15.0-dev)
generating software dependencies: unknown
number of iterations: 8 (groupBased)
all iterations: 0 1 2 3 4 [ADIOS2] Warning: Attribute with name /data/7/meshes/E/x/value has no type in backend.
Cannot read record component 'x' and will skip it due to read error:
Read Error in backend ADIOS2
Object type: Attribute
Error type: NotFound
Further description: /data/7/meshes/E/x/value
[ADIOS2] Warning: Attribute with name /data/7/meshes/E/x/value has no type in backend.
Cannot read record component 'x' and will skip it due to read error:
Read Error in backend ADIOS2
Object type: Attribute
Error type: NotFound
Further description: /data/7/meshes/E/x/value
7 10 [ADIOS2] Warning: Attribute with name /data/11/meshes/E/x/value has no type in backend.
Cannot read record component 'x' and will skip it due to read error:
Read Error in backend ADIOS2
Object type: Attribute
Error type: NotFound
Further description: /data/11/meshes/E/x/value
[ADIOS2] Warning: Attribute with name /data/11/meshes/E/x/value has no type in backend.
Cannot read record component 'x' and will skip it due to read error:
Read Error in backend ADIOS2
Object type: Attribute
Error type: NotFound
Further description: /data/11/meshes/E/x/value
11
number of meshes: 1
all meshes:
E
number of particle species: 0
Solution to this:
- In the current release cycle: As a workaround, we should resolve #1274 and add a random-access read mode to ADIOS2.
- In the next release cycle: We will revise the ADIOS2 schemas and hopefully make the capabilities of `snapshot` available to users on a broader scale. Currently, those features are only available in the experimental new ADIOS2 schema.
Next note: Once this PR and #1218 are merged, the `append_mode` test should be extended to test the `AppendAfterSteps` option of BP5. Both PRs are necessary to do that properly, but before the upcoming release we should make sure that truncation works as expected.
Somehow the Icpc issue is back and I can even reproduce it locally. Does somehow not fully deactivate the warning, will have a look..
With `-Werror`, I get the same build failure even on dev. Given that this warning is a compiler bug, should we maybe just deactivate the `-Werror` flag for now?
EDIT: Latest commit uses `-Werror -wd1011`, this fixes it.
-------------------------------------------------------------------------------
append_mode
-------------------------------------------------------------------------------
/home/runner/work/openPMD-api/openPMD-api/test/SerialIOTest.cpp:5886
...............................................................................
/home/runner/work/openPMD-api/openPMD-api/test/SerialIOTest.cpp:5886: FAILED:
{Unknown expression after the reported line}
due to unexpected exception with message:
ERROR: attribute /date has been defined and its value cannot be changed, in
call to DefineAttribute
This sporadically happens with this PR in the ASAN UBSAN CI run. Apparently, putting verbose output in the relevant test avoids the issue, making this hard to debug in the CI. I'll have to try recreating the environment locally; I wasn't able to reproduce this so far.
I've recreated the environment and reproduced the issue locally now, it seems like an ADIOS2 glitch:
~│ │#0 0x00007f21bac88672 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6 [272/304]
~│ │#1 0x00007f21b9d0064e in adios2::core::Attribute<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >& adios2::c
~│ │ore::IO::DefineAttribute<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__cxx11::basic_string<char, st
~│ │d::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, s
~│ │td::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>
~│ │, std::allocator<char> >) () from /usr/local/lib/../lib/libadios2_core.so.2
~│ │#2 0x00007f21ba0436a5 in void adios2::format::BP4Deserializer::DefineAttributeInEngineIO<std::__cxx11::basic_string<char, std::char_traits<ch
1│ Dump of assembler code for function __cxa_throw: │ar>, std::allocator<char> > >(adios2::format::BPBase::ElementIndexHeader const&, adios2::core::Engine&, std::vector<char, std::allocator<char>
2│ 0x00007f21bac88660 <+0>: endbr64 │ > const&, unsigned long) const () from /usr/local/lib/../lib/libadios2_core.so.2
3│ 0x00007f21bac88664 <+4>: push %r13 │#3 0x00007f21b9fa6da7 in adios2::format::BP4Deserializer::ParseAttributesIndexPerStep(adios2::format::BufferSTL const&, adios2::core::Engine&
4│ 0x00007f21bac88666 <+6>: mov %rdx,%r13 │, unsigned long, unsigned long) () from /usr/local/lib/../lib/libadios2_core.so.2
5│ 0x00007f21bac88669 <+9>: push %r12 │#4 0x00007f21b9fa5f48 in adios2::format::BP4Deserializer::ParseMetadata(adios2::format::BufferSTL const&, adios2::core::Engine&, bool) () fro
6│ 0x00007f21bac8866b <+11>: mov %rsi,%r12 │m /usr/local/lib/../lib/libadios2_core.so.2
7│ 0x00007f21bac8866e <+14>: push %rbp │#5 0x00007f21b9de34b1 in adios2::core::engine::BP4Reader::InitBuffer(std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::dur
8│ 0x00007f21bac8866f <+15>: mov %rdi,%rbp │ation<double, std::ratio<1l, 1000000000l> > > const&, std::chrono::duration<double, std::ratio<1l, 1l> > const&, std::chrono::duration<double,
9│ 0x00007f21bac88672 <+18>: nop │ std::ratio<1l, 1l> > const&) () from /usr/local/lib/../lib/libadios2_core.so.2
10│ 0x00007f21bac88673 <+19>: callq 0x7f21bac7aa70 <__cxa_get_globals@plt> │#6 0x00007f21b9de1d82 in adios2::core::engine::BP4Reader::Init() () from /usr/local/lib/../lib/libadios2_core.so.2
11│ 0x00007f21bac88678 <+24>: mov %r13,%rdx │#7 0x00007f21b9ddfb65 in adios2::core::engine::BP4Reader::BP4Reader(adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char
12│ 0x00007f21bac8867b <+27>: mov %r12,%rsi │>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm) () from /usr/local/lib/../lib/libadios2_core.so.2
13│ 0x00007f21bac8867e <+30>: mov %rbp,%rdi │#8 0x00007f21b9d11229 in std::shared_ptr<adios2::core::Engine> adios2::core::IO::MakeEngine<adios2::core::engine::BP4Reader>(adios2::core::IO
14│ 0x00007f21bac88681 <+33>: addl $0x1,0x8(%rax) │&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm) () from /usr/lo
15│ 0x00007f21bac88685 <+37>: callq 0x7f21bac7aae0 <__cxa_init_primary_exception@plt> │cal/lib/../lib/libadios2_core.so.2
16│ 0x00007f21bac8868a <+42>: movl $0x1,(%rax) │#9 0x00007f21ba36abb9 in std::_Function_handler<std::shared_ptr<adios2::core::Engine> (adios2::core::IO&, std::__cxx11::basic_string<char, st
17│ 0x00007f21bac88690 <+48>: lea 0x60(%rax),%rbp │d::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm), std::shared_ptr<adios2::core::Engine> (*)(adios2::co
18│ 0x00007f21bac88694 <+52>: mov %rbp,%rdi │re::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm)>::_M_invo
19│ 0x00007f21bac88697 <+55>: callq 0x7f21bac79bb0 <_Unwind_RaiseException@plt> │ke(std::_Any_data const&, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::M
20│ 0x00007f21bac8869c <+60>: mov %rbp,%rdi │ode&&, adios2::helper::Comm&&) () from /usr/local/lib/../lib/libadios2_core_mpi.so.2
21│ 0x00007f21bac8869f <+63>: callq 0x7f21bac78690 <__cxa_begin_catch@plt> │#10 0x00007f21b9cf7174 in adios2::core::IO::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios
22│ 0x00007f21bac886a4 <+68>: callq 0x7f21bac78180 <_ZSt9terminatev@plt> │2::Mode, adios2::helper::Comm) () from /usr/local/lib/../lib/libadios2_core.so.2
23│ End of assembler dump. │#11 0x00007f21b9cf8300 in adios2::core::IO::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios
~│ │2::Mode) () from /usr/local/lib/../lib/libadios2_core.so.2
~│ │#12 0x00007f21bae606a1 in adios2::IO::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mod
~│ │e) () from /usr/local/lib/libadios2_cxx11.so.2
~│ │#13 0x00007f21c49060ce in openPMD::detail::BufferedActions::getEngine (this=0x616000042c80) at /home/franz/git-repos/openPMD-api/src/IO/ADIOS/
~│ │ADIOS2IOHandler.cpp:2472
~│ │#14 0x00007f21c4918945 in openPMD::detail::BufferedActions::configure_IO (this=0x616000042c80, impl=...) at /home/franz/git-repos/openPMD-api/
~│ │src/IO/ADIOS/ADIOS2IOHandler.cpp:2450
** Dump of assembler code for function __cxa_throw: (7f21bac88660 - 7f21bac886a4) ** │#15 0x00007f21c491090c in openPMD::detail::BufferedActions::BufferedActions (this=0x616000042c80, impl=..., file=...) at /home/franz/git-repos
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
...............................................................................
/home/franz/git-repos/openPMD-api/test/SerialIOTest.cpp:5886: FAILED:
{Unknown expression after the reported line}
due to unexpected exception with message:
ERROR: attribute /date has been defined and its value cannot be changed, in
call to DefineAttribute
===============================================================================
test cases: 1 | 1 failed
assertions: 2 | 1 passed | 1 failed
I'll try to find out under which circumstances it happens (ADIOS2 version, ADIOS2 build mode)
Alright, I think I have it figured out:
- The ASAN environment makes the test run slower
- Slow enough that when appending, the `/date` attribute sometimes changes to the next second
- When reading, ADIOS2 v2.7 checks that duplicate attributes are identical (due to using the same internal code as for defining attributes in writing) and throws an error
- This behavior seems to be fixed in ADIOS2 v2.8
- It's easy to trigger this behavior by running other CI configurations with ADIOS2 v2.7 under valgrind
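The race described in this list can be mimicked without ADIOS2. The sketch below is a plain-Python model, not ADIOS2 code: `StrictAttributes` stands in for a reader that rejects a redefinition with a different value, and the timestamp format is copied from the `openpmd-ls` output earlier in this thread.

```python
from datetime import datetime, timedelta


class StrictAttributes:
    """Toy attribute store that refuses to change an existing value."""

    def __init__(self):
        self._values = {}

    def define(self, name, value):
        if name in self._values and self._values[name] != value:
            raise ValueError(
                f"attribute {name} has been defined and its value cannot be changed"
            )
        self._values[name] = value


def date_stamp(moment):
    """Timestamp with one-second resolution."""
    return moment.strftime("%Y-%m-%d %H:%M:%S +0000")


def can_read(stamps):
    """Replay the /date attribute of every write session, as a reader would."""
    attributes = StrictAttributes()
    try:
        for stamp in stamps:
            attributes.define("/date", stamp)
    except ValueError:
        return False
    return True


created = datetime(2022, 5, 17, 12, 20, 15)

# Fast run: the append happens within the same second.
fast = can_read([date_stamp(created), date_stamp(created + timedelta(milliseconds=200))])

# Slow run (e.g. under ASAN or valgrind): the append crosses a second boundary.
slow = can_read([date_stamp(created), date_stamp(created + timedelta(milliseconds=1200))])

print(fast, slow)
```

In this model the failure depends only on whether the two writes straddle a second boundary, which matches the sporadic nature of the CI failure.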
The best fix would be to avoid writing duplicate attributes; I need to check whether that is possible somehow.
The issue is already present in dev, I'll try going for a fix in #1218 because it has some features that should help in fixing this.
I've addressed all review comments and cleaned up the commit history today. Commit descriptions are mostly very detailed. Tests ran green, so ready for review :) @ax3l