
Warn / Error on bad schema

Open d70-t opened this issue 2 years ago • 1 comment

Is your feature request related to a problem? Please describe.

I tried building a custom schema (and custom message types, but that's irrelevant here) for FDB. I came up with a schema similar to:

[a[b?[c]]]
[a[b[d]]]

I.e. if c is present in the key, b is optional, but if d is present, b is required. I expected this to work; however, the following sample code fails:

#include <fdb5/api/FDB.h>
#include <eckit/runtime/Main.h>

int main(int argc, char** argv) {
    eckit::Main::initialise(argc, argv);
    auto fdb = fdb5::FDB{};
    fdb.archive(fdb5::Key({{"a", "1"}, {"b", "2"}, {"c", "4"}}), "foo", 4);  // works fine
    fdb.archive(fdb5::Key({{"a", "1"}, {"b", "2"}, {"d", "4"}}), "bar", 4);  // crashes
}

when compiled & run as follows:

g++ -o badschema badschema.cpp -lfdb5 -leckit && ./badschema

The reported error is:

terminate called after throwing an instance of 'eckit::SeriousBug'
  what():  SeriousBug: Key::get() failed for [c] in {a=1,b=2,d=4}  in  (/src/fdb/src/fdb5/database/Key.cc +192 get)

I.e. FDB tries to get the value for "c" in the second archive call, presumably because it accidentally matches the key against the first (instead of the second) schema rule.

I've been in contact with @simondsmart, who suggested using the schema

[a[b?[c][d]]]

instead. However, the test code fails with the same issue (and this schema also doesn't encode that b is required when d is present).

Describe the solution you'd like

I'd like to see both of the two schemas above work with the provided example code.

Describe alternatives you've considered

If it's impossible to make those schemas work, FDB should generate an understandable error message explaining that the schema is invalid, instead of accepting some keys for storage and then crashing on others.

Additional context

No response

Organisation

MPIM

d70-t avatar Sep 19 '23 17:09 d70-t

Sorry for the very slow response. I got pulled aside on some joyous internal distractions.

We need to be very careful about describing things as "bad schema" rather than "behaviour that I didn't expect".

The three levels of the schema mean different things. When using the filesystem backend:

  1. Identifies the directory the data is stored in
  2. Identifies the subsets of data that will be collocated in the same data (and index) files
  3. Identifies the values that can vary within one collocated dataset

These have consequences.

For the first and second levels, the hierarchical search pattern exists for (practical and technical) reasons. Matching is done from the top of the schema downwards, and once a rule matches, further matching stops. This allows us to specify schemas starting with the most specific rules at the top (for instance "[ class=od, expver, stream=oper/dcda/scda, date, time, domain? ]") with more generic rules further down ("[ class, expver, stream, date, time, domain? ]"). A consequence of this is that we can't directly duplicate rules without making them become more generic as we go.
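This first-match behaviour can be sketched with a small toy model (a deliberately simplified, hypothetical model, NOT the real fdb5 matcher): with the reporter's two-rule schema, a key containing d still commits to the first rule, because its first two levels match, and then fails on the required third-level keyword [c]; a single rule with both third-level keywords optional accepts both keys.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy model of top-down, first-match rule selection. A rule has three
// levels; each level lists (keyword, is_optional) pairs. A key is
// modelled as just the set of keywords it contains.
struct Level { std::vector<std::pair<std::string, bool>> keys; };
struct Rule  { Level l1, l2, l3; };

// A level matches when every non-optional keyword is present in the key.
bool levelMatches(const Level& lvl, const std::set<std::string>& key) {
    for (const auto& kv : lvl.keys)
        if (!kv.second && !key.count(kv.first)) return false;
    return true;
}

// Walk the schema top-down; commit to the first rule whose first two
// levels match, then demand its required third-level keywords.
// Returns "" on success, otherwise the name of the missing keyword.
std::string archive(const std::vector<Rule>& schema,
                    const std::set<std::string>& key) {
    for (const auto& r : schema) {
        if (!levelMatches(r.l1, key) || !levelMatches(r.l2, key)) continue;
        for (const auto& kv : r.l3.keys)
            if (!kv.second && !key.count(kv.first)) return kv.first;
        return "";
    }
    return "<no rule matched>";
}

// The reporter's schema: [a[b?[c]]] followed by [a[b[d]]].
const std::vector<Rule> kTwoRules = {
    {Level{{{"a", false}}}, Level{{{"b", true}}},  Level{{{"c", false}}}},
    {Level{{{"a", false}}}, Level{{{"b", false}}}, Level{{{"d", false}}}},
};

// A single merged rule with both third-level keys optional: [a[b?[c?,d?]]].
const std::vector<Rule> kMerged = {
    {Level{{{"a", false}}}, Level{{{"b", true}}},
     Level{{{"c", true}, {"d", true}}}},
};
```

Under this model, the key {a, b, d} never reaches the second rule: the first rule's [a] and [b?] levels already match, so matching stops there and the required [c] is reported as missing, mirroring the SeriousBug in the report.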

Once we get to the third level of the schema, this level identifies the meaning of the values stored in the index. We are no longer identifying which data file/index we are using, so it is meaningless to have multiple different alternatives; there can be only one option at that level.

The error thrown in this case is correct. Your key has matched on [a] and [b]. Given that, the schema requires key c to be supplied. Key c is not supplied, so the call fails. To be able to use this hierarchy of keys, with (IIRC) the need for b to be optional, you would need to use the schema

[ a [ b? [ c?, d? ]]]

I presume, however, that this schema with a, b, c, d is a reduction of a real problem, and I would very much suggest that you elaborate on what you are trying to do, as I suspect that this suggested schema is unlikely to be useful for a realistic problem; it just looks a bit weird. If you can let me know which keys you are trying to archive with, and crucially what the write pattern and the distribution of values amongst those keys are, then we can settle on something a bit more optimal.

simondsmart avatar Oct 03 '23 23:10 simondsmart