aiida-core icon indicating copy to clipboard operation
aiida-core copied to clipboard

`Node`: Validate nested attribute and extra keys

Open mbercx opened this issue 2 years ago • 5 comments
trafficstars

Fixes #3863

Currently only the top-level keys of the attributes and extras of a node are validated upon instantiation of the node. Nested keys are only (silently) converted to a string when storing the node. This behaviour of silently mutating the keys is dangerous.

Here we add a recursive validation step to the _validate() method of the Node class, that relies on the already defined validate_attribute_extra_key function. The _validate() method is only called upon storing the node, and hence this adjustment does not cause any significant performance issues (see PR for benchmarks).

Additionally, typing is added to all methods of the Dict class.

mbercx avatar Apr 15 '23 22:04 mbercx

I've done some benchmarks to get an idea of the impact on performance. For an independent recursive function:

def validate_dict(dictionary):
    for key, value in dictionary.items():
        if not isinstance(key, str):
            raise TypeError(f'Found non-string key `{key}` in dictionary with type `{type(key)}`.')
        if isinstance(value, dict):
            validate_dict(value)

and the following dictionaries:

d1 = {str(top): value for top, value in zip(range(100), range(100))}
d2 = {
    str(top): {str(mid): value for mid, value in zip(range(100), range(100))}
    for top in range(100)
}
d3 = {
    str(top): {
        str(mid): {str(deep): value for deep, value in zip(range(100), range(100))}
        for mid in range(100)
    }
    for top in range(100)
}

I get the following timings for the validate_dict function:

d1: 6.45 µs ± 99.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
d2: 665 µs ± 9.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
d3: 67.6 ms ± 647 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, pretty linear scaling vs the number of keys, which is sort of expected.

When I run

%%timeit -n 10

orm.Dict(<dictionary>).store()

for the three dictionaries, I get the following comparison of performance before and after adding the validation:

d1:
- before: 14.9 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
- after: 12.5 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
d2:
- before: 106 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
- after: 107 ms ± 1.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
d3:
- before: 7.79 s ± 42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
- after: 8.05 s ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So there might be some small influence on the performance supermassive dictionary, but nothing critical.

mbercx avatar Apr 15 '23 23:04 mbercx

@mbercx I selected approve, but of course I meant "request changes" 🤦

sphuber avatar May 17 '23 10:05 sphuber

Just a thought: this change is actually backwards-incompatible. It's possible some plugins rely on automatic casting of the nested Dict keys as strings, and their code might break.

mbercx avatar May 29 '23 22:05 mbercx

Just a thought: this change is actually backwards-incompatible. It's possible some plugins rely on automatic casting of the nested Dict keys as strings, and their code might break.

Coming back to this: this is the reason I didn't include this in v2.4. I think we can simply try/catch the added validation and for now just add a warn.warn_deprecation message saying that this behavior will start to except starting from AiiDA v3.0 Then we can merge this PR. Do you see a problem with that approach @mbercx ?

sphuber avatar Jul 11 '23 15:07 sphuber

Note that I came across an example that is currently problematic but would actually be fixed by this PR. If you construct a Dict with a tuple as a nested key, you can create the node just fine, but when storing, the code will except on the database insertion level. This can leave the Sqlalchemy session in a broken state. Adding this validation before storing would solve this.

sphuber avatar Sep 05 '23 13:09 sphuber