awkward icon indicating copy to clipboard operation
awkward copied to clipboard

Array class validation

Open agoose77 opened this issue 3 years ago • 2 comments

Description of new feature

This issue was originally very long, but the fundamental motivation is quite simple. Right now, behaviour classes cannot perform validation at construction time; the only opportunity for validating the current layout is whenever a user-defined method is called, e.g.:

class OnlyFloats(Array):
    def validate(self):
        dtypes = []
        def get_dtype(layout, **kwargs):
            if isinstance(layout, ak._v2.contents.NumpyArray):
                assert np.issubdtype(layout.dtype, np.floating)
        self.layout.recursively_apply(get_dtype)

ak.behavior["only_floats"] = OnlyFloats
>>> x = ak.with_parameter(Array([1,2,3]), "__array__",  "only_floats")
>>> x.validate()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_362562/672691759.py in <cell line: 1>()
----> 1 validate(ak._v2.Array(l))
...

/tmp/ipykernel_362562/3059306649.py in get_dtype(layout, **kwargs)
      3     def get_dtype(layout, **kwargs):
      4         if isinstance(layout, ak._v2.contents.NumpyArray):
----> 5             assert np.issubdtype(layout.dtype, np.floating)
      6     self.layout.recursively_apply(get_dtype)

AssertionError:

This means that a user could associate a behaviour with a poorly structured array, and only discover their error at a later point in the program when the validation method is called.

I wonder if we should consider adding a hook that allows behaviours to run user-defined code at the time of array wrapping, e.g. to validate the array:

class Array(...):
    def __init__(self, ...):
        self._on_layout_added()

    def _on_layout_added(self):
        pass


class OnlyFloats(Array):
    def _on_layout_added(self):
        dtypes = []
        def get_dtype(layout, **kwargs):
            if isinstance(layout, ak._v2.contents.NumpyArray):
                assert np.issubdtype(layout.dtype, np.floating)
        self.layout.recursively_apply(get_dtype)

ak.behavior["only_floats"] = OnlyFloats


x = ak.with_parameter(Array([1,2,3]), "__array__",  "only_floats")

This would mean that we don't have to come up with some Awkward-specific validation tool: this can be left up to the library authors.

The caveat here is that I don't think we want this validation step to do anything too magical - ideally the mental model of behaviours as a mechanism to attach methods to classes holds.

agoose77 avatar May 30 '22 14:05 agoose77

This is a good idea—I've run into the same issue when constructing MomentumVector4D and a field like "px" is missing or a non-scalar type. (The latter can happen when using an interference you expect to zip that doesn't actually zip.)

Some things to think about:

  • If it's called on every ak.Array subclass constructor, it needs to be pretty lightweight. We'll need to communicate with library developers that this is not the place for O(n) tests of array contents. If there's any way to ensure that it's only called for user input, not any intermediate arrays, we should consider it. (Maybe direct constructors call it but ak._v2._util.wrap avoids it?)
  • It could be a dunder method, but it would be more in keeping with the way we do things if it's in the ak.behaviors dict.
  • Maybe it could be not automatic at all, but only invoked in ak.validity_error?
  • If it's defined in ak.behaviors instead of a dunder method, it could be invoked even if the target type is buried deep in an array.

Since @Saransh-cpp is doing a lot of work on Vector right now, it's the perfect time to coordinate and ensure that it's usable in this primary use-case.

jpivarski avatar May 30 '22 21:05 jpivarski

Just making notes - it would also be handy to have this for vector in order to prevent users setting multiple aliases for e.g. time-like quantities.

More notes!

I've thought about this further, and I like the idea of this being an ak.behavior feature. My rationale is that ak.behavior is a more general mechanism than behavior classes, and importantly we don't pollute the class namespace.

We already use the arrayclass mechanism for nested layouts, e.g. in to_list, but I feel that this should be an exception given the relevance of __getitem__ rather than a rule.

agoose77 avatar Jul 11 '22 08:07 agoose77