traits TraitSet, TraitList and TraitDict: validate _all

TraitSet, TraitList and TraitDict: validate _all_ incoming values?

Open mdickinson opened this issue 4 years ago • 28 comments

The current policy in TraitList, TraitSet and friends is that values that are going to be added to the list/set/whatever are passed through the item validator, while values being removed are not.

This makes some sense: one worry here is that we might do something like this, where mymodel is a HasTraits instance and my_set is a Set-based trait on mymodel:

element = find_matching_element(mymodel.my_set)
do_something_with(element)
mymodel.my_set.remove(element)

Here element has likely come from direct iteration over the TraitSet object, and so should be passed directly to the remove call without any transformation. If we passed element through the item_validator and the validation changed it in some way, so that the remove call then failed, that's clearly undesirable.

However, the logic needed to follow this policy in something like TraitSet.symmetric_difference gets really ugly, and the result that some of the elements of the r.h.s to TraitSet.symmetric_difference will be passed through the validator, while others won't, depending on the exact contents of the set being modified, seems like a recipe for confusion.

I think there may be a better/simpler way, namely:

assume and document that item_validators should be idempotent: i.e., that for an already validated item item, item_validator(item) == item. (Probably actually identical rather than just equal in practice, but equality is what matters for collections.) This is already true for all item validators that we care about, I think, including for coercing trait types like CStr and potentially value-modifying trait types like Float (consider integer inputs to Float, for example).
validate all incoming values to set/list/dict operations. (Credit to @midhun-pm, who originally suggested that we do this, while I originally shot the idea down.)

The interesting side-effect of the second point is that given a Set(Int) trait, an operation like my_set.remove("not an integer") would raise TraitError instead of the expected KeyError. The exception in this situation would likely represent a coding bug anyway (rather than being an expected error that people expected to handle), so I think this is a reasonable tradeoff for the extra simplicity and predictability of the TraitSet operations.

Apr 22 '20 10:04 mdickinson

validate all incoming values to set/list/dict operations.

Would that include __contains__ too?

Apr 22 '20 10:04 kitchoi

Would that include __contains__ too?

Yes, for simplicity: the rule would be that everything gets passed through the normalization / validation.

It's going to make some things work, and others not. But I think that's true no matter what we do. (For example, if x in myobj.myset is true, we'd expect myobj.myset.remove(x) not to raise.)

Apr 22 '20 10:04 mdickinson

Aargh. But __contains__ shouldn't raise anything at all, as a general rule, which was of course your point. :-(

Apr 22 '20 10:04 mdickinson

But I think we're okay: we still validate, but we don't pass on the TraitError: we catch it and return False.

Apr 22 '20 10:04 mdickinson

Yep. Catch it and return False would be fine. Just to clarify... for removing values, the proposal here is only about validating the elements provided, but not to transform them before removal, right? If that's true, I am seeing how the ugly block in TraitSet.symmetric_difference you quoted can be made simpler...

Apr 22 '20 10:04 kitchoi

I am seeing

I am sorry... I keep missing important negation. I meant to say "I am not seeing".

Apr 22 '20 10:04 kitchoi

the proposal here is only about validating the elements provided, but not to transform them before removal, right

I don't know how you can separate the two, given that item_validator is doing both validation and normalization/transformation. (Which is fine.)

But no, the proposal would be to validate and normalize, and remove the normalised elements.

Apr 22 '20 10:04 mdickinson

(BTW, I think "normalization" is a better way of thinking about the transformation that the validator is doing than "transformation", since it suggests that idempotence that we do have in practice.)

Apr 22 '20 10:04 mdickinson

So in other words, the first line of symmetric_difference_update would be:

values = {self.item_validator(value) for value in values}

and from that point on we wouldn't do anything more with the original values or the item validator.

Apr 22 '20 11:04 mdickinson

I see. Now I understand.

Just putting this on the table for consideration: Deprecate the use of trait types that do casting in containers such that something like this will raise: Dict(CStr).

Then we can compare the cost/feasibility of doing it and the cost/feasibility of prohibiting it...

Apr 22 '20 11:04 kitchoi

Deprecate the use of trait types that do casting in containers such that something like this will raise: Dict(CStr).

I'd be okay with that, but I think it doesn't solve this problem: we likely still want to be able to do List(Float) or perhaps Set(Float), and Float also potentially does value-changing conversions (e.g., from integers).

I'm fairly sure we want to keep supporting conversions of integers to floats.

IOW, I think that proposal is orthogonal to this one.

Apr 22 '20 11:04 mdickinson

we likely still want to be able to do List(Float) or perhaps Set(Float), and Float also potentially does value-changing conversions (e.g., from integers).

But Python handles that, so we should be okay?:

>>> x = [1.0, 2.0, 3.0]
>>> x.remove(1)
>>> x
[2.0, 3.0]
>>> x = set([1.0, 2.0, 3.0])
>>> x.remove(1)
>>> x
{2.0, 3.0}

Apr 22 '20 11:04 kitchoi

Those are the cases where the value doesn't change (i.e., the transformed value is still == to the old one). Here's an example where it does:

>>> class A(HasTraits):
...     foo = List(Float)
... 
>>> x = 10**23
>>> a = A()
>>> a.foo.append(x)
>>> a.foo.remove(x)  # expect this to work ...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mdickinson/Enthought/ETS/traits/traits/trait_list_object.py", line 716, in remove
    super().remove(value)
  File "/Users/mdickinson/Enthought/ETS/traits/traits/trait_list_object.py", line 395, in remove
    super().remove(value)
ValueError: list.remove(x): x not in list

Apr 22 '20 11:04 mdickinson

Here's one that I think is more plausible:

>>> from traits.api import *
>>> from pathlib import Path
>>> class A(HasTraits):
...     files = List(File)
... 
>>> p = Path("/Users/mdickinson/somefile.txt")
>>> a = A()
>>> a.files.append(p)
>>> a.files.remove(p)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mdickinson/Enthought/ETS/traits/traits/trait_list_object.py", line 716, in remove
    super().remove(value)
  File "/Users/mdickinson/Enthought/ETS/traits/traits/trait_list_object.py", line 395, in remove
    super().remove(value)
ValueError: list.remove(x): x not in list

Apr 22 '20 11:04 mdickinson

I am in favour of this proposal in its general sweep.

I think it should be explicitly documented that validation is expected to be idempotent up to equality (ie. validate(x) == validate(validate(x))) independently of what we decide to do about when validation occurs. And, of course, make sure that existing validators do in fact follow the rule :)

Separately, I think for containers we should validate everything and then do comparison, as every situation where there may be a difference of behaviour seems to me like it is a bug - the current behaviour is almost certainly not what the person who wrote the code expects. I think the new TraitList, TraitSet objects should do this.

The case where the sort of exception returned changes does worry me a bit, and so there may need to be discussion about whether we want to catch and transform some of these errors. Eg. for a = Set(Int) I would expect that either:

a.discard('foo') does not raise and a.remove('foo') raises a KeyError; or
both a.discard('foo') and a.remove('foo') raise TraitError

I can see the argument for both, but I feel like the first may be needed simply for backwards compatibility reasons. I think the heuristic can be "if validate(x) raises, then x is not in the collection" and proceed as if that were the case (either raising KeyError, or computing the result). Things like symmetric difference get a bit gnarly, but I think can be resolved: a ^= {'foo'} should raise a TraitError via the logic that 'foo' fails to validate so it is not in a so the symmetric difference means that we need to add it to a, but that fails the traits constraints, so it is a TraitError.

However we go, we should be able to document a simple logic that allows developers to understand and anticipate the behaviour.

Apr 22 '20 11:04 corranwebster

I updated the title to clarify that this isn't just about TraitSet.

Yes, I agree that a.discard("foo") should maintain the promise that discard never raises, mostly because we should maintain the equivalence between a LBYL check (if obj in my_set: my_set.remove(obj)) and the equivalent my_set.discard(obj), and we definitely don't want the containment check to raise.

It's not 100% clear cut, though: if someone were attempting to discard something that's actually of the wrong type, then that's likely a coding error somewhere, but still.

Apr 22 '20 12:04 mdickinson

BTW, this is tentatively in 6.1.0, partly just because I think we should discuss it before the release, but I don't think any changes necessarily have to happen in 6.1.0.

OTOH, I want to avoid the death-by-a-thousand-cuts situation where the behaviour changes in subtle ways multiple times across releases.

Apr 22 '20 12:04 mdickinson

From my own experience of using traits, I would quite readily accept that this raises:

>>> class A(HasTraits):
...     files = List(File)
... 
>>> p = Path("/Users/mdickinson/somefile.txt")
>>> a = A()
>>> a.files.append(p)
>>> a.files.remove(p)

Because if I try to print a.files, I can immediately see that the value in the list is a string, and I understand the transformation that happens at the time of insertion. In fact, most of the time I don't even remember the list on an instance of HasTraits is not just an ordinary list. Doing normalization for __contains__ or similar logic will repaint that mental model.

Apr 22 '20 12:04 kitchoi

most of the time I don't even remember the list on an instance of HasTraits is not just an ordinary list

I think that's exactly the problem, though: we should make it possible for users to not have to think about the specialness (especially since some of those users could be third-party code that thinks it's just handling a regular list). And a user would expect that after you append something to a list my_list, then something in my_list would return True. At the moment, we break that invariant. I think this proposal brings TraitList closer to list, reduces the number of surprises like this, and makes it easier for people not to have to care that they have a TraitList rather than a plain old list.

Apr 22 '20 12:04 mdickinson

And for sets in particular, there are going to be algorithms (e.g., graph traversal algorithms) that rely on x in s being true after an s.add(x).

Apr 22 '20 12:04 mdickinson

But as a user, I would also be surprised to see Path("/Users/mdickinson/somefile.txt") in a.files returns True, while I clearly see a string in the list. To be honest, I find the current behaviour easier to understand.

Apr 22 '20 12:04 kitchoi

But most usage is going to be in code, not at the command line: there will be no-one to "see" a string in the list.

I think we have a genuine practical problem here. Suppose you have a List(File) trait. Perhaps it's a list of DNA sequence files currently selected for processing; perhaps it's a list of files that are known to be open and need to be closed during cleanup.

Suppose that, for whatever reason, you're given a new file to add to that list, and that file comes in the form of a pathlib.Path instance p. At some point in your code, you've done myobj.dna_files.append(p) (perhaps the user clicked on a corresponding checkbox, or selected it from a file dialog, or ...).

How do you then programmatically undo that operation, given that myobj.dna_files.remove(p) won't work?

Apr 22 '20 13:04 mdickinson

I think that surprise would also occur while this usage goes into code, and manual inspection is often the tool people would use for debugging. Let's say the file actually comes in the form of string and we have normalization for __contains__:

>>> a.files = ["/Users/name/file"]
>>> Path("/Users/name/file") in a.files
True
>>> any(path == Path("/Users/name/file") for path in a.files)
False

Wouldn't that be surprising?

How do you then programmatically undo that operation, given that myobj.dna_files.remove(p) won't work?

Then I would need to convince traits not to mess with my data when they go in, maybe like this:

class A(HasTraits):
    files = List(Union(Instance(Path), File))

Or even just this if I only need to deal with Path.

class A(HasTraits):
    files = List(Instance(Path))

Apr 22 '20 13:04 kitchoi

Wouldn't that be surprising?

Yes, it would. I don't think anything's going to remove all the surprises. But I think we can rid of most this way, as well as achieve conceptual simplicity (making the behaviour fit better into people's brains).

I wouldn't want to end up recommending that people write List(Union(Instance(Path), File)); it seems like a smell that such a complicated solution would be necessary. We want the obvious thing, List(File) to just work, as far as that's possible.

Perhaps we should remove the understanding of path-like objects from File? Or introduce CFile for such things, and then disallow it in containers?

Apr 22 '20 13:04 mdickinson

I am all in for not doing normalization at all for containers and recommend this pattern. We can have a strict version for all the trait types that current does normalization, e.g. we can have StrictFile that only does validation without messing with the value.

File might need to stay the way it is for backward compatibility.

Apr 22 '20 13:04 kitchoi

File might need to stay the way it is for backward compatibility.

~Oh no File is not released yet. :)~ Nope, it has been there, I don't know what I am doing.

Apr 22 '20 13:04 kitchoi

I am all in for not doing normalization at all for containers

I don't think that's an option, given that things like Int, Float and Bool normalise.

Apr 22 '20 14:04 mdickinson

Deferring the discussion to 6.2.0

Apr 22 '20 14:04 mdickinson

traits traits copied to clipboard

TraitSet, TraitList and TraitDict: validate _all_ incoming values?

traits
traits copied to clipboard