schema
schema copied to clipboard
Recursive priorities
If we wanted to quash all priority issues, once and for all, we'd have to go one final, additional step from #54: rather than constructing Schema objects lazily, in validate(), we should traverse all the way down the hierarchy, turning things into Schema subclasses, in Schema.__init__
(or, more likely, in Schema.__new__
or some other factory function, so we can spit out different classes). Then we would have the complete compound priority, like (VALIDATOR, VALIDATOR, COMPARABLE)
for Optional(And(int, is_even))
, and any remaining problems with nondeterminism would be scared into a little corner where we can't decide which of, say, 2 callables has higher priority. If someone wanted to explicitly set priorities, they could write their own class with a priority() method, or we could even introduce a manual priority=
kwarg on Schema for those special, hard-to-reach cases—or for cases where people don't want to go by our concept of specificity.
Creating this issue as a place for discussion about this.
Yes :smile:
Yay! It'll probably even increase performance, since we'll construct all those Schema instances once and then can validate against them 10,000 times without doing it again.
Another factor that could be considered in priorities is the order of keys in the dict, if it were an OrderedDict. I wonder how much of the rest of this that makes unnecessary.
I haven't thought it through exhaustively, but I want to elevate this to the level of "proposal" and see what you think, @keleshev. What if we removed literal dicts and sets and replaced them with Dicts and Sets, similar schema-provided constructs that are ordered? Then priority could go away as a concept. This would mean…
- Less to teach users
- Easier-to-reason-about schemas—it would just try things in order until one matched
- No need for any
priority
args or method overrides to vary from our default idea of priority - A lot less code. We could even rip out a fair amount of what's currently in there.
Of course, it wouldn't be backward-compatible; you probably care more about that than I do. And I'd miss the dict-literal syntax. But for so much simplification, I find it compelling.
Yeah, compelling. Let me see how it looks
# now
Schema({
'shape': str,
Optional('color', default='blue'): str,
str: str,
})
# proposed
Dict(
('shape', str),
(Optional('color', default='blue'), str),
(str, str),
)
So I had shot at this issue. You can find the preliminary results in my recursive-priorities branch.
I decided to go with a factory function schema
that more or less replaces the old Schema
class.
Example:
In [2]: s = schema({Optional("a"): int, Optional(str): object})
In [3]: s.validate({"a": 3, "b": 1})
Out[3]: {'a': 3, 'b': 1}
In [4]: s
Out[4]: Dict([(Optional(Comparable('a')), Type(int)), (Optional(Type(str)), Type(object))])
The design isn't finished IMO and the test suite still contains a lot of failing tests, mostly due to repr
s turning out differently now.
I just wanted to hear what you think about it before I adjust all the error messages in the test suite. ;)
Can someone fill me in? What are the priority problems? An example would be especially helpful, as I've never run into these problems.
I don't have time for a full response right now, but take a look at #23, especially this comment.
I see, thank you. Your example doesn't really make it clear, as an int is also an object, so we don't know which of the two matched. I'll have a look at your code to see how you do this, but it seems like a welcome change. It does throw a spanner in the works for the code I've been working on, which is self-generating documentation from a schema, but that's life.
Any performance gains would be highly beneficial to dvc. Now validating 20k yaml files like:
md5: 09f910d6eb6b6013d6fd292e9b7b2884
outs:
- md5: 038a480057ac11cb3ba68e35d29e044f
path: 1.txt
cache: true
metric: false
persist: false
takes ~10s for me.