schema icon indicating copy to clipboard operation
schema copied to clipboard

Recursive priorities

Open erikrose opened this issue 9 years ago • 10 comments

If we wanted to quash all priority issues, once and for all, we'd have to go one final, additional step from #54: rather than constructing Schema objects lazily, in validate(), we should traverse all the way down the hierarchy, turning things into Schema subclasses, in Schema.__init__ (or, more likely, in Schema.__new__ or some other factory function, so we can spit out different classes). Then we would have the complete compound priority, like (VALIDATOR, VALIDATOR, COMPARABLE) for Optional(And(int, is_even)), and any remaining problems with nondeterminism would be scared into a little corner where we can't decide which of, say, 2 callables has higher priority. If someone wanted to explicitly set priorities, they could write their own class with a priority() method, or we could even introduce a manual priority= kwarg on Schema for those special, hard-to-reach cases—or for cases where people don't want to go by our concept of specificity.

Creating this issue as a place for discussion about this.

erikrose avatar Feb 27 '15 17:02 erikrose

Yes :smile:

keleshev avatar Mar 04 '15 13:03 keleshev

Yay! It'll probably even increase performance, since we'll construct all those Schema instances once and then can validate against them 10,000 times without doing it again.

erikrose avatar Mar 04 '15 15:03 erikrose

Another factor that could be considered in priorities is the order of keys in the dict, if it were an OrderedDict. I wonder how much of the rest of this that makes unnecessary.

erikrose avatar Mar 05 '15 17:03 erikrose

I haven't thought it through exhaustively, but I want to elevate this to the level of "proposal" and see what you think, @keleshev. What if we removed literal dicts and sets and replaced them with Dicts and Sets, similar schema-provided constructs that are ordered? Then priority could go away as a concept. This would mean…

  • Less to teach users
  • Easier-to-reason-about schemas—it would just try things in order until one matched
  • No need for any priority args or method overrides to vary from our default idea of priority
  • A lot less code. We could even rip out a fair amount of what's currently in there.

Of course, it wouldn't be backward-compatible; you probably care more about that than I do. And I'd miss the dict-literal syntax. But for so much simplification, I find it compelling.

erikrose avatar Mar 09 '15 23:03 erikrose

Yeah, compelling. Let me see how it looks

# now
Schema({
    'shape': str,
    Optional('color', default='blue'): str,
    str: str,
})

# proposed
Dict(
    ('shape', str),
    (Optional('color', default='blue'), str),
    (str, str),
)

keleshev avatar Mar 10 '15 11:03 keleshev

So I had shot at this issue. You can find the preliminary results in my recursive-priorities branch.

I decided to go with a factory function schema that more or less replaces the old Schema class.

Example:

In [2]: s = schema({Optional("a"): int, Optional(str): object})

In [3]: s.validate({"a": 3, "b": 1})
Out[3]: {'a': 3, 'b': 1}

In [4]: s
Out[4]: Dict([(Optional(Comparable('a')), Type(int)), (Optional(Type(str)), Type(object))])

The design isn't finished IMO and the test suite still contains a lot of failing tests, mostly due to reprs turning out differently now.

I just wanted to hear what you think about it before I adjust all the error messages in the test suite. ;)

sjakobi avatar Oct 06 '15 06:10 sjakobi

Can someone fill me in? What are the priority problems? An example would be especially helpful, as I've never run into these problems.

skorokithakis avatar Oct 06 '15 11:10 skorokithakis

I don't have time for a full response right now, but take a look at #23, especially this comment.

sjakobi avatar Oct 06 '15 12:10 sjakobi

I see, thank you. Your example doesn't really make it clear, as an int is also an object, so we don't know which of the two matched. I'll have a look at your code to see how you do this, but it seems like a welcome change. It does throw a spanner in the works for the code I've been working on, which is self-generating documentation from a schema, but that's life.

skorokithakis avatar Oct 06 '15 13:10 skorokithakis

Any performance gains would be highly beneficial to dvc. Now validating 20k yaml files like:

md5: 09f910d6eb6b6013d6fd292e9b7b2884
outs:
- md5: 038a480057ac11cb3ba68e35d29e044f
  path: 1.txt
  cache: true
  metric: false
  persist: false

takes ~10s for me.

Suor avatar Nov 03 '19 16:11 Suor