uproot5 icon indicating copy to clipboard operation
uproot5 copied to clipboard

Slow "contains" lookup and error message reporting for files with large amounts of keys

Open kratsg opened this issue 3 years ago • 4 comments

There is a damerau_levenshtein function being called when we hit a missing key that takes a long time when the number of keys in a file is very large (https://github.com/scikit-hep/uproot4/blob/85f219a36e76dffc18da4756227a7beb760657a0/src/uproot/_util.py#L810-L858).

Similarly, potential_name in directory is also not the best: https://github.com/scikit-hep/uproot4/blob/85f219a36e76dffc18da4756227a7beb760657a0/src/uproot/reading.py#L1913-L1919

For now, we'll have to call directory.keys() and store that in a set for valid key lookups.

Originally posted by @kratsg in https://github.com/scikit-hep/pyhf/discussions/1687#discussioncomment-1621078

kratsg avatar Nov 10 '21 16:11 kratsg

You beat me to it, so here's the additional content I was writing:

Instead, ReadOnlyDirectory.__contains__ should

def __contains__(self, where):
    if not hasattr(self, "_cached_keys"):
        self._cached_keys = set(self.iterkeys()) | set(self.iterkeys(cycle=False))
    return where in self._cached_keys

Thus, any string where that can be used in __getitem__, which is in the keys() with cycle numbers or without cycle numbers, would return True from __contains__.

Then __contains__ would be the fast function we expect it would be.

jpivarski avatar Nov 10 '21 16:11 jpivarski

In addition, the damerau_levenshtein function should be skipped when there's a very large number of keys. I haven't looked at the implementation recently, but I'll bet it's O(n²).

jpivarski avatar Nov 10 '21 16:11 jpivarski

In pyhf where we have histograms in two possible paths (because of how HistFactory works) - avoiding the DeserializationError by comparing against a (cached) set of keys gives us a speedup from 20hrs down to ~10seconds.

kratsg avatar Nov 10 '21 17:11 kratsg

Error handling code, for errors that are supposed to propagate to the user, need not be fast. As long as the error gets raised on a human timescale (< 1 second), it's fine.

It's only a problem when errors are caught as non-exceptional cases (e.g. StopIteration). The fact that try-except allows us to use exceptions for these two very different purposes puts "Do I have enough information in the error message?" versus "Is this fast enough?" into conflict.

jpivarski avatar Nov 10 '21 17:11 jpivarski

This issue was fixed in #595

ioanaif avatar Feb 15 '24 15:02 ioanaif