custodian Caching parsed files

In most of the handlers and validators the same output files are parsed in the check and/or in the correct methods. For VASP they are mainly the vasprun.xml and OUTCAR files. In most cases this is not a huge overhead, even though it is still a bit of a waste, but for calculations with large output data (e.g. MD) the multiple parsing could have an impact.

I think it would be beneficial to modify Custodian and the handlers in such a way that the parsed data could be cached and shared among the handlers.

If this could be interesting I can draft an implementation.
One way of doing this could be by creating a dictionary in Custodian._runjob() and setting it as a cached_data attribute in all the handlers. Every handler could then check if a parsed version of the data is already present in the dictionary and otherwise set it. For example in a handler one could have something like:

cached_data = getattr(self, "cached_data", None)
if cached_data and "Vasprun" in cached_data:
    vasprun = cached_data["Vasprun"]
else:
    vasprun = Vasprun("vasprun.xml")
    if cached_data is not None:
        cached_data["Vasprun"] = vasprun

Any thoughts about this? Would you consider merging a PR implementing this functionality?

Jul 06 '23 15:07 gpetretto

I think this is reasonable, though there are a few concerns.

Some of the files are changed over the course of a run. In essence, a cached file will miss those changes, esp for handlers that deal with on the fly monitoring. So we have to be careful where we apply this to.
I would actually opt for a lru implementation, e.g.,

@lru
def load_file(filename):
    ...
    return contents

Jul 06 '23 23:07 shyuep

Hi @shyuep, thanks for looking into this.

Indeed the first point is one to be careful about. My idea was to clear the cached_data dictionary in the _do_check method, after all the handlers have been iterated on:

    def _do_check(self, handlers, terminate_func=None):
        corrections = []
        for h in handlers:
            ...
        self.cached_data.clear()
        ...

Using the lru_cache would not allow to keep track of which functions have been cached from the Custodian object and prevent from clearing the cache. One option to keep both functionalities would be to implement an additional wrapper to the @lru_cache, that will also save the set of cached functions, allowing Custodian to clear all the caches when needed. Here a possible implementation:

class tracked_lru_cache:
    cached_functions = set()
    def __init__(self, func):
        self.func = functools.lru_cache()(func)
        functools.update_wrapper(self, func)

    def __call__(self, *args, **kwargs):
        result = self.func(*args, **kwargs)
        self.cached_functions.add(self.func)
        return result

    @classmethod
    def cache_clear(cls):
        while cls.cached_functions:
            f = cls.cached_functions.pop()
            f.cache_clear()

However, let me broaden the scope of the discussion. The main point was not strictly the caching, but rather optimizing the time spent parsing the output files. Caching seems a relatively easy way to reduce the parsing time by a factor N < 10. In contrast I have seen that in most of the cases the full output files (vasprun.xml, outcar) are parsed just to extract a single information. From a quick test, cases like

curr_drift = outcar.data.get("drift", [])[::-1][: self.to_average]

Could be sped up by a factor >100 if the full parsing of the OUTCAR is replaced by the search with regrep. This could be done without code repetition if the regex used to parse the outputs are exposed by the classes in pymatgen.io.vasp.outputs, while at the moment they are buried inside the code. This will also reduce the memory footprint of the parsing, which could be quite beneficial, especially for the monitors, where the DFT code is still being executed. In principle, both solutions could be implemented to fully optimize the parsing, since a single information is often used across different handlers (e.g. the Vasprun.converged property).

Jul 07 '23 09:07 gpetretto

I agree with using a regex. I am using the handlers to post-process, and I cannot include the ones that read the OUTCAR (DriftErrorHandler, LargeSigmaHandler) because they are too slow.

Jul 12 '23 19:07 xivh

custodian custodian copied to clipboard

Caching parsed files

custodian
custodian copied to clipboard