
Update file dependencies for up-to-date tasks.

Open tillahoffmann opened this issue 2 years ago • 0 comments

This PR updates file dependencies in the doit database even if the task is already up to date. The change avoids repeated md5 computations for large files whose timestamp has changed but whose content has not.

Consider the following task, which simply copies large_file.txt to output.txt.

def task_copy():
    return {
        "actions": ["cp large_file.txt output.txt"],
        "targets": ["output.txt"],
        "file_dep": ["large_file.txt"],
    }

The first time doit runs, it saves the timestamp, size, and md5 hash of large_file.txt in its database. On the second run, doit skips calculating the md5 hash of large_file.txt because the timestamps match. So far so good.

Now suppose the timestamp changes but the content does not. This might happen if we delete an intermediate file which is then regenerated. On the next run, doit will evaluate the md5 hash of large_file.txt and skip the task because it's up to date, as expected. But it won't update the timestamp in the database, so the stored timestamp never matches again and every subsequent run of doit re-evaluates the md5 hash of large_file.txt.

This PR ensures the file dependencies are updated in the database even if the task is already up to date. Here's a concrete example using touch to update the timestamp. I've modified the check_modified function to report some debugging information (see end of description for details).

$ (master) rm -f .doit.db  # Start clean.
$ (master) doit
.  copy
$ (master) doit
-- copy
$ (master) touch large_file.txt
$ (master) doit
large_file.txt was modified at 15:53:09.664308; expected 15:51:36.076443
-- copy
$ (master) doit  # Evaluates md5 hash again (and will indefinitely).
large_file.txt was modified at 15:53:09.664308; expected 15:51:36.076443
-- copy
$ (check_modified) rm -f .doit.db  # Start clean.
$ (check_modified) doit
.  copy
$ (check_modified) doit
-- copy
$ (check_modified) touch large_file.txt
$ (check_modified) doit
large_file.txt was modified at 15:51:36.076443; expected 15:49:30.170537
-- copy
$ (check_modified) doit  # Does not evaluate md5 hash again (updated timestamp saved in previous run).
-- copy
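
The effect can also be reproduced without doit. The following is a minimal standalone sketch, not doit's implementation: the dict-based `db` and the helpers `snapshot`, `check`, and `count_hashes` are hypothetical names introduced for illustration. It mirrors the timestamp, size, md5 check and counts how often the md5 hash is computed after a single `touch`, with and without refreshing the stored timestamp.

```python
import hashlib
import os
import tempfile

hash_calls = 0  # counts how often the (expensive) md5 is computed


def file_md5(path):
    """Return the md5 hex digest of a file, counting each call."""
    global hash_calls
    hash_calls += 1
    with open(path, "rb") as fp:
        return hashlib.md5(fp.read()).hexdigest()


def snapshot(path):
    """Build the (timestamp, size, md5) state for a file."""
    stat = os.stat(path)
    return (stat.st_mtime, stat.st_size, file_md5(path))


def check(path, db, refresh):
    """Return True if the file content changed relative to db[path].

    With refresh=True, an unchanged file whose timestamp moved gets its
    stored timestamp updated (a stand-in for the behaviour in this PR).
    """
    timestamp, size, md5 = db[path]
    stat = os.stat(path)
    if stat.st_mtime == timestamp:
        return False  # cheap path: no hashing needed
    if stat.st_size != size:
        return True
    modified = file_md5(path) != md5
    if not modified and refresh:
        db[path] = (stat.st_mtime, size, md5)
    return modified


def count_hashes(refresh, runs=3):
    """Touch a file once, run `check` several times, return the md5 count."""
    global hash_calls
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "large_file.txt")
        with open(path, "w") as fp:
            fp.write("unchanged content")
        db = {path: snapshot(path)}
        # Equivalent of `touch`: move mtime forward, keep the content.
        stat = os.stat(path)
        os.utime(path, (stat.st_atime, stat.st_mtime + 100))
        hash_calls = 0
        for _ in range(runs):
            assert check(path, db, refresh) is False
        return hash_calls


print(count_hashes(refresh=False))  # 3: hashed on every run
print(count_hashes(refresh=True))   # 1: hashed once, then the timestamp matches
```

Without the refresh, the hash count grows with the number of runs; with it, the cost is paid once per timestamp change.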

For reference, this is the check_modified function modified to report the debugging information:

    def check_modified(self, file_path, file_stat, state):
        """
        Check if file in file_path is modified from previous "state".
        """
        timestamp, size, file_md5 = state

        # 1 - if timestamp is not modified file is the same
        if file_stat.st_mtime == timestamp:
            return False

        from datetime import datetime
        print(f"{file_path} was modified at {datetime.fromtimestamp(file_stat.st_mtime).time()}; "
              f"expected {datetime.fromtimestamp(timestamp).time()}")

        # 2 - if size is different file is modified
        if file_stat.st_size != size:
            return True

        # 3 - check md5
        return file_md5 != get_file_md5(file_path)

tillahoffmann, Jul 01 '22 19:07