Rake fails to re-run file task if file content changes but has timestamp older than target

Open shreyasbharath opened this issue 7 years ago • 15 comments

I am not sure if this has been debated before.

Most modern build tools (Bazel, Buck, etc.) detect changes by file content rather than by timestamp. That makes them suitable for caching schemes that keep builds incremental and reduce build times.

Rake, on the other hand, relies solely on timestamps and does not fit well with caching schemes.

For example, say an XML file is a prerequisite/input used to generate source files. If the XML file changes but still has an older timestamp than the source files (entirely possible in caching schemes), the source files are not regenerated. This causes builds to fail or, more dangerously, to pass incorrectly.
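
A minimal sketch of that scenario, using only Ruby's standard library rather than Rake itself. The file names `schema.xml` and `schema.c` and the method `stale_checks` are made up for illustration; the mtime comparison only approximates a make-style check.

```ruby
require 'digest'
require 'tmpdir'

# Returns [stale_by_mtime, stale_by_digest] for a simulated cache restore.
def stale_checks
  Dir.mktmpdir do |dir|
    input  = File.join(dir, 'schema.xml') # hypothetical prerequisite
    output = File.join(dir, 'schema.c')   # hypothetical generated file

    # First build: remember the input's digest, then "generate" the output.
    File.write(input, '<v>1</v>')
    recorded = Digest::SHA256.file(input).hexdigest
    File.write(output, '/* generated from v1 */')

    # Simulate a cache restoring different content with an old timestamp.
    File.write(input, '<v>2</v>')
    an_hour_ago = Time.now - 3600
    File.utime(an_hour_ago, an_hour_ago, input)

    # Make-style check: rebuild only if the input is newer than the output.
    by_mtime = File.mtime(input) > File.mtime(output)
    # Content check: rebuild if the digest differs from the recorded one.
    by_digest = Digest::SHA256.file(input).hexdigest != recorded

    [by_mtime, by_digest]
  end
end
```

Here the timestamp check reports the output as up to date while the digest check reports it as stale, which is the mismatch described above.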

Is there any plan to make Rake on par with the more modern build tools of today?

shreyasbharath avatar Jul 12 '17 02:07 shreyasbharath

Rake was designed to have the behavior of make but be written in Ruby. Changing from timestamp to checksum could break behavior of existing rake workflows so I find it very difficult to justify changing the behavior of Rake::FileTask#needed?.

However, it is possible to implement such a system in rake. A new task type such as Rake::ChecksumedFileTask could be added and #needed? implemented on it to compare the file's current checksum with a checksum stored from the last run.

You should be able to write this as a Rake extension and adjust it to meet your needs.

drbrain avatar Jul 12 '17 17:07 drbrain

Good idea on implementing this as a Rake extension. I am keen to work on this and put it up as a PR.

Any ideas on how this can be implemented so that the performance is on par with checking timestamps?

shreyasbharath avatar Jul 13 '17 04:07 shreyasbharath

Hm, where is a good place to store / check the old checksum?

rickhull avatar Nov 05 '17 22:11 rickhull

Good question, what about an internal database within Rake? Bad idea?

shreyasbharath avatar Nov 06 '17 06:11 shreyasbharath

I really like this idea but I think it's outside the scope of rake -- due to needing some kind of persistent storage for such metadata, presumably on the filesystem. Not that it can't be implemented with e.g. ChecksumFileTask, but I think it makes more sense as a separate gem, not part of the rake core. Just a thought -- I haven't paid much attention to rake internals until very recently.

rickhull avatar Nov 06 '17 06:11 rickhull

Here's a first stab, without attempting to solve the "old checksum" problem yet: https://gist.github.com/rickhull/f93f8114836b7a8ded767041a54cef1a

# frozen_string_literal: true
require 'rake/task'
require 'digest'

module Rake
  class ChecksumTask < Task
    # satisfying an API here; scope is purposely ignored
    def self.scope_name(_scope, task_name)
      Rake.from_pathname(task_name)
    end

    def checksum
      # Digest::SHA256, etc etc
      Digest::MD5.file(name).hexdigest
    end

    def needed?
      !File.exist?(name) or
        new_checksum?(checksum) or
        @application.options.build_all
    end

    def new_checksum?(sum)
      # retrieve old checksum and compare
    end
  end
end

For storing and retrieving the old checksums, I think a YAML file in the project root makes sense. Something like .checksums.yaml with a hash inside keyed by relative file paths. PStore, YAML::Store, or https://github.com/rickhull/dotcfg could be used for this.

rickhull avatar Nov 06 '17 18:11 rickhull

Great, thanks @rickhull ! You think a YAML file will be fast enough when it comes to read access times?

A performance comparison of ChecksumTask vs FileTask would be a good starting point.

shreyasbharath avatar Nov 06 '17 22:11 shreyasbharath

@shreyasbharath yes, it should be plenty fast enough. The YAML file should be read once at the beginning of the process, then the Hash of filenames to checksums would be checked and updated in memory, then the YAML file written once at the end of the process. I would guess we are talking under a thousandth of a second in terms of YAML / filesystem overhead per process.
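
One way the read-once / update-in-memory / write-once flow could look, as a rough sketch with plain `YAML` and `Digest`. The class name `ChecksumCache` and its methods are hypothetical and not part of Rake; a `#needed?` implementation could call something like `changed?`.

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper, not part of Rake: checksums keyed by file path.
class ChecksumCache
  def initialize(path = '.checksums.yaml')
    @path = path
    # Read the YAML file once; start empty if it is missing or blank.
    @sums = File.exist?(path) ? (YAML.load_file(path) || {}) : {}
    @dirty = false
  end

  # Compare the file's current digest with the stored one and remember
  # the new value in memory. Returns true when the content changed
  # (or when no checksum was stored for this file yet).
  def changed?(file)
    current = Digest::SHA256.file(file).hexdigest
    return false if @sums[file] == current

    @sums[file] = current
    @dirty = true
    true
  end

  # Write the YAML file once, and only if something changed.
  def save
    File.write(@path, @sums.to_yaml) if @dirty
  end
end
```

`save` could be called from an `at_exit` hook so the file is written once per process. This sketch records the new checksum as soon as it is checked; a real extension would probably want to record it only after the task has run successfully.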

rickhull avatar Nov 06 '17 23:11 rickhull

Spurred to create https://github.com/rickhull/dotcfg/blob/master/test/bench/dotcfg.rb

Output from a single CPU vagrant VM:

$ rake bench
Run options: --seed 27105

# Running:

.........

Finished in 0.008236s, 1092.7685 runs/s, 1578.4434 assertions/s.

9 runs, 13 assertions, 0 failures, 0 errors, 0 skips
Warming up --------------------------------------
write /tmp/dotcfg-267819642
                       390.000  i/100ms
rewrite /tmp/dotcfg-267819642
                       274.000  i/100ms
Calculating -------------------------------------
write /tmp/dotcfg-267819642
                          3.851k (± 7.5%) i/s -     19.110k in   5.004839s
rewrite /tmp/dotcfg-267819642
                          2.803k (± 3.6%) i/s -     14.248k in   5.091059s

Comparison:
write /tmp/dotcfg-267819642:     3851.5 i/s
rewrite /tmp/dotcfg-267819642:     2802.7 i/s - 1.37x  slower

{"foo"=>36682}

3.8k writes per second; 2.8k rewrites per second. Implies the total overhead (aside from require or DotCfg.new) is well under a thousandth of a second per rake invocation, if structured properly.

rickhull avatar Nov 06 '17 23:11 rickhull

That's pretty good.

What about computation of the hash (on the file) itself? This will of course be slower than a simple timestamp comparison. I wonder how Bazel, Buck and the like do it.

shreyasbharath avatar Nov 07 '17 01:11 shreyasbharath

Trivial IMHO. I suspect they are using the same underlying MD5 / SHA digest libs, unless they've gotten to nth-degree optimization. But rake and ruby have larger inherent time sinks that should swamp such concerns. I would ignore the performance concerns for now. "Fast enough" will almost certainly be achievable with Hash, Digest, and YAML.
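
For anyone who wants to measure rather than guess, a small sketch that times mtime lookups against MD5 digests of a temporary file. The method name `compare_costs` and its default sizes are arbitrary choices for illustration; results will vary by machine and file size.

```ruby
require 'benchmark'
require 'digest'
require 'tempfile'

# Time `runs` mtime lookups and `runs` MD5 digests of a file of `bytes` bytes.
# Returns the elapsed seconds for each approach.
def compare_costs(bytes: 1_000_000, runs: 100)
  Tempfile.create('digest-bench') do |f|
    f.write('x' * bytes)
    f.flush

    mtime  = Benchmark.realtime { runs.times { File.mtime(f.path) } }
    digest = Benchmark.realtime { runs.times { Digest::MD5.file(f.path).hexdigest } }

    { mtime: mtime, digest: digest }
  end
end
```

Calling `compare_costs` with file sizes typical of the project's inputs gives a rough idea of whether digest time matters next to rake's own startup cost.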

rickhull avatar Nov 07 '17 02:11 rickhull

Thanks for your insights.

It sounds like you are keen to work on it 😄 ? I can give you a hand if required, although my Ruby skills aren't top notch.

shreyasbharath avatar Nov 07 '17 04:11 shreyasbharath

My interest in this is just academic right now. I suggest you run with it, and I'll help out where I can.

I suggest:

  1. create a new github repo -- don't worry about making it a gem or anything. just a couple files and folders
  2. create lib/checksum_file_task.rb (or whatever name you prefer; seed it with my gist, if you like)
  3. create test/checksum_file_task.rb (as above; this file can be empty for now)

We can pick this up over there :)

rickhull avatar Nov 07 '17 05:11 rickhull

Okay I will let you know when I set this up @rickhull 😄

shreyasbharath avatar Nov 07 '17 20:11 shreyasbharath

Has there been any development on this one? I have a use case where the timestamps on input files aren't a reliable way to determine whether the output needs updating. If this hasn't gone anywhere, what would my alternative be?

KyleRAnderson avatar Apr 25 '21 23:04 KyleRAnderson