Rake fails to re-run file task if file content changes but has timestamp older than target
I am not sure if this has been debated before.
Most modern build tools (Bazel, Buck, etc.) detect changes by file content rather than by timestamp. That makes them suitable for use in caching schemes to make builds incremental and reduce build times.
Rake, on the other hand, relies solely on timestamps and does not fit in well with caching schemes.
For example, say an XML file is a prerequisite/input for generating source files. If the XML file changes but still has an older timestamp than the source files (entirely possible in caching schemes), the source files are not regenerated. This causes builds to fail or, more dangerously, to pass incorrectly.
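To make the failure mode concrete, here is a minimal sketch in plain Ruby. It does not call Rake: `stale_by_mtime?` and `stale_by_checksum?` are hypothetical helpers that stand in for a timestamp rule and a content rule, and the file names are made up for illustration.

```ruby
require 'digest'
require 'tmpdir'

# Simplified stand-in for a timestamp rule: rebuild only when the target is
# missing or the prerequisite is newer than it (not Rake's actual code).
def stale_by_mtime?(target, prereq)
  !File.exist?(target) || File.mtime(prereq) > File.mtime(target)
end

# Content rule: rebuild when the prerequisite's digest differs from the
# digest recorded at the last build.
def stale_by_checksum?(prereq, recorded_digest)
  Digest::SHA256.file(prereq).hexdigest != recorded_digest
end

Dir.mktmpdir do |dir|
  xml = File.join(dir, 'schema.xml')
  src = File.join(dir, 'generated.rb')

  File.write(xml, '<schema version="1"/>')
  recorded = Digest::SHA256.file(xml).hexdigest
  File.write(src, '# generated from version 1')

  # Simulate a cache restore: new content, but an mtime one hour in the past.
  File.write(xml, '<schema version="2"/>')
  past = Time.now - 3600
  File.utime(past, past, xml)

  puts "timestamp rule says stale: #{stale_by_mtime?(src, xml)}"      # false
  puts "checksum rule says stale:  #{stale_by_checksum?(xml, recorded)}" # true
end
```

The timestamp rule misses the change because the restored XML file looks older than the generated file, while the digest comparison catches it.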
Is there any plan to make Rake on par with the more modern build tools of today?
Rake was designed to have the behavior of make but be written in Ruby. Changing from timestamp to checksum could break the behavior of existing rake workflows, so I find it very difficult to justify changing the behavior of Rake::FileTask#needed?.
However, it is possible to implement such a system in rake. A new task type such as Rake::ChecksumedFileTask could be added and #needed? implemented on it to compare the file's current checksum with a checksum stored from the last run.
You should be able to write this as a Rake extension and adjust it to meet your needs.
Good idea on implementing this as a Rake extension. I am keen to work on this and put it up as a PR.
Any ideas on how this can be implemented so that the performance is on par with checking timestamps?
Hm, where is a good place to store / check the old checksum?
Good question, what about an internal database within Rake? Bad idea?
I really like this idea but I think it's outside the scope of rake -- due to needing some kind of persistent storage for such metadata, presumably on the filesystem. Not that it can't be implemented with e.g. ChecksumFileTask, but I think it makes more sense as a separate gem, not part of the rake core. Just a thought -- I haven't paid much attention to rake internals until very recently.
Here's a first stab, without attempting to solve the "old checksum" problem yet: https://gist.github.com/rickhull/f93f8114836b7a8ded767041a54cef1a
# frozen_string_literal: true

require 'rake/task'
require 'digest'

module Rake
  class ChecksumTask < Task
    # satisfying an API here; scope is purposely ignored
    def self.scope_name(_scope, task_name)
      Rake.from_pathname(task_name)
    end

    def checksum
      # Digest::SHA256, etc etc
      Digest::MD5.file(name).hexdigest
    end

    def needed?
      !File.exist?(name) or
        new_checksum?(checksum) or
        @application.options.build_all
    end

    def new_checksum?(sum)
      # retrieve old checksum and compare
    end
  end
end
For storing and retrieving the old checksums, I think a YAML file in the project root makes sense. Something like .checksums.yaml with a hash inside keyed by relative file paths. PStore, YAML::Store, or https://github.com/rickhull/dotcfg could be used for this.
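As a rough illustration of the YAML::Store option, the sketch below keeps one digest per path in a YAML file. `ChecksumStore` and its `new_checksum?` method are hypothetical names chosen for this example (the method name mirrors the stub in the gist above); nothing here is part of Rake.

```ruby
require 'digest'
require 'yaml/store'

# Hypothetical helper: records one MD5 digest per path in a YAML file.
class ChecksumStore
  def initialize(path = '.checksums.yaml')
    @store = YAML::Store.new(path)
  end

  # True when `file` has no recorded digest or its content differs from the
  # recorded one. The current digest is recorded either way.
  def new_checksum?(file)
    sum = Digest::MD5.file(file).hexdigest
    @store.transaction do
      changed = @store[file] != sum
      @store[file] = sum
      changed
    end
  end
end
```

One caveat with this shape: the digest is recorded at check time, so if the build step then fails, the next run will consider the file unchanged. A real implementation would probably want to record the digest only after the task body succeeds.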
Great, thanks @rickhull! Do you think a YAML file will be fast enough when it comes to read access times?
A performance comparison of ChecksumTask vs FileTask would be a good starting point.
@shreyasbharath yes, it should be plenty fast enough. The YAML file should be read once at the beginning of the process, then the Hash of filenames to checksums would be checked and updated in memory, then the YAML file written once at the end of the process. I would guess we are talking under a thousandth of a second in terms of YAML / filesystem overhead per process.
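The read-once / update-in-memory / write-once flow described above could look something like the following sketch. `ChecksumCache` and its methods are hypothetical names, and plain YAML is used here instead of any particular store library.

```ruby
require 'digest'
require 'yaml'

# Hypothetical cache: one read at startup, lookups in memory, one write at exit.
class ChecksumCache
  def initialize(path = '.checksums.yaml')
    @path = path
    # Read the YAML file once; fall back to an empty Hash if it is missing or empty.
    @sums = File.exist?(path) ? (YAML.safe_load(File.read(path)) || {}) : {}
    @dirty = false
  end

  # Compare the file's current digest with the in-memory record.
  def changed?(file)
    @sums[file] != Digest::MD5.file(file).hexdigest
  end

  # Update the in-memory record only; nothing touches the disk here.
  def record(file)
    @sums[file] = Digest::MD5.file(file).hexdigest
    @dirty = true
  end

  # Write the YAML file once, and only if something was recorded.
  def save
    File.write(@path, YAML.dump(@sums)) if @dirty
    @dirty = false
  end
end

cache = ChecksumCache.new
at_exit { cache.save }
```

With this structure the YAML overhead is paid once per process regardless of how many files are checked, which is the assumption behind the estimate above.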
Spurred to create https://github.com/rickhull/dotcfg/blob/master/test/bench/dotcfg.rb
Output from a single CPU vagrant VM:
$ rake bench
Run options: --seed 27105
# Running:
.........
Finished in 0.008236s, 1092.7685 runs/s, 1578.4434 assertions/s.
9 runs, 13 assertions, 0 failures, 0 errors, 0 skips
Warming up --------------------------------------
write /tmp/dotcfg-267819642   390.000 i/100ms
rewrite /tmp/dotcfg-267819642   274.000 i/100ms
Calculating -------------------------------------
write /tmp/dotcfg-267819642   3.851k (± 7.5%) i/s - 19.110k in 5.004839s
rewrite /tmp/dotcfg-267819642   2.803k (± 3.6%) i/s - 14.248k in 5.091059s
Comparison:
write /tmp/dotcfg-267819642: 3851.5 i/s
rewrite /tmp/dotcfg-267819642: 2802.7 i/s - 1.37x slower
{"foo"=>36682}
3.8k writes per second; 2.8k rewrites per second. Implies the total overhead (aside from require or DotCfg.new) is well under a thousandth of a second per rake invocation, if structured properly.
That's pretty good.
What about computation of the hash (on the file) itself? This will of course be slower than a simple timestamp comparison. I wonder how Bazel, Buck and the like do it.
Trivial IMHO. I suspect they are using the same underlying MD5 / SHA digest libs, unless they've gotten to nth-degree optimization. But rake and ruby have larger inherent time sinks that should swamp such concerns. I would ignore the performance concerns for now. "Fast enough" will almost certainly be achievable with Hash, Digest, and YAML.
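If anyone wants numbers rather than guesses, a quick measurement along these lines would do. This is only a sketch: `seconds_per_call` is a hypothetical helper, and the 1 MB placeholder file and 100 runs are arbitrary choices, so results will vary with file size, disk cache, and machine.

```ruby
require 'digest'
require 'tempfile'

# Average wall-clock seconds per call of the given block.
def seconds_per_call(runs)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  runs.times { yield }
  (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) / runs
end

Tempfile.create('input') do |f|
  f.write('x' * 1_000_000) # 1 MB of placeholder content
  f.flush

  mtime  = seconds_per_call(100) { File.mtime(f.path) }
  md5    = seconds_per_call(100) { Digest::MD5.file(f.path).hexdigest }
  sha256 = seconds_per_call(100) { Digest::SHA256.file(f.path).hexdigest }

  printf("mtime:  %.6f s/file\n", mtime)
  printf("MD5:    %.6f s/file\n", md5)
  printf("SHA256: %.6f s/file\n", sha256)
end
```

Hashing will be slower than reading an mtime; the question this answers is whether the difference matters next to Ruby and Rake startup time for a given project.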
Thanks for your insights.
It sounds like you are keen to work on it 😄 ? I can give you a hand if required, although my Ruby skills aren't top notch.
My interest in this is just academic right now. I suggest you run with it, and I'll help out where I can.
I suggest:
- create a new github repo -- don't worry about making it a gem or anything. just a couple files and folders
- create lib/checksum_file_task.rb (or whatever name you prefer; seed it with my gist, if you like)
- create test/checksum_file_task.rb (as above; this file can be empty for now)
We can pick this up over there :)
Okay I will let you know when I set this up @rickhull 😄
Has there been any development on this one? I have a use case where the timestamps on input files aren't a reliable basis for determining whether the output needs updating. If this hasn't gone anywhere, what would my alternative be?