Data is not rebuilt in simple example.
I have remake.yml
sources:
- code.R
targets:
all:
depends: plot.pdf
data.csv:
command: download_data()
processed:
command: process_data("data.csv")
plot.pdf:
command: myplot(processed)
plot: true
and code.R
download_data = function(){
data(mtcars)
write.csv(mtcars, "data.csv")
}
process_data = function(file){
read.csv(file)
}
myplot = function(d){
plot(cyl ~ mpg, data = d)
}
First build and second builds looks okay.
> remake::make()
[ LOAD ]
[ READ ] | # loading sources
< MAKE > all
[ BUILD ] data.csv | download_data()
[ READ ] | # loading packages
[ BUILD ] processed | processed <- process_data("data.csv")
[ PLOT ] plot.pdf | myplot(processed) # ==> plot.pdf
[ ----- ] all
> remake::make()
[ LOAD ]
[ READ ] | # loading sources
< MAKE > all
[ OK ] data.csv
[ OK ] processed
[ OK ] plot.pdf
[ ----- ] all
However, if I leave the code alone and modify data.csv by hand, I get
[ LOAD ]
[ READ ] | # loading sources
< MAKE > all
[ OK ] data.csv
[ BUILD ] processed | processed <- process_data("data.csv")
[ READ ] | # loading packages
[ PLOT ] plot.pdf | myplot(processed) # ==> plot.pdf
[ ----- ] all
The plot is updated, but data.csv is no longer matches the contents of download_data.
I see. This is not something I'd really thought about to be honest; mostly I'd imagined that for files that remake controls users would leave them alone :grinning:
I could try and work around this, though it's not totally clear what the right thing to do is. download_data doesn't depend on data.csv so it seems odd to trigger rebuilding it when it changes for reasons that are external to remake.
One downside of adding in checks though is that it will involve computing the hash of every file target every time remake runs past one in the graph.
This is not an issue for object targets as they're out of the reach of all but the most determined user.
I'll think about this, but I'm not actively doing remake things at the moment (hopefully later in the year?).
I'm not exactly sure what the right thing to do is either. No rush on this, of course. Remake is already amazing.
Would it be enough to generate a warning if the timestamp on a programmatically-generated file is later than expected?
Hi Will!
Thanks for the prod on this. Unfortunately this and #51 are pretty subtle to get right and I am thinking about them at the moment. Hopefully there'll be movement again in the next couple of weeks.
I don't think that it's quite enough just to warn about timestamps because things like git make a real mess of timestamps and I'd rather focus on file content (this was one of the motivations for remake).
However, something like a warning could be good. The only issue is going to be efficiently detecting that the file has changed without too many rehashings. I did some work on the hash_files branch where we can efficiently check hashes within a session which might be a reasonable solution to not doing too much work.
The other bit to resolve is determining what is out of date here; do we
a. rerun download_data
b. warn about the inconsistency and do nothing
c. run the process_data step (and if so do we warn)
I'm generally not a big fan of warnings (preferring actual errors as warnings tend to lead to terrible UI, are hard to deal with and are usually actually errors) but in this case I wonder if it's the right move
I've thought about what you said, and I like what you did in store.R: only rehashing if quick heuristics like mtime change (hash_files branch). What about this?
- If mtime did not change, skip the file.
- If mtime changed, rehash the file.
- If the hash didn't change, maybe tell the user.
[ LOAD ]
[ READ ] | # loading sources
< MAKE > all
[ REHASH ] data.csv
[ BUILD ] processed | processed <- process_data("data.csv")
[ READ ] | # loading packages
[ PLOT ] plot.pdf | myplot(processed) # ==> plot.pdf
[ ----- ] all
- If the hash changed, recover/rebuild the file.
[ LOAD ]
[ READ ] | # loading sources
< MAKE > all
[ BUILD ] data.csv
[ BUILD ] processed | processed <- process_data("data.csv")
[ READ ] | # loading packages
[ PLOT ] plot.pdf | myplot(processed) # ==> plot.pdf
[ ----- ] all
One issue I see with printing [ REHASH ] is that it can't be printed until the potentially time-consuming rehashing is done: that is, unless you can remove [ REHASH ] from the console in cases when the file is rebuilt.
If you do choose the approach I mentioned on Dec 16, we need to deal with the imprecision of file.mtime(). On Mac and Windows, I can only get times to the nearest second with file.create("my_file.txt"); print(as.numeric(file.mtime("my_file.txt")), digits = 22). So maybe we force a rehash when we know we just wrote a file but the old mtime is still not older than the new mtime. Unless the user manually breaks the file in the same second that remake writes it, we should be covered.