What is the 'entry point' for a project?
> It's also important to consider that the clean_data file might not be the entry point to the project for secondary analyses, so I also add a run_me.R or simply the Rmd as primary entry point.

Originally posted by @cjvanlissa in https://github.com/aaronpeikert/reproducible-research/pull/29#issuecomment-629141197
I think this is something we should discuss together, as we have different solutions for it. For me, the main entry point for every user, whether they want to replicate or extend an analysis, is:

```sh
make all
```
I hope that I may convince you (@cjvanlissa) to automatically include a Makefile (because the strict structure of a worcs project allows it to be inferred). Whether a user actually makes use of the Makefile is another question.
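To make this concrete, here is a rough sketch of the kind of Makefile that could be inferred from a worcs-style project. All file names here are illustrative assumptions, not actual worcs conventions (and note that recipe lines in a Makefile must start with a tab):

```make
# Hypothetical Makefile for a worcs-style project.
# File names are illustrative; a real generator would infer
# them from the project structure.
all: manuscript.pdf

# Knit the manuscript whenever the Rmd or the prepared data change.
manuscript.pdf: manuscript.Rmd data/clean_data.csv
	Rscript -e 'rmarkdown::render("manuscript.Rmd")'

# Re-run data preparation whenever the script or raw data change.
data/clean_data.csv: clean_data.R data/raw_data.csv
	Rscript clean_data.R
```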
@aaronpeikert I'm already convinced, I just don't have the know-how :)
What does the makefile do if people don't have access to the original data?
Great!
My idea was to use `worcs::load_data()` and write it to a data file. But I have to admit that I don't know off the top of my head how WORCS saves the files etc.
Either one doesn't include the dependency on the closed data file, or we catch the case of nonexistent files. The first solution wouldn't detect changes to the original data, but has a simpler Makefile. While the second solution has a more complicated Makefile, much of the complexity can be hidden in worcs R functions.
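The second option could look roughly like this in Make. This is a sketch under the assumption that the fallback to synthetic data happens inside the R script; the `$(wildcard ...)` call expands to nothing when the original file is absent, so the rule still works without it:

```make
# Sketch: depend on the original data only if it exists locally.
# If data/original.csv is missing, $(wildcard ...) expands to
# nothing, and the script falls back to a synthetic copy.
data/clean_data.csv: clean_data.R $(wildcard data/original.csv)
	Rscript clean_data.R
```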
Yeah, `worcs::load_data()` would work; I think it would be elegant if we allow for data formats other than .csv in that case... although I still prefer to stick with text-based only.
I don't know how you feel, but I love opinionated solutions, until I hate that they restrict me somewhere where it hurts.
I think of worcs as the opinionated version of repro. worcs is easier to use and to understand because of the restrictions that come with it. And personally I love worcs for just that (because you strike such a good balance)!
And the "one central csv file" restriction is an example of that. I think it is a very sensible restriction, and for most researchers it is helpful. They don't have to weigh the options, because you thought about it for them.
However, it will frustrate some people. Those people who want more should use repro. And because we have joined forces, it will be super easy for them to extend worcs with repro to match their demands.
BTW I wrote it like that: https://github.com/aaronpeikert/repro/blob/66a8c07221b0802c19fd7f170c7235db51b0777b/R/automate.R#L196-L201 So users may use any `read*` command they like.
Yes. Would it make sense to try to guess `func` based on `data` if the argument is `NULL`?
Or do that, AND store the `func` used for saving to make sure we get the settings right?
Give a warning message if it fails, like:

```r
warning("The file ", data, " was saved using the function call: ",
        yaml_repro_current()$data$save_func)
```
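Such guessing could start from the file extension. A minimal base-R sketch; the function name and the extension-to-reader mapping are made up for illustration, not part of repro or worcs:

```r
# Hypothetical helper: map a file extension to a default reader.
guess_read_fun <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv = "utils::read.csv",
    tsv = "utils::read.delim",
    rds = "base::readRDS",
    stop("No default reader known for extension: ", ext)
  )
}

guess_read_fun("mtcars.csv")  # "utils::read.csv"
```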
I am hesitant about the guessing (though I could imagine using `rio::import()` for it), but I like the idea of storing it. How would you integrate it into the metadata of the Rmd?
```yaml
---
title: "Test2"
author: "Aaron Peikert"
date: "1/13/2020"
output: html_document
repro:
  packages:
    - lubridate
    - readr
  data:
    - mtcars.csv
  scripts:
    - analyze.R
    - plots.R
---
```
I would generally be very interested in hearing your ideas about how to integrate your automated data workflow into this metadata (closed, open, synthetic).
Have you seen how I solved it in the CRAN submission? For every real data file, store the file name. As a subheader, store the file name of its synthetic counterpart (here you could put the save function too). A separate heading lists all checksums, regardless of whether they are for original or synthetic data, or any other file.
But you have it as a csv, right? So a YAML counterpart would look like:
```yaml
repro:
  data:
    mtcars.csv:
      checksum: "aslödfjasdlk"
      synthetic: mtcars_synth.csv
      read_fun: readr::read_csv
      write_fun: readr::write_csv
```
Not really pretty, but as everything is optional it might be a good solution.
Aww, the only problem is that it has to be repeated for every Rmd, so we should have an option like:
```yaml
repro:
  data_config: data_config.csv
```
and/or
```yaml
repro:
  data_config: data_config.yml
```
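The shared data_config.yml could then carry the per-file entries once, reused across all Rmds. A purely illustrative sketch, mirroring the per-file YAML above:

```yaml
mtcars.csv:
  checksum: "aslödfjasdlk"
  synthetic: mtcars_synth.csv
  read_fun: readr::read_csv
  write_fun: readr::write_csv
```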
@aaronpeikert I'm inclined to write a wrapper for this, which does the following:
- When users call `save_data()`, they provide arguments like `save_data(data, save_func, load_func)`.
- `save_func` and `load_func` are written to the yaml.
- Check that running `load_func` on the saved data results in an object identical to `data`.
- When running `load_data()`, `load_func` is retrieved from the yaml.
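A minimal sketch of that round-trip check in base R; the function name and signature are illustrative, not the actual worcs interface, and the yaml bookkeeping is left out:

```r
# Illustrative wrapper: save the data, re-load it with the paired
# reader, and warn if the round trip does not reproduce the object.
save_data_checked <- function(data, path,
                              save_fun = function(d, p) utils::write.csv(d, p, row.names = FALSE),
                              load_fun = function(p) utils::read.csv(p)) {
  save_fun(data, path)
  reread <- load_fun(path)
  if (!identical(data, reread)) {
    warning("Re-loading ", path, " did not reproduce the original object; ",
            "consider storing the save/load functions in the yaml.")
  }
  invisible(reread)
}

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tmp <- tempfile(fileext = ".csv")
res <- save_data_checked(df, tmp)
```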