repro What is the 'entry point' for a project?

It's also important to consider that the clean_data file might not be the entry point to the project for secondary analyses, so I also add a run_me.R or simply the Rmd as primary entry point. Originally posted by @cjvanlissa in https://github.com/aaronpeikert/reproducible-research/pull/29#issuecomment-629141197

I think this is something we should discuss together, as we have different solutions to it. For me, the main entry point for every user, whether they want to replicate or extend an analysis, is:

make all

May 15 '20 09:05 aaronpeikert

I hope that I can convince you (@cjvanlissa) to include a Makefile automatically (because the strict structure of a worcs project allows it to be inferred). Whether a user actually makes use of the Makefile is another question.

May 15 '20 09:05 aaronpeikert

@aaronpeikert I'm already convinced, I just don't have the know-how :)

What does the makefile do if people don't have access to the original data?

May 15 '20 10:05 cjvanlissa

Great! My idea was to use worcs::load_data() and write the result to a data file. But I have to admit that I don't know off the top of my head how WORCS saves the files etc.

Either we don't include the dependency on the closed data file, or we catch the case of nonexistent files. The first solution wouldn't detect changes to the original data, but has a simpler Makefile. The second solution has a more complicated Makefile, but much of the complexity can be hidden in worcs R functions.
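A minimal sketch of what "catching the case of nonexistent files" could look like on the R side. The helper name read_data_with_fallback() and its arguments are hypothetical and not part of worcs or repro; it only assumes that a synthetic counterpart of the closed data file may exist.

```r
# Hypothetical helper (not a worcs or repro function): read the original data
# if it is available, otherwise fall back to its synthetic counterpart.
read_data_with_fallback <- function(original, synthetic,
                                    read_fun = utils::read.csv) {
  if (file.exists(original)) {
    return(read_fun(original))
  }
  if (file.exists(synthetic)) {
    message("Original data not found; using synthetic data: ", synthetic)
    return(read_fun(synthetic))
  }
  stop("Neither ", original, " nor ", synthetic, " exists.")
}

# Usage: simulate a user who only has the synthetic file.
closed_file <- file.path(tempdir(), "closed_data_not_shared.csv")
synth_file <- tempfile(fileext = ".csv")
utils::write.csv(head(mtcars), synth_file, row.names = FALSE)
dat <- read_data_with_fallback(closed_file, synth_file)
```

With something like this, the Makefile itself would not need to know whether the closed file is present; the rule could simply call the R function.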

May 15 '20 10:05 aaronpeikert

Yeah, worcs::load_data() would work; I think it would be elegant if we allow for different data formats than .csv in that case... although I still prefer to stick with text based only

May 15 '20 10:05 cjvanlissa

I don't know how you feel, but I love opinionated solutions, until I hate that they restrict me somewhere where it hurts. I think of worcs as the opinionated version of repro. worcs is easier to use and to understand because of the restrictions that come with it. And personally I love worcs for just that (because you strike such a good balance)! The "one central csv file" restriction is an example of it. I think it is a very sensible restriction, and for most researchers it is helpful. They don't have to weigh the options, because you thought about it for them.

However, it will frustrate some people. Those who want more should use repro. And because we have joined forces, it will be super easy for them to extend worcs with repro to match their demands.

May 15 '20 10:05 aaronpeikert

BTW I wrote it like that: https://github.com/aaronpeikert/repro/blob/66a8c07221b0802c19fd7f170c7235db51b0777b/R/automate.R#L196-L201 So users may use any read* command they like.

May 15 '20 10:05 aaronpeikert

Yes. Would it make sense to try to guess func based on data if the argument is NULL?
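To make the question concrete, here is a rough sketch of guessing from the file extension when func is NULL. The function name guess_read_fun() and the extension table are assumptions for illustration; this is not the code in repro's automate.R.

```r
# Hypothetical sketch: choose a read function from the file extension,
# but only when the caller did not supply one.
guess_read_fun <- function(data, func = NULL) {
  if (!is.null(func)) {
    return(func)  # an explicit choice always wins over guessing
  }
  ext <- tolower(tools::file_ext(data))
  switch(ext,
    csv = utils::read.csv,
    tsv = utils::read.delim,
    rds = readRDS,
    stop("Cannot guess a read function for extension '", ext,
         "'; please supply func.")
  )
}

reader <- guess_read_fun("mtcars.csv")
```

Failing loudly on unknown extensions keeps the guess from silently picking the wrong reader.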

May 15 '20 10:05 cjvanlissa

Or do that, AND store the func used for saving to make sure we get the settings right?

May 15 '20 10:05 cjvanlissa

Give an error message if it fails, like:

warning("The file ", data, " was saved using the function call: ", yaml_repro_current()$data$save_func)

May 15 '20 10:05 cjvanlissa

I am hesitant about the guessing (but could imagine using rio::import() for it), but I like the idea of storing it. How would you integrate it in the metadata of the Rmd?

---
title: "Test2"
author: "Aaron Peikert"
date: "1/13/2020"
output: html_document
repro:
  packages:
    - lubridate
    - readr
  data:
    - mtcars.csv
  scripts:
    - analyze.R
    - plots.R
---

May 15 '20 10:05 aaronpeikert

I would generally be very interested in hearing your ideas about how to integrate your automated data workflow into this metadata (closed, open, synthetic).

May 15 '20 10:05 aaronpeikert

Have you seen how I solved it in the CRAN submission? For every real data file, store the file name. As a subheader, store the file name of its synthetic counterpart (here you could put the save function too).

Separate heading for all checksums, regardless of whether they are for original or synthetic data, or any other file

May 15 '20 10:05 cjvanlissa

But you have it in a CSV, right? So a YAML counterpart would look like:

repro:
  data:
    mtcars.csv:
      checksum: "aslödfjasdlk"
      synthetic: mtcars_synth.csv
      read_fun: readr::read_csv
      write_fun: readr::write_csv

May 15 '20 11:05 aaronpeikert

Not really pretty, but as everything is optional it might be a good solution.

May 15 '20 11:05 aaronpeikert

Aww, the only problem is that it has to be repeated for every Rmd, so we should have an option like:

repro:
  data_config: data_config.csv

and/or

repro:
  data_config: data_config.yml

May 15 '20 11:05 aaronpeikert

@aaronpeikert I'm inclined to write a wrapper for this, which does the following:

When users call save_data(), they provide arguments like save_data(data, save_func, load_func)
save_func and load_func are written to the yaml
Check that running load_func on the saved data results in an object identical to data

When running load_data(), load_func is retrieved from the yaml
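A rough sketch of how such a wrapper pair might behave, under loud assumptions: the signatures below are hypothetical and not the actual worcs API, the functions are stored by name as strings, and base R's write.dcf() stands in for the YAML file so that the example has no package dependencies. The round-trip check uses all.equal(), which is tolerance-based and therefore looser than identical().

```r
# Hypothetical sketch, not the worcs implementation.
save_data <- function(data, file, save_func, load_func,
                      meta = paste0(file, ".meta")) {
  # Save with the user-chosen function, given by name.
  eval(parse(text = save_func))(data, file)
  # Record which functions were used (DCF stands in for the YAML metadata).
  write.dcf(
    data.frame(file = file, save_func = save_func, load_func = load_func,
               stringsAsFactors = FALSE),
    file = meta
  )
  # Round-trip check: reloading must reproduce the object that was saved.
  reloaded <- eval(parse(text = load_func))(file)
  if (!isTRUE(all.equal(data, reloaded))) {
    stop("Reloading ", file, " with ", load_func,
         " does not reproduce the data.")
  }
  invisible(file)
}

load_data <- function(file, meta = paste0(file, ".meta")) {
  # Retrieve the stored load function instead of asking the user again.
  info <- read.dcf(meta, all = TRUE)
  eval(parse(text = info$load_func))(file)
}

# Usage with text-based storage; the two helpers are illustrative only.
write_csv_plain <- function(x, file) utils::write.csv(x, file, row.names = FALSE)
read_csv_plain <- function(file) utils::read.csv(file)

dat <- data.frame(x = 1:3, y = c(0.5, 1.5, 2.5))
path <- tempfile(fileext = ".csv")
save_data(dat, path, save_func = "write_csv_plain", load_func = "read_csv_plain")
restored <- load_data(path)
```

Evaluating function names read from a metadata file runs whatever that file contains, so a real implementation would probably want to restrict which names are accepted.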

May 20 '20 09:05 cjvanlissa
