
Should data objects be read in using different functions?

Open mariev opened this issue 3 years ago • 0 comments

When multiple scripts are listed in datapackager.yml, there are two options for accessing objects created by scripts earlier in the list:

  1. datapackager_object_read, which is for accessing objects that were created earlier in the same build, i.e. both Rmd files are set to enabled: yes
  2. project_data_path, which allows loading an .rda file created in a previous iteration of package_build()

This couples the two scripts in a way that requires manual updates when rebuilding the package. Take the case of two processing scripts, preprocess_A and preprocess_B, which generate A.rda and B.rda, respectively, where preprocess_B uses the output from preprocess_A.

In the following build scenario, we would use datapackager_object_read:

# Case 1
files:
  preprocess_A.Rmd:
    enabled: yes
  preprocess_B.Rmd:
    enabled: yes

In a subsequent build where preprocess_A is disabled (option 2 above), we have to update preprocess_B to use project_data_path:

# Case 2
files:
  preprocess_A.Rmd:
    enabled: no
  preprocess_B.Rmd:
    enabled: yes
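
To make the switch between the two cases concrete, here is a rough sketch of the two ways preprocess_B.Rmd might obtain A. The wrapper names read_A_same_build and read_A_previous_build are made up for illustration; only datapackager_object_read and project_data_path come from DataPackageR, and I'm assuming project_data_path accepts a file name inside data/.

```r
# Hypothetical wrappers around the two access paths. Both only work while
# package_build() is rendering the script, so they are defined here, not called.

# Case 1: preprocess_A.Rmd ran earlier in this same build.
read_A_same_build <- function() {
  DataPackageR::datapackager_object_read("A")
}

# Case 2: preprocess_A.Rmd is disabled; A.rda was written by a previous
# package_build() into the package's data/ directory.
read_A_previous_build <- function() {
  env <- new.env()
  load(DataPackageR::project_data_path("A.rda"), envir = env)
  get("A", envir = env, inherits = FALSE)
}
```

Moving from Case 1 to Case 2 then means changing which of these preprocess_B.Rmd calls, which is the manual edit described here.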

There is a certain logic to this update: it reflects a change of state in preprocess_B, which is no longer coupled to preprocess_A within the build.

However, if preprocess_A needs to be rerun for some reason, we have to take the following action:

  • update datapackager.yml to enable both files
  • switch preprocess_B.Rmd to use datapackager_object_read again (not especially intuitive)

I'm wondering whether data objects could always be read from the data/ location, but only after any earlier scripts in the build have written to that folder. This would ensure that the latest data is always used, while maximizing code portability.
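
As a rough illustration of what a single read path could look like from the script's side, here is a base R sketch. read_data_object is a hypothetical helper, not an existing DataPackageR function, and it assumes each object is saved as <name>.rda under its own name. It only helps if the build writes A.rda to data/ before preprocess_B runs, which is the behaviour being asked about.

```r
# Hypothetical helper (not part of DataPackageR): always read `name` from
# <data_dir>/<name>.rda, regardless of which build produced the file.
read_data_object <- function(name, data_dir) {
  rda <- file.path(data_dir, paste0(name, ".rda"))
  if (!file.exists(rda)) {
    stop("No saved object found at ", rda, call. = FALSE)
  }
  env <- new.env()
  load(rda, envir = env)  # load() restores objects under their saved names
  if (!exists(name, envir = env, inherits = FALSE)) {
    stop(rda, " does not contain an object named '", name, "'", call. = FALSE)
  }
  get(name, envir = env, inherits = FALSE)
}

# Usage, with a temporary directory standing in for the package's data/ folder.
data_dir <- tempfile("data")
dir.create(data_dir)
A <- data.frame(id = 1:3)
save(A, file = file.path(data_dir, "A.rda"))
A_reloaded <- read_data_object("A", data_dir)
```

With something like this, preprocess_B.Rmd would contain the same call in both Case 1 and Case 2, presumably passing the package's data/ directory as data_dir.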

mariev, Apr 28 '21 02:04