Replace stateful samples.Rdata logging with structured manifest file (runs_manifest.csv)
Currently run.write.configs acts as a "stateful" logger, appending run information to samples.Rdata. As we scale to hundreds or thousands of sites, this approach becomes inefficient and tightly couples the configuration step with the analysis step.
This issue proposes refactoring the downstream analysis functions (e.g. read.ensemble.output, read.sa.output, etc.) to adopt a stateless design that reads from a manifest, removing the dependency on runtime mutation of samples.Rdata.
run.write.configs generates run ids (e.g. ENS-0001-siteID) and physically saves them into the runs.samples list within samples.Rdata.
The samples.Rdata file grows linearly with the number of sites. The analysis modules depend on this file being constantly updated/mutated by the write step.
Proposed solution:
Refactor the workflow to decouple "parameter definition" from "execution logging" by introducing a lightweight manifest file.
- samples.Rdata becomes static: treat samples.Rdata strictly as a "master parameter definition" file (generated upstream by get.parameter.samples). It should be immutable during the run.write.configs step.
- Introduce runs_manifest.csv: instead of modifying the RData file, run.write.configs will generate a structured CSV file in the output directory. This file explicitly maps run ids to their design parameters.
Proposed structure (runs_manifest.csv):
| run_id | site_id | pft_name | trait | quantile | type |
|---|---|---|---|---|---|
| ENS-0001-siteA | siteA | NA | NA | NA | Ensemble |
| SA-median-siteA | siteA | grass | NA | 0.5 | Sensitivity |
| SA-pft_name-T1-Q1-siteA | siteA | grass | SLA | 0.158 | Sensitivity |
- Update downstream analysis functions (read.ensemble.output, read.sa.output, etc.) to read this CSV manifest.
Logic: look up the run_id where site_id == X and trait == Y.
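
The lookup logic above could be sketched roughly as follows. This is only an illustration under assumptions: `lookup_run_ids` is a hypothetical helper (not an existing PEcAn function), the manifest is written to a temporary directory rather than the real output directory, and the rows simply mirror the example table.

```r
# Hypothetical helper (not part of PEcAn): return the run ids in a manifest
# that match a site, a run type, and optionally a trait.
lookup_run_ids <- function(manifest, site_id, type, trait = NULL) {
  keep <- manifest$site_id == site_id & manifest$type == type
  if (!is.null(trait)) {
    # Rows with no trait (NA) never match a requested trait.
    keep <- keep & !is.na(manifest$trait) & manifest$trait == trait
  }
  manifest$run_id[keep]
}

# Example manifest mirroring the proposed table.
manifest <- data.frame(
  run_id   = c("ENS-0001-siteA", "SA-median-siteA", "SA-pft_name-T1-Q1-siteA"),
  site_id  = c("siteA", "siteA", "siteA"),
  pft_name = c(NA, "grass", "grass"),
  trait    = c(NA, NA, "SLA"),
  quantile = c(NA, 0.5, 0.158),
  type     = c("Ensemble", "Sensitivity", "Sensitivity"),
  stringsAsFactors = FALSE
)

# What run.write.configs might do: write the manifest once, as plain CSV.
# The path here is a stand-in for the real output directory.
manifest_path <- file.path(tempdir(), "runs_manifest.csv")
write.csv(manifest, manifest_path, row.names = FALSE)

# What a downstream reader might do: load the manifest and look up run ids.
# Ids are forced to character so numeric-looking site ids are not coerced.
manifest_in <- read.csv(
  manifest_path,
  stringsAsFactors = FALSE,
  colClasses = c(run_id = "character", site_id = "character")
)

sa_run  <- lookup_run_ids(manifest_in, "siteA", "Sensitivity", trait = "SLA")
ens_run <- lookup_run_ids(manifest_in, "siteA", "Ensemble")
```

Because the reader only filters on columns, it never needs to parse or reconstruct run id strings, so a change to the naming scheme would not affect it.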
I agree with the first part of the issue, but I'm not sure I agree that the solution is to generate the run ids post hoc. That idea will end up sensitive to the naming scheme, which has changed in the past and could easily change in the future. My suggestion was that this info could be in a new file.
That makes sense. Relying on string construction logic post hoc does create a hidden dependency on the folder naming convention and, as you said, creates debt if we ever want to rename folders.
How about having run.write.configs write a lightweight, structured file (e.g. run_manifest.csv) containing the mapping of run id <-> attributes (site, trait, quantile)? Downstream analysis functions would simply read this CSV to look up the correct run id for a given trait/quantile, rather than parsing samples.Rdata or reconstructing names via logic. That way we solve the samples.Rdata mutation issue and keep the run id lookup explicit.
I like this idea. It's also more transparent than stuffing numerous distinct dataframes into one RData object, which is what we're doing now.
Thanks! I have updated the issue.