Replace stateful samples.Rdata logging with structured manifest file (runs_manifest.csv)

Open divine7022 opened this issue 3 weeks ago • 4 comments

Currently run.write.configs acts as a "stateful" logger, appending run information to samples.Rdata. As we scale to hundreds or thousands of sites, this approach becomes inefficient and tightly couples the configuration step with the analysis step.

This issue proposes refactoring the downstream analysis functions (e.g. read.ensemble.output, read.sa.output, etc.) to adopt a stateless design that reads from a manifest, removing the dependency on runtime mutation of samples.Rdata.

run.write.configs generates run ids (e.g. ENS-0001-siteID) and physically saves them into the runs.samples list within samples.Rdata. The samples.Rdata file grows linearly with the number of sites. The analysis modules depend on this file being constantly updated/mutated by the write step.

Proposed solution:

Refactor the workflow to decouple "parameter definition" from "execution logging" by introducing a lightweight manifest file.

  1. samples.Rdata becomes static: treat samples.Rdata strictly as a "master parameter definition" file (generated upstream by get.parameter.samples). It should be immutable during the run.write.configs step.

  2. Introduce runs_manifest.csv: instead of modifying the RData file, run.write.configs will generate a structured CSV file in the output directory. This file explicitly maps run ids to their design parameters.

Proposed structure (runs_manifest.csv):

| run_id | site_id | pft_name | trait | quantile | type |
|---|---|---|---|---|---|
| ENS-0001-siteA | siteA | NA | NA | NA | Ensemble |
| SA-median-siteA | siteA | grass | NA | 0.5 | Sensitivity |
| SA-pft_name-T1-Q1-siteA | siteA | grass | SLA | 0.158 | Sensitivity |

  3. Update downstream analysis functions (read.ensemble.output, read.sa.output, etc.) to read this CSV manifest. Logic: look up the run_id where site_id == X and trait == Y.
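
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the PEcAn implementation: `build_manifest`, `write_manifest`, `read_manifest`, and `lookup_run_id` are hypothetical helper names, and the rows are the example rows from the proposed structure. The write side never touches samples.Rdata, and the read side needs nothing but the CSV.

```r
# Hypothetical helper: one row per run, using the proposed manifest columns.
# In practice run.write.configs would fill these in as it creates each run.
build_manifest <- function() {
  data.frame(
    run_id   = c("ENS-0001-siteA", "SA-median-siteA", "SA-pft_name-T1-Q1-siteA"),
    site_id  = c("siteA", "siteA", "siteA"),
    pft_name = c(NA, "grass", "grass"),
    trait    = c(NA, NA, "SLA"),
    quantile = c(NA, 0.5, 0.158),
    type     = c("Ensemble", "Sensitivity", "Sensitivity"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: write the manifest into the output directory.
write_manifest <- function(manifest, outdir) {
  path <- file.path(outdir, "runs_manifest.csv")
  utils::write.csv(manifest, path, row.names = FALSE)
  invisible(path)
}

# Hypothetical helper: downstream functions read the manifest back.
read_manifest <- function(outdir) {
  utils::read.csv(file.path(outdir, "runs_manifest.csv"),
                  stringsAsFactors = FALSE, na.strings = "NA")
}

# Hypothetical helper: look up run ids where site_id == X and trait == Y.
# which() drops NA comparisons, so ensemble rows (trait is NA) never match.
lookup_run_id <- function(manifest, site_id, trait) {
  hit <- which(manifest$site_id == site_id & manifest$trait == trait)
  manifest$run_id[hit]
}

# Usage: write to a temporary directory, read back, and look up a run id.
outdir <- tempfile("manifest_demo")
dir.create(outdir)
write_manifest(build_manifest(), outdir)
manifest <- read_manifest(outdir)
sla_runs <- lookup_run_id(manifest, site_id = "siteA", trait = "SLA")
print(sla_runs)
```

Wrapping the comparison in `which()` matters because several manifest columns are legitimately NA; a bare logical subset would return NA rows for those runs instead of skipping them.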

divine7022, Dec 02 '25 11:12

I agree with the first part of the issue, but I'm not sure I agree that the solution is to generate the run ids post hoc. That idea will end up sensitive to the naming scheme, which has changed in the past and could easily change in the future. My suggestion was that this info could be in a new file.

mdietze, Dec 03 '25 01:12

That makes sense. Relying on post hoc string construction creates a hidden dependency on the folder naming convention and, as you said, creates debt if we ever want to rename folders.

How about having run.write.configs write a lightweight, structured file (e.g. runs_manifest.csv) containing the mapping of run id <-> attributes (site, trait, quantile)? Downstream analysis functions would simply read this CSV to look up the correct run id for a given trait/quantile, rather than parsing samples.Rdata or reconstructing names via string logic. That way we solve the samples.Rdata mutation issue and keep the run id lookup explicit.

divine7022, Dec 04 '25 09:12

I like this idea. It's also more transparent than stuffing numerous distinct dataframes into one RData object, which is what we're doing now.

mdietze, Dec 04 '25 17:12

Thanks! I have updated the issue.

divine7022, Dec 04 '25 18:12