pecan Replace stateful samples.Rdata logging with structured manifest file (runs_manifest.csv)

Currently, run.write.configs acts as a "stateful" logger, appending run information to samples.Rdata. As we scale to hundreds or thousands of sites, this approach becomes inefficient and tightly couples the configuration step to the analysis step.

This issue proposes refactoring the downstream analysis functions (e.g. read.ensemble.output, read.sa.output, etc.) to adopt a stateless design that reads from a manifest, removing the dependency on runtime mutation of samples.Rdata.

run.write.configs generates run ids (e.g. ENS-0001-siteID) and physically saves them into the runs.samples list within samples.Rdata. The samples.Rdata file grows linearly with the number of sites. The analysis modules depend on this file being constantly updated/mutated by the write step.

Proposed workaround:

Refactor the workflow to decouple "parameter definition" from "execution logging" by introducing a lightweight manifest file.

samples.Rdata becomes static: treat samples.Rdata strictly as a "master parameter definition" file (generated upstream by get.parameter.samples). It should be immutable during the run.write.configs step.
Introduce runs_manifest.csv: instead of modifying the RData file, run.write.configs will generate a structured CSV file in the output directory. This file explicitly maps run ids to their design parameters.

Proposed structure (runs_manifest.csv):

run_id	site_id	pft_name	trait	quantile	type
ENS-0001-siteA	siteA	NA	NA	NA	Ensemble
SA-median-siteA	siteA	grass	NA	0.5	Sensitivity
SA-pft_name-T1-Q1-siteA	siteA	grass	SLA	0.158	Sensitivity
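
A minimal sketch of how such a file could be written from R, using only base functions. The helper name write_runs_manifest, its arguments, and the validation checks are assumptions made for illustration; they are not existing PEcAn code, and the real run.write.configs would build the rows from its own run loop.

```r
# Hypothetical helper (not part of PEcAn): validate and write the manifest.
write_runs_manifest <- function(manifest, outdir) {
  required <- c("run_id", "site_id", "pft_name", "trait", "quantile", "type")
  missing_cols <- setdiff(required, names(manifest))
  if (length(missing_cols) > 0) {
    stop("manifest is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Each run id should identify exactly one row
  if (anyDuplicated(manifest$run_id) > 0) {
    stop("run_id values must be unique")
  }
  path <- file.path(outdir, "runs_manifest.csv")
  # write.csv writes missing values as NA by default
  write.csv(manifest[, required], path, row.names = FALSE)
  invisible(path)
}

# Rows copied from the proposed structure above
manifest <- data.frame(
  run_id   = c("ENS-0001-siteA", "SA-median-siteA", "SA-pft_name-T1-Q1-siteA"),
  site_id  = c("siteA", "siteA", "siteA"),
  pft_name = c(NA, "grass", "grass"),
  trait    = c(NA, NA, "SLA"),
  quantile = c(NA, 0.5, 0.158),
  type     = c("Ensemble", "Sensitivity", "Sensitivity"),
  stringsAsFactors = FALSE
)

# tempdir() stands in for the workflow output directory
manifest_path <- write_runs_manifest(manifest, tempdir())
```

Writing the file once, after all run ids are known, would keep samples.Rdata untouched during this step.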

Update the downstream analysis functions (read.ensemble.output, read.sa.output, etc.) to read this CSV manifest. Logic: look up the run_id where site_id == X and trait == Y.
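
The lookup could be sketched roughly as follows. The function lookup_run_ids and its arguments are hypothetical, as is the choice to treat rows with a missing trait as non-matching. The manifest is parsed from inline text only to keep the example self-contained; in the workflow it would presumably be read with read.csv from the output directory.

```r
# Hypothetical helper (not part of PEcAn): filter the manifest by attributes.
lookup_run_ids <- function(manifest, site_id, trait = NULL, type = NULL) {
  # %in% never returns NA, so rows with missing values simply do not match
  keep <- manifest$site_id %in% site_id
  if (!is.null(trait)) {
    keep <- keep & manifest$trait %in% trait
  }
  if (!is.null(type)) {
    keep <- keep & manifest$type %in% type
  }
  manifest$run_id[keep]
}

# Inline copy of the proposed manifest rows
manifest <- read.csv(text = "run_id,site_id,pft_name,trait,quantile,type
ENS-0001-siteA,siteA,NA,NA,NA,Ensemble
SA-median-siteA,siteA,grass,NA,0.5,Sensitivity
SA-pft_name-T1-Q1-siteA,siteA,grass,SLA,0.158,Sensitivity",
  stringsAsFactors = FALSE)

# Run ids for site_id == "siteA" and trait == "SLA"
sla_runs <- lookup_run_ids(manifest, site_id = "siteA", trait = "SLA")

# Ensemble runs for the same site
ens_runs <- lookup_run_ids(manifest, site_id = "siteA", type = "Ensemble")
```

Because the run id is read from a column rather than rebuilt from its parts, a lookup like this would not depend on the run id naming scheme.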

Dec 02 '25 11:12 divine7022

I agree with the first part of the issue, but I'm not sure I agree that the solution is to generate the run ids post hoc. That idea will end up sensitive to the naming scheme, which has changed in the past and could easily change in the future. My suggestion was that this info could be in a new file.

Dec 03 '25 01:12 mdietze

That makes sense. Relying on string construction logic 'post hoc' does create a hidden dependency on the folder naming convention and, as you said, creates debt if we ever want to rename folders.

How about having run.write.configs write a lightweight, structured file (e.g. run_manifest.csv) containing the mapping of run id <-> attributes (site, trait, quantile)? Downstream analysis functions would simply read this CSV to look up the correct run id for a given trait/quantile, rather than parsing samples.Rdata or reconstructing names via logic. That way we solve the samples.Rdata mutation issue and keep the run id lookup explicit.

Dec 04 '25 09:12 divine7022

I like this idea. It's also more transparent than stuffing numerous distinct dataframes into one RData object, which is what we're doing now.

Dec 04 '25 17:12 mdietze

Thanks! I have updated the issue.

Dec 04 '25 18:12 divine7022
