Fix reproducibility, refactor `simulate_data.py`, use functions in tests

Open louismagowan opened this issue 1 month ago • 0 comments

Problem

The `causalpy/data/simulate_data.py` module has some reproducibility issues and could also use a bit of refactoring.

Reproducibility

The module declares a seeded RNG but doesn't use it consistently:

rng = np.random.default_rng(RANDOM_SEED)  # Declared on line 27, only used once

# Most functions draw from unseeded global state instead:
norm(0, 0.25).rvs(N)           # scipy.stats uses global numpy state
np.random.choice(2, size=N)     # Uses global numpy state

Result: Functions produce different data each run. Generated CSV files cannot be reproduced.
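
To illustrate the fix, here is a minimal sketch of routing every draw through a seeded `Generator`. The function name `draw_noise` and the seed value are hypothetical, not taken from the module. If keeping scipy is preferred, `norm(0, 0.25).rvs(N, random_state=rng)` should also work, since `rvs` accepts a `random_state` argument.

```python
import numpy as np

RANDOM_SEED = 8927  # illustrative value; the module's actual seed may differ


def draw_noise(n, seed=RANDOM_SEED):
    """Hypothetical helper: draw n normal samples from a locally seeded Generator."""
    rng = np.random.default_rng(seed)
    # Equivalent in distribution to norm(0, 0.25).rvs(n), but independent of
    # numpy's global random state.
    return rng.normal(0, 0.25, size=n)


first = draw_noise(5)
second = draw_noise(5)
same = np.array_equal(first, second)  # identical draws for the same seed
```

Because the generator is created inside the function from an explicit seed, repeated calls and repeated runs give the same array, regardless of what else has touched `np.random`.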

CSV Usage:

  • Many CSVs committed to git
  • Cannot regenerate them deterministically

Proposed Solution

  1. Add seed parameter to all generation functions
  2. Replace norm().rvs() with rng.normal(), dirichlet().rvs() with rng.dirichlet(), etc.
  3. Delete generated CSV files; use pytest fixtures instead
  4. Update tests to generate data dynamically
  5. Fix bug: create_series() ignores length_scale parameter (line 488)
  6. Reduce duplication (lines 87-93: repeated function calls)
  7. Other light-touch refactoring (separation of responsibilities in functions; reduce LOC in _smoothed_gaussian_random_walk, lines 87-5)
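
Steps 1 to 4 could look roughly like the sketch below. `generate_example_data` and its parameters are hypothetical and only stand in for the real generation functions. One thing to watch when swapping calls: `rng.dirichlet(alpha)` returns a 1-D array, whereas scipy's `dirichlet(alpha).rvs()` returns a 2-D array with one row, so downstream indexing may need adjusting.

```python
import numpy as np


def generate_example_data(n_obs=100, n_groups=3, seed=None):
    """Hypothetical generator: all randomness comes from one seeded Generator."""
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet(np.ones(n_groups))  # instead of dirichlet(...).rvs()
    noise = rng.normal(0, 0.25, size=n_obs)     # instead of norm(0, 0.25).rvs(n_obs)
    treated = rng.choice(2, size=n_obs)         # instead of np.random.choice(2, size=n_obs)
    return {"weights": weights, "noise": noise, "treated": treated}


def test_generation_is_deterministic():
    """Tests build their data on the fly instead of reading a committed CSV."""
    a = generate_example_data(seed=42)
    b = generate_example_data(seed=42)
    for key in a:
        assert np.array_equal(a[key], b[key])
```

In the test suite, the call to `generate_example_data(seed=42)` could be wrapped in a pytest fixture so several tests share the same dataset without any CSV on disk.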

louismagowan, Nov 02 '25 00:11