System definition

Open pearce8 opened this issue 1 year ago • 1 comments

Updated:

What should the user use as a starter for their system definition?

Below the dashed line is the original system config design. Things have changed since then, and we need to:

  • [ ] describe the new system definition in the docs (@dyokelson)
  • [ ] think about simplifying creation of a new system (@scheibelp)
  • [ ] provide instructions on how to find the system with the most similar hardware (search table? search benchpark?) (@dyokelson)
  • [ ] provide instructions on how to define a new system
  • [ ] provide instructions on how to go about software stack definition (compilers, MPI, ROCm version, ...) (@dyokelson)
  • [ ] provide instructions (or tests, i.e. a small suite of tests with known-correct outcomes) on how to verify that the system definition is correct.

----------------------------------------

System configs currently contain different types of information, which serve different purposes:

  1. Hardware specification
     • where defined: system_definition.yaml
     • systems it applies to: a class of systems at different sites
     • longevity: the lifetime of the system (or class of systems)
     • purpose of record: find a system with the same hardware as my system. May want to record with the experiment.
  2. Software stack: compiler and MPI locations
     • where defined: Optional?!? compilers.yaml
     • systems it applies to: just ours? Can we autodetect?
     • longevity: ?
     • purpose of record: give users a starting point for running on their system. What errors and guidance for mitigation should we give? Do we want these upstreamed back? Do we want these recorded in the experiment?
  3. Software stack: compiler and MPI versions
     • where defined: compilers.yaml
     • systems it applies to: different machines could be at different versions
     • longevity: new versions can appear any time
     • purpose of record: give users a starting point; these also need to be recorded as part of the experiment and used to debug or compare performance. Probably want to let users parameterize versions and set up the versions to use as part of their suites.
  4. Scheduler, launcher
     • where defined: variables.yaml
     • systems it applies to: many. Probably need a Slurm and a Flux scheduler definition, auto-generated for the user when they tell us which it is (can we autodetect?). Probably need to define a few launchers and pick one (mpirun, srun, ...)
     • longevity: static, except that the queue info is unfortunately baked in here.
     • purpose of record: give users a starting point. Probably don't want these upstreamed; may not need to record.
  5. Software packages we don't want to keep building
     • where defined: Optional! packages.yaml
     • systems it applies to: probably just ours. May be able to find these using spack external find.
     • longevity: may want to update versions over time.
     • purpose of record: shorten build time. We do not want these upstreamed, but we want to be able to record them for our own experiments/CI etc.
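
One way to prototype the "can we autodetect?" question for the scheduler and launcher is to probe PATH for well-known commands. This is only a minimal sketch: the function names, the returned labels, and the assumption that `sbatch` on PATH implies Slurm and `flux` on PATH implies Flux are all hypothetical and not part of benchpark.

```python
import shutil

# Executable whose presence on PATH is taken as a hint for each scheduler.
# This mapping is an assumption for illustration, not benchpark behavior.
SCHEDULER_HINTS = {"flux": "flux", "slurm": "sbatch"}

# (launch command, executable to look for), in assumed order of preference.
LAUNCHER_CANDIDATES = {
    "flux": [("flux run", "flux"), ("mpirun", "mpirun")],
    "slurm": [("srun", "srun"), ("mpirun", "mpirun")],
    None: [("mpirun", "mpirun")],
}


def detect_scheduler(which=shutil.which):
    """Return 'flux', 'slurm', or None, based on commands found on PATH.

    Hints are checked in dictionary order; a real tool would need a
    tie-break rule for systems where both commands exist.
    """
    for name, executable in SCHEDULER_HINTS.items():
        if which(executable):
            return name
    return None


def pick_launcher(scheduler, which=shutil.which):
    """Return the first available launch command for the scheduler, or None."""
    for command, executable in LAUNCHER_CANDIDATES[scheduler]:
        if which(executable):
            return command
    return None


if __name__ == "__main__":
    scheduler = detect_scheduler()
    print("scheduler:", scheduler, "| launcher:", pick_launcher(scheduler))
```

The `which` parameter exists only so the lookup can be replaced in tests. A detected value would still need to be confirmed by the user, since a command on PATH does not guarantee the scheduler is the one actually managing the machine.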

We should probably define a graded approach for generating these:

  • only introduce a new hardware specification if one like it indeed does not exist.
  • if a hardware specification exists, pick the scheduler and launcher. How should the user then start defining the compilers and MPI to use?
  • versions should be parameterized
  • optional things can be added later (if desired)
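
The graded approach above can be sketched as a small decision function. Everything here is an assumption made for illustration: the `Hardware` fields, the exact-match rule, and the system name in the usage example are hypothetical and do not reflect the actual system_definition.yaml schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Hardware:
    # Hypothetical fields; the real hardware specification may record more.
    cpu_arch: str
    gpu: Optional[str]
    interconnect: str


def find_existing(mine: Hardware, known: Dict[str, Hardware]) -> Optional[str]:
    """Return the name of a known system class with identical hardware, else None."""
    for name, hardware in known.items():
        if hardware == mine:
            return name
    return None


def plan(mine: Hardware, known: Dict[str, Hardware]) -> List[str]:
    """List the steps of the graded approach for one user system."""
    match = find_existing(mine, known)
    steps = []
    if match is None:
        # Only introduce a new specification when nothing like it exists.
        steps.append("add a new hardware specification")
    else:
        steps.append(f"reuse hardware specification '{match}'")
    steps.append("pick scheduler and launcher")
    steps.append("define compilers and MPI, with versions as parameters")
    steps.append("optionally add prebuilt packages later")
    return steps


if __name__ == "__main__":
    known = {"cluster-a": Hardware("x86_64", None, "infiniband")}
    for step in plan(Hardware("x86_64", None, "infiniband"), known):
        print("-", step)
```

Exact equality is the simplest possible matching rule; finding the "most similar" system would need a looser comparison, which is left open here.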

pearce8 commented on Jan 05 '24 00:01

> We should probably define a graded approach for generating these:
> only introduce a new hardware specification if one like it indeed does not exist.

To be clear, are you saying we shouldn't define a new file format like foo.yaml if foo.yaml includes details already in other yaml files?

system_definitions.yaml contains duplicate entries, but that's "by design" since it's supposed to be a human-readable aggregation of other details.

> Software stack: compiler and MPI locations
> where defined: Optional?!? compilers.yaml

Spack auto-detects compilers as needed. All systems except x86 probably want to explicitly define a sensible default though.

> What errors and guidance for mitigation should we give?

I do not have much experience anticipating and pre-handling issues with the wrong compilers being used. The "more-different" a user's system is, the more they should consider defining this themselves. This generally isn't an issue until Spack gets to the build phase (e.g. the wrong compiler can't generate build artifacts). The exact error message can depend on the compilers, but errors can also propagate into higher-level issues (e.g. building a different version may change the C++ standard, which generates its own set of errors depending on whether particular compiler versions support that standard).

> Software stack: compiler and MPI versions
> where defined: compilers.yaml

MPI versions are not defined in compilers.yaml; they might be defined in packages.yaml.

> longevity: new versions can appear any time

When you say they can appear at any time, do you mean that the user could add a compiler definition at any time? Spack won't search for compilers if any are already defined, and it doesn't search for external packages without prompting.

scheibelp commented on Jan 08 '24 17:01