Build script and Nexus support for CSCS Piz Daint
Please review the developer documentation on the wiki of this project that contains help and requirements.
Proposed changes
Describe what this PR changes and why. If it closes an issue, link to it here with a supported keyword.
Added a configuration script for Daint (https://user.cscs.ch/access/running/piz_daint/) in config/build_cscs_daint.sh and added daint in the nexus machines
What type(s) of changes does this code introduce?
Delete the items that do not apply
- Build related changes
- Other (please describe): new machine in nexus
Does this introduce a breaking change?
- No
What systems has this change been tested on?
Daint
Checklist
Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is simply a reminder of what we are going to look for before merging your code.
- Yes. This PR is up to date with current the current state of 'develop'
- No. Code added or changed in the PR has been clang-formatted
- No. This PR adds tests to cover any new code, or to catch a bug that is being fixed
- No. Documentation has been added (if appropriate)
Next time. Please always start making a PR by starting from develop to to help keep a clean history. Don't need to change this time. I will enable squash merge.
Thanks Andrea. I changed the title to be more informative. It makes e.g. writing the release notes much easier.
Test this please
@zenandrea ntest_nexus_machines seems failing.
649/914 Test #1645: ntest_nexus_machines ..............................................................................***Failed 3.02 sec
Test name : machines
Test sublabel : test_process_job
Test exception: "AssertionError: "
Test backtrace:
File "/__w/qmcpack/qmcpack/nexus/bin/nxs-test", line 478, in run
self.operation()
File "/__w/qmcpack/qmcpack/nexus/bin/nxs-test", line 991, in machines
nunit('process_job')
File "/__w/qmcpack/qmcpack/nexus/bin/nxs-test", line 349, in nunit
run_external_unit_test(test_name,unit_test)
File "/__w/qmcpack/qmcpack/nexus/bin/nxs-test", line 388, in run_external_unit_test
unit_test()
File "/__w/qmcpack/qmcpack/nexus/tests/unit/test_machines.py", line 923, in test_process_job
assert(job.processes==job.nodes*job.processes_per_node)
Test status: fail
I took a look at this and see that it is correctly failing in test_machines.py around line 923. This checks for consistency between the total and per node process counts. The test uses random values that seemingly can't be satisfied the way the Daint machine is current implemented. I suggest to sync with @jtkrogel about what to do here. I don't see another machine with similar logic, so perhaps the test should be revised or some assumptions about how these machine specifications are written needed to be documented.
# perform idempotency test
machine_idempotent = True
for job_input in job_inputs:
job = Job(machine=machine.name,**job_input)
assert(isinstance(job.processes,int))
assert(isinstance(job.nodes,int))
if job.processes_per_node is not None:
assert(job.processes==job.nodes*job.processes_per_node)
@zenandrea the needed updates to the machines tests can be found the following way:
nxs-test -R machines --job_ref_table | grep daint
Use the results printed with this command (and also the full statement bracketed by '''...''' for daint in the non-grepped output) to update the reference table in nexus/tests/unit/test_machines.py. In that file you will find similar information already present for all other machines (e.g. search on summit).
Following the addition of these lines, nxs-test -R machines should pass.
This PR should go in. If there are any dangling issues, I will resolve them afterwards.
Did the Nexus tests ever pass?
(I have hopefully fixed the conflict correctly. Will see if the CI passes.)
Test this please
The following tests FAILED:
1549 - ntest_nexus_machines (Failed)
Errors while running CTest
We can't merge this without either breaking the CI for ~every build or exempting Nexus from testing. Better to fix here. A short write up of how to correctly add a machine would allow me to fix this.