Build script and Nexus support for CSCS Piz Daint

Open zenandrea opened this issue 3 years ago • 11 comments

Please review the developer documentation on the wiki of this project that contains help and requirements.

Proposed changes

Describe what this PR changes and why. If it closes an issue, link to it here with a supported keyword.

Added a configuration script for Daint (https://user.cscs.ch/access/running/piz_daint/) in config/build_cscs_daint.sh and added Daint to the Nexus machines.

What type(s) of changes does this code introduce?

Delete the items that do not apply

  • Build related changes
  • Other (please describe): new machine in nexus

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

Daint

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is simply a reminder of what we are going to look for before merging your code.

  • Yes. This PR is up to date with the current state of 'develop'
  • No. Code added or changed in the PR has been clang-formatted
  • No. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • No. Documentation has been added (if appropriate)

zenandrea avatar Jun 20 '22 16:06 zenandrea

Next time, please always start a PR from develop to help keep a clean history. No need to change it this time; I will enable squash merge.

ye-luo avatar Jun 20 '22 16:06 ye-luo

Thanks Andrea. I changed the title to be more informative. It makes e.g. writing the release notes much easier.

prckent avatar Jun 20 '22 16:06 prckent

Test this please

prckent avatar Jun 20 '22 16:06 prckent

@zenandrea ntest_nexus_machines seems to be failing.

649/914 Test #1645: ntest_nexus_machines ..............................................................................***Failed    3.02 sec
Test name     : machines
Test sublabel : test_process_job
Test exception: "AssertionError: "
Test backtrace:
  File "/__w/qmcpack/qmcpack/nexus/bin/nxs-test", line 478, in run
    self.operation()
  File "/__w/qmcpack/qmcpack/nexus/bin/nxs-test", line 991, in machines
    nunit('process_job')
  File "/__w/qmcpack/qmcpack/nexus/bin/nxs-test", line 349, in nunit
    run_external_unit_test(test_name,unit_test)
  File "/__w/qmcpack/qmcpack/nexus/bin/nxs-test", line 388, in run_external_unit_test
    unit_test()
  File "/__w/qmcpack/qmcpack/nexus/tests/unit/test_machines.py", line 923, in test_process_job
    assert(job.processes==job.nodes*job.processes_per_node)
Test status: fail

ye-luo avatar Jun 20 '22 17:06 ye-luo

I took a look at this and see that it is correctly failing in test_machines.py around line 923. This checks for consistency between the total and per-node process counts. The test uses random values that seemingly can't be satisfied the way the Daint machine is currently implemented. I suggest syncing with @jtkrogel about what to do here. I don't see another machine with similar logic, so perhaps the test should be revised, or some assumptions about how these machine specifications are written need to be documented.

        # perform idempotency test
        machine_idempotent = True
        for job_input in job_inputs:
            job = Job(machine=machine.name,**job_input)
            assert(isinstance(job.processes,int))
            assert(isinstance(job.nodes,int))
            if job.processes_per_node is not None:
                assert(job.processes==job.nodes*job.processes_per_node)
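
For illustration only, the invariant behind that assertion can be sketched outside of Nexus. The helper name counts_are_consistent and the numbers below are hypothetical and are not part of the Nexus API; the sketch only assumes what the excerpt shows, namely that the check is skipped when processes_per_node is None and otherwise requires processes == nodes * processes_per_node.

```python
def counts_are_consistent(processes, nodes, processes_per_node):
    """Hypothetical stand-alone mirror of the assertion in the test excerpt."""
    # The test requires integer totals before checking consistency.
    if not isinstance(processes, int) or not isinstance(nodes, int):
        return False
    # The test skips the product check when no per-node count is set.
    if processes_per_node is None:
        return True
    return processes == nodes * processes_per_node


# Illustrative values only: 2 nodes with 12 processes each give 24 in total.
consistent = counts_are_consistent(24, 2, 12)
# A machine definition that ends up with 20 processes on the same layout
# would trip the assertion.
inconsistent = counts_are_consistent(20, 2, 12)
```

A machine class that adjusts one of these three quantities without updating the others would fail in the same way for at least some of the randomly chosen job inputs.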

prckent avatar Jun 28 '22 17:06 prckent

@zenandrea the needed updates to the machines tests can be found as follows:

nxs-test -R machines --job_ref_table | grep daint

Use the results printed with this command (and also the full statement bracketed by '''...''' for daint in the non-grepped output) to update the reference table in nexus/tests/unit/test_machines.py. In that file you will find similar information already present for all other machines (e.g. search on summit).

Following the addition of these lines, nxs-test -R machines should pass.

jtkrogel avatar Jul 21 '22 20:07 jtkrogel

This PR should go in. If there are any dangling issues, I will resolve them afterwards.

jtkrogel avatar Jun 23 '23 21:06 jtkrogel

Did the Nexus tests ever pass?

prckent avatar Jun 23 '23 21:06 prckent

(I have hopefully fixed the conflict correctly. Will see if the CI passes.)

prckent avatar Jun 23 '23 21:06 prckent

Test this please

prckent avatar Jun 23 '23 21:06 prckent

The following tests FAILED:
	1549 - ntest_nexus_machines (Failed)
Errors while running CTest

We can't merge this without either breaking the CI for ~every build or exempting Nexus from testing. Better to fix it here. A short write-up of how to correctly add a machine would allow me to fix this.

prckent avatar Jun 23 '23 21:06 prckent