
DAOS-15615 test: Clear existing tmpfs mount points before running tests

Open phender opened this issue 1 year ago • 10 comments

Some tests occasionally fail to start servers in CI due to insufficient available memory caused by leftover DAOS mount points from a previous test. This change adds an option to launch.py to provide a filter which, if specified, is used to unmount and remove the directory for any mounted tmpfs filesystems matching the filter. When using --mode=ci the filter is set to /mnt/daos.
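The filter matching described above can be sketched as follows. This is a minimal illustration, not the actual launch.py code; the helper name `matching_tmpfs_mounts` is an assumption, but the `df`/`grep` pattern mirrors the one visible in the logs below:

```python
import re

def matching_tmpfs_mounts(df_output, filters):
    """Return mount points from `df --type=tmpfs --output=target` output
    that exactly match one of the filter paths (hypothetical helper)."""
    pattern = re.compile(r"^(" + "|".join(re.escape(f) for f in filters) + r")$")
    return [line for line in df_output.splitlines() if pattern.match(line)]

# Example: only /mnt/daos is mounted, so only it matches the CI filter set.
output = "Mounted on\n/mnt/daos\n/dev/shm\n"
print(matching_tmpfs_mounts(output, ["/mnt/daos", "/mnt/daos0", "/mnt/daos1"]))
# -> ['/mnt/daos']
```

Each matching mount point would then be unmounted and its directory removed, as the job.log excerpts later in this thread show.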

Skip-unit-tests: true
Skip-fault-injection-test: true

Required-githooks: true

Before requesting gatekeeper:

  • [ ] Two review approvals and any prior change requests have been resolved.
  • [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • [ ] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • [ ] Commit messages follow the guidelines outlined here.
  • [ ] Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • [ ] You are the appropriate gatekeeper to be landing the patch.
  • [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • [ ] Githooks were used. If not, request that user install them and check copyright dates.
  • [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • [ ] All builds have passed. Check non-required builds for any new compiler warnings.
  • [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • [ ] If applicable, the PR has addressed any potential version compatibility issues.
  • [ ] Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • [ ] Extra checks if forced landing is requested
    • [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • [ ] No new NLT or valgrind warnings. Check the classic view.
    • [ ] Quick-build or Quick-functional is not used.
  • [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

phender avatar May 01 '24 23:05 phender

Ticket title is 'NvmeFault.test_nvme_fault: Available memory (RAM) insufficient for configured' Status is 'In Progress' Labels: 'triaged,weekly_test' https://daosio.atlassian.net/browse/DAOS-15615

github-actions[bot] avatar May 01 '24 23:05 github-actions[bot]

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14295/1/execution/node/806/log

daosbuild1 avatar May 02 '24 01:05 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14295/2/execution/node/806/log

daosbuild1 avatar May 02 '24 15:05 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14295/4/execution/node/801/log

daosbuild1 avatar May 03 '24 23:05 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14295/5/execution/node/809/log

daosbuild1 avatar May 06 '24 22:05 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14295/6/execution/node/809/log

daosbuild1 avatar May 08 '24 05:05 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14295/7/execution/node/784/log

daosbuild1 avatar May 13 '24 16:05 daosbuild1

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14295/9/display/redirect

daosbuild1 avatar May 14 '24 03:05 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14295/9/display/redirect

daosbuild1 avatar May 14 '24 03:05 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14295/10/execution/node/809/log

daosbuild1 avatar May 16 '24 07:05 daosbuild1

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14295/12/display/redirect

daosbuild1 avatar May 17 '24 00:05 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14295/12/display/redirect

daosbuild1 avatar May 17 '24 02:05 daosbuild1

All tests passed in https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14295/13/testReport/

phender avatar May 20 '24 14:05 phender

Example: https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14295/13/artifact/Functional%20Hardware%20Large/launch/functional_hardware_large/job.log

  • Before running the first test, we see no existing tmpfs mount points:
2024/05/17 09:53:32 DEBUG            _clear_mount_points: --------------------------------------------------------------------------------
2024/05/17 09:53:32 DEBUG            _clear_mount_points: Clearing existing mount points on wolf-[51,110-117]: ['/mnt/daos', '/mnt/daos0', '/mnt/daos1']
2024/05/17 09:53:32 DEBUG                     run_remote: Running on wolf-[51,110-117] with a 120 second timeout:  df --type=tmpfs --output=target | grep -E '^(/mnt/daos|/mnt/daos0|/mnt/daos1)$'
2024/05/17 09:53:32 DEBUG                log_result_data:   wolf-[51,110-117] (rc=1): <no output>
2024/05/17 09:53:32 DEBUG _remove_shared_memory_segments: Clearing existing shared memory segments on wolf-[51,110-117]
2024/05/17 09:53:32 DEBUG                     run_remote: Running on wolf-[51,110-117] with a 120 second timeout: ipcs -m
2024/05/17 09:53:32 DEBUG                log_result_data:   wolf-[51,110-117] (rc=0):
2024/05/17 09:53:32 DEBUG                log_result_data:     
2024/05/17 09:53:32 DEBUG                log_result_data:     ------ Shared Memory Segments --------
2024/05/17 09:53:32 DEBUG                log_result_data:     key        shmid      owner      perms      bytes      nattch     status      
2024/05/17 09:53:32 DEBUG                log_result_data:     
2024/05/17 09:53:32 DEBUG                _generate_certs: --------------------------------------------------------------------------------
  • After a test has run, we then need to clean up the shared memory segments:
2024/05/17 09:59:14 DEBUG            _clear_mount_points: --------------------------------------------------------------------------------
2024/05/17 09:59:14 DEBUG            _clear_mount_points: Clearing existing mount points on wolf-[51,110-117]: ['/mnt/daos', '/mnt/daos0', '/mnt/daos1']
2024/05/17 09:59:14 DEBUG                     run_remote: Running on wolf-[51,110-117] with a 120 second timeout:  df --type=tmpfs --output=target | grep -E '^(/mnt/daos|/mnt/daos0|/mnt/daos1)$'
2024/05/17 09:59:14 DEBUG                log_result_data:   wolf-[51,110-117] (rc=1): <no output>
2024/05/17 09:59:14 DEBUG _remove_shared_memory_segments: Clearing existing shared memory segments on wolf-[51,110-117]
2024/05/17 09:59:14 DEBUG                     run_remote: Running on wolf-[51,110-117] with a 120 second timeout: ipcs -m
2024/05/17 09:59:15 DEBUG                log_result_data:   wolf-[113-114] (rc=0):
2024/05/17 09:59:15 DEBUG                log_result_data:     
2024/05/17 09:59:15 DEBUG                log_result_data:     ------ Shared Memory Segments --------
2024/05/17 09:59:15 DEBUG                log_result_data:     key        shmid      owner      perms      bytes      nattch     status      
2024/05/17 09:59:15 DEBUG                log_result_data:     0x10242048 12         daos_serve 660        1277952    0                       
2024/05/17 09:59:15 DEBUG                log_result_data:     0x10242049 13         daos_serve 660        1277952    0                       
2024/05/17 09:59:15 DEBUG                log_result_data:     
2024/05/17 09:59:15 DEBUG                log_result_data:   wolf-[110-112] (rc=0):
2024/05/17 09:59:15 DEBUG                log_result_data:     
2024/05/17 09:59:15 DEBUG                log_result_data:     ------ Shared Memory Segments --------
2024/05/17 09:59:15 DEBUG                log_result_data:     key        shmid      owner      perms      bytes      nattch     status      
2024/05/17 09:59:15 DEBUG                log_result_data:     0x10242049 12         daos_serve 660        1277952    0                       
2024/05/17 09:59:15 DEBUG                log_result_data:     0x10242048 13         daos_serve 660        1277952    0                       
2024/05/17 09:59:15 DEBUG                log_result_data:     
2024/05/17 09:59:15 DEBUG                log_result_data:   wolf-[51,115-117] (rc=0):
2024/05/17 09:59:15 DEBUG                log_result_data:     
2024/05/17 09:59:15 DEBUG                log_result_data:     ------ Shared Memory Segments --------
2024/05/17 09:59:15 DEBUG                log_result_data:     key        shmid      owner      perms      bytes      nattch     status      
2024/05/17 09:59:15 DEBUG                log_result_data:     
2024/05/17 09:59:15 DEBUG _remove_shared_memory_segments: Clearing shared memory segment 0x10242048 on wolf-[110-114]:
2024/05/17 09:59:15 DEBUG                     run_remote: Running on wolf-[110-114] with a 120 second timeout: sudo ipcrm -M 0x10242048
2024/05/17 09:59:15 DEBUG                log_result_data:   wolf-[110-114] (rc=0): <no output>
2024/05/17 09:59:15 DEBUG _remove_shared_memory_segments: Clearing shared memory segment 0x10242049 on wolf-[110-114]:
2024/05/17 09:59:15 DEBUG                     run_remote: Running on wolf-[110-114] with a 120 second timeout: sudo ipcrm -M 0x10242049
2024/05/17 09:59:15 DEBUG                log_result_data:   wolf-[110-114] (rc=0): <no output>
2024/05/17 09:59:15 DEBUG                _generate_certs: --------------------------------------------------------------------------------
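The segment detection in the excerpt above can be sketched as parsing `ipcs -m` output for keys owned by the server user, then issuing `sudo ipcrm -M <key>` for each. A minimal sketch, assuming a hypothetical helper name; note that `ipcs` truncates long owner names, which is why `daos_server` appears as `daos_serve` in the logs:

```python
def shm_keys_for_owner(ipcs_output, owner_prefix):
    """Extract shared memory segment keys from `ipcs -m` output whose
    owner column starts with the given prefix (hypothetical helper)."""
    keys = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key; header and separator rows do not.
        if len(fields) >= 3 and fields[0].startswith("0x") and fields[2].startswith(owner_prefix):
            keys.append(fields[0])
    return keys

sample = """
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x10242048 12         daos_serve 660        1277952    0
0x10242049 13         daos_serve 660        1277952    0
"""
print(shm_keys_for_owner(sample, "daos_serve"))
# -> ['0x10242048', '0x10242049']
```

Each returned key maps to one `sudo ipcrm -M <key>` invocation on the hosts where it was seen, matching the `_remove_shared_memory_segments` lines above.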
  • Here we clean up both a mount point and shared memory segments:
2024/05/18 12:11:06 DEBUG            _clear_mount_points: --------------------------------------------------------------------------------
2024/05/18 12:11:06 DEBUG            _clear_mount_points: Clearing existing mount points on wolf-[137-141]: ['/mnt/daos', '/mnt/daos0', '/mnt/daos1']
2024/05/18 12:11:06 DEBUG                     run_remote: Running on wolf-[137-141] with a 120 second timeout:  df --type=tmpfs --output=target | grep -E '^(/mnt/daos|/mnt/daos0|/mnt/daos1)$'
2024/05/18 12:11:07 DEBUG                log_result_data:   wolf-138 (rc=0): /mnt/daos
2024/05/18 12:11:07 DEBUG                log_result_data:   wolf-[137,139-141] (rc=1): <no output>
2024/05/18 12:11:07 DEBUG           _remove_super_blocks: Clearing existing super blocks on wolf-138
2024/05/18 12:11:07 DEBUG                     run_remote: Running on wolf-138 with a 120 second timeout: sudo rm -fr /mnt/daos/*
2024/05/18 12:11:07 DEBUG                log_result_data:   wolf-138 (rc=0): <no output>
2024/05/18 12:11:07 DEBUG _remove_shared_memory_segments: Clearing existing shared memory segments on wolf-[137-141]
2024/05/18 12:11:07 DEBUG                     run_remote: Running on wolf-[137-141] with a 120 second timeout: ipcs -m
2024/05/18 12:11:07 DEBUG                log_result_data:   wolf-137 (rc=0):
2024/05/18 12:11:07 DEBUG                log_result_data:     
2024/05/18 12:11:07 DEBUG                log_result_data:     ------ Shared Memory Segments --------
2024/05/18 12:11:07 DEBUG                log_result_data:     key        shmid      owner      perms      bytes      nattch     status      
2024/05/18 12:11:07 DEBUG                log_result_data:     
2024/05/18 12:11:07 DEBUG                log_result_data:   wolf-138 (rc=0):
2024/05/18 12:11:07 DEBUG                log_result_data:     
2024/05/18 12:11:07 DEBUG                log_result_data:     ------ Shared Memory Segments --------
2024/05/18 12:11:07 DEBUG                log_result_data:     key        shmid      owner      perms      bytes      nattch     status      
2024/05/18 12:11:07 DEBUG                log_result_data:     0x10242048 294922     daos_serve 660        958464     0                       
2024/05/18 12:11:07 DEBUG                log_result_data:     0x37173f8f 98321      daos_serve 660        331248     0                       
2024/05/18 12:11:07 DEBUG                log_result_data:     
2024/05/18 12:11:07 DEBUG                log_result_data:   wolf-141 (rc=0):
2024/05/18 12:11:07 DEBUG                log_result_data:     
2024/05/18 12:11:07 DEBUG                log_result_data:     ------ Shared Memory Segments --------
2024/05/18 12:11:07 DEBUG                log_result_data:     key        shmid      owner      perms      bytes      nattch     status      
2024/05/18 12:11:07 DEBUG                log_result_data:     0x7e65f12e 32798      daos_serve 660        331248     0                       
2024/05/18 12:11:07 DEBUG                log_result_data:     0x6fe3b5c7 32808      daos_serve 660        331248     0                       
2024/05/18 12:11:07 DEBUG                log_result_data:     
2024/05/18 12:11:07 DEBUG                log_result_data:   wolf-139 (rc=0):
2024/05/18 12:11:07 DEBUG                log_result_data:     
2024/05/18 12:11:07 DEBUG                log_result_data:     ------ Shared Memory Segments --------
2024/05/18 12:11:07 DEBUG                log_result_data:     key        shmid      owner      perms      bytes      nattch     status      
2024/05/18 12:11:07 DEBUG                log_result_data:     0xf6e42b36 163847     daos_serve 660        170448     0                       
2024/05/18 12:11:07 DEBUG                log_result_data:     0xb7326766 98321      daos_serve 660        331248     0                       
2024/05/18 12:11:07 DEBUG                log_result_data:     
2024/05/18 12:11:07 DEBUG                log_result_data:   wolf-140 (rc=0):
2024/05/18 12:11:07 DEBUG                log_result_data:     
2024/05/18 12:11:07 DEBUG                log_result_data:     ------ Shared Memory Segments --------
2024/05/18 12:11:07 DEBUG                log_result_data:     key        shmid      owner      perms      bytes      nattch     status      
2024/05/18 12:11:07 DEBUG                log_result_data:     0x6fe3b5c7 65571      daos_serve 660        331248     0                       
2024/05/18 12:11:07 DEBUG                log_result_data:     0x9b3ad358 65582      daos_serve 660        331248     0                       
2024/05/18 12:11:07 DEBUG                log_result_data:     0x10242049 327742     daos_serve 660        1277952    0                       
2024/05/18 12:11:07 DEBUG                log_result_data:     0x10242048 327743     daos_serve 660        1277952    0                       
2024/05/18 12:11:07 DEBUG                log_result_data:     
2024/05/18 12:11:07 DEBUG _remove_shared_memory_segments: Clearing shared memory segment 0x10242048 on wolf-[138,140]:
2024/05/18 12:11:07 DEBUG                     run_remote: Running on wolf-[138,140] with a 120 second timeout: sudo ipcrm -M 0x10242048
2024/05/18 12:11:07 DEBUG                log_result_data:   wolf-[138,140] (rc=0): <no output>
2024/05/18 12:11:07 DEBUG _remove_shared_memory_segments: Clearing shared memory segment 0x10242049 on wolf-140:
2024/05/18 12:11:07 DEBUG                     run_remote: Running on wolf-140 with a 120 second timeout: sudo ipcrm -M 0x10242049
2024/05/18 12:11:07 DEBUG                log_result_data:   wolf-140 (rc=0): <no output>
2024/05/18 12:11:07 DEBUG            _remove_mount_point: Clearing mount point /mnt/daos on wolf-138:
2024/05/18 12:11:07 DEBUG                     run_remote: Running on wolf-138 with a 120 second timeout: sudo umount -f /mnt/daos
2024/05/18 12:11:08 DEBUG                log_result_data:   wolf-138 (rc=0): <no output>
2024/05/18 12:11:08 DEBUG                     run_remote: Running on wolf-138 with a 120 second timeout: sudo rm -fr /mnt/daos
2024/05/18 12:11:08 DEBUG                log_result_data:   wolf-138 (rc=0): <no output>
2024/05/18 12:11:08 DEBUG                _generate_certs: --------------------------------------------------------------------------------
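Putting the pieces together, the per-host cleanup in the excerpt above runs in a fixed order: remove stale shared memory segments, then force-unmount the tmpfs mount point and delete its directory. A sketch of that command sequence (the helper name is an assumption; the commands are taken from the log):

```python
def cleanup_commands(mount_point, shm_keys):
    """Build the cleanup commands in the order the job.log shows:
    ipcrm for each stale segment key, then umount and rm of the mount
    point (hypothetical helper)."""
    commands = [f"sudo ipcrm -M {key}" for key in shm_keys]
    commands.append(f"sudo umount -f {mount_point}")
    commands.append(f"sudo rm -fr {mount_point}")
    return commands

print(cleanup_commands("/mnt/daos", ["0x10242048"]))
# -> ['sudo ipcrm -M 0x10242048', 'sudo umount -f /mnt/daos', 'sudo rm -fr /mnt/daos']
```

In the real run each command is executed via run_remote on only the hosts where the stale segment or mount point was actually detected, as the wolf-138 / wolf-140 lines above illustrate.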

phender avatar May 20 '24 15:05 phender