
[CI test only] v5.0.x - Scale test PR

Open gpaulsen opened this issue 3 years ago • 21 comments

Testing scale launch on v5.0.x via the trigger mechanism described at https://github.com/open-mpi/ompi/wiki/PRJenkins#ibm-ci-scale-testing-adjustment-triggers.

bot:notacherrypick

gpaulsen avatar Jan 14 '22 19:01 gpaulsen

bot:ibm:scale:test

gpaulsen avatar Jan 14 '22 19:01 gpaulsen

bot:notacherrypick

gpaulsen avatar Jan 14 '22 19:01 gpaulsen

bot:ibm:scale:128:test

gpaulsen avatar Jan 17 '22 18:01 gpaulsen

bot:ibm:scale:128:test

gpaulsen avatar Jan 27 '22 19:01 gpaulsen

bot:ibm:scale:128:test

jjhursey avatar Jan 27 '22 20:01 jjhursey

Thanks for fixing the back-end regex parsing for scale testing up to 128 virtual nodes.

gpaulsen avatar Jan 28 '22 15:01 gpaulsen
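
For readers unfamiliar with the trigger phrases used in this thread, here is a minimal sketch of how a comment such as `bot:ibm:scale:128:test` could be parsed. The pattern and the `parse_triggers` helper are hypothetical, modeled only on the phrases that appear in these comments; the actual back-end regex used by the IBM CI is not shown in this thread and may differ.

```python
import re

# Hypothetical pattern "bot:ibm:<key>[:<count>]:test", inferred only from the
# trigger phrases visible in this thread (scale, nodes, ppn). This is an
# assumption, not the CI's real regex.
TRIGGER_RE = re.compile(r"\bbot:ibm:(scale|nodes|ppn)(?::(\d+))?:test\b")


def parse_triggers(comment):
    """Return {key: count or None} for each IBM CI trigger found in a comment."""
    found = {}
    for key, count in TRIGGER_RE.findall(comment):
        # An empty count means the trigger was given without a number.
        found[key] = int(count) if count else None
    return found


if __name__ == "__main__":
    print(parse_triggers("bot:ibm:scale:128:test"))
    print(parse_triggers("bot:ibm:retest bot:ibm:nodes:32:test bot:ibm:ppn:2:test"))
```

Phrases that do not fit the assumed pattern, such as `bot:notacherrypick` or `bot:ibm:retest`, are simply ignored by this sketch.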

bot:ibm:scale:32:test

jjhursey avatar Feb 21 '22 21:02 jjhursey

bot:ibm:scale:128:test

gpaulsen avatar Mar 21 '22 19:03 gpaulsen

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/03d1d559d97d7feed001516b2cd44849

ibm-ompi avatar Mar 21 '22 20:03 ibm-ompi

bot:aws:retest

gpaulsen avatar Mar 22 '22 14:03 gpaulsen

Most of the scale tests at 128 pseudo-nodes passed, but ring_c failed with a timeout after 300s. It's unclear why. I'll retry at 64 and keep an eye on it.

       Run Scale Examples : timeout --preserve-status -k 310s 310s  /workspace/exports/ompi/bin/mpirun --hostfile /workspace/hostfile.txt --npernode 2  --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca pml ob1 --mca osc ^ucx --mca btl tcp,vader,self ring_c
                          : Failed [20] on ring_c with return code 256 (0:05:00)
  Destroy Virtual Cluster : ...
                          : Passed (0:02:36)
                       -- : Some tests did not pass... (0:26:15)
---------------------------------------------------------------------------
######################################################################
########## Run Scale Examples
######################################################################
########################################
ssh c656f6n02 timeout --preserve-status -k 300s 300s   docker exec -i --env WORKSPACE=/workspace -u 59674:59674 -w /workspace/ompi-src/examples ee90ba9915fe 'timeout --preserve-status -k 310s 310s  /workspace/exports/ompi/bin/mpirun --hostfile /workspace/hostfile.txt --npernode 2  --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca pml ob1 --mca osc ^ucx --mca btl tcp,vader,self ring_c'
########################################
Process 0 sending 10 to 1, tag 201 (256 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4

gpaulsen avatar Mar 22 '22 14:03 gpaulsen
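
To put the truncated log above in perspective, the following is a rough, MPI-free sketch of the token-ring pattern the output suggests: rank 0 sends a value around the ring and decrements it each time the token returns. The `simulate_ring` function is hypothetical, and the assumption that the zero-valued token makes one final lap so every rank can exit is a guess at the example's behavior, not something the log confirms.

```python
def simulate_ring(num_procs, start_value):
    """Simulate a token ring in pure Python (no MPI).

    Returns (hops, decremented), where hops is the number of point-to-point
    sends and decremented lists the values rank 0 would print.
    """
    hops = 0
    decremented = []
    value = start_value
    rank = 0
    while True:
        # The current holder forwards the token to its right-hand neighbor.
        rank = (rank + 1) % num_procs
        hops += 1
        if rank == 0:
            if value == 0:
                # Assumed behavior: the zero token completes one last lap.
                break
            value -= 1
            decremented.append(value)
    return hops, decremented


if __name__ == "__main__":
    hops, seen = simulate_ring(num_procs=256, start_value=10)
    print(hops, seen)
```

Under this model, the six "decremented value" lines in the log (9 down to 4) would correspond to six completed laps, or 6 × 256 = 1536 sends, before the timeout fired.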

bot:ibm:scale:64:test

gpaulsen avatar Mar 22 '22 14:03 gpaulsen

64 worked well. Let's try 128 again.

gpaulsen avatar Mar 24 '22 15:03 gpaulsen

bot:ibm:scale:128:test

gpaulsen avatar Mar 24 '22 15:03 gpaulsen

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/f7a998cd453cfb31eaf89852d654cd41

ibm-ompi avatar Mar 24 '22 15:03 ibm-ompi

bot:ibm:scale:64:test

gpaulsen avatar May 12 '22 15:05 gpaulsen

bot:ibm:scale:64:test

gpaulsen avatar Jun 16 '22 14:06 gpaulsen

I just rebased to the latest v5.0.x, along with the latest submodule pointers. Once this passes CI, I'll rerun scale testing.

gpaulsen avatar Jun 16 '22 14:06 gpaulsen

bot:ibm:retest bot:ibm:nodes:32:test bot:ibm:ppn:2:test

jjhursey avatar Jul 12 '22 15:07 jjhursey

bot:ibm:scale:64:test

gpaulsen avatar Aug 25 '22 15:08 gpaulsen

bot:ibm:scale:64:test

gpaulsen avatar Aug 30 '22 18:08 gpaulsen