Kenneth E. Jansen

Results: 142 comments by Kenneth E. Jansen

tls_get_addr has `#2 0x0000150109daceda in MPIR_Allreduce`, as do MPIDI_POSIX_eager_recv_begin and MPIR_Progress_hook_exec_all, so perhaps that means one of the `grep -v` terms is the villain. I don't know these functions, so...
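For concreteness, a hypothetical sketch of the kind of `grep -v` filter being referred to; the actual exclusion terms are not shown in this excerpt, so the terms and file name below are placeholders:
```
# Hypothetical sketch: exclude frames we assumed were benign from the
# collected backtraces. If one of the excluded terms is the real culprit,
# this filter hides it, which is the concern raised above.
grep -v -e 'placeholder_term_1' -e 'placeholder_term_2' all_backtraces.txt
```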

96 nodes were able to read that same file, run 1000 steps, and write a new one (297 GB in 12.4 seconds according to the PETSc VecView timers). Currently running a second...
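For reference, that works out to roughly 297 GB / 12.4 s ≈ 24 GB/s of aggregate write bandwidth.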

But we won't get write performance numbers from that run because: `ping failed on x4217c6s6b0n0: No reply from x4309c7s1b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov after 97s`

> The `shmem` and `pthreads` parts in the backtraces stick out to me right now. That seems like some weird race condition to me, but within some kind of multithreading...

Second attempt at 192 nodes with the input written by the 96-node run revives our original error for this thread: `[939]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read...`

I am hopefully not jinxing it, but so far larger process counts are more successful with the minus-one striping choice. The 1536-node case has not run yet, but the 768...
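For anyone unfamiliar with the minus-one choice: a Lustre stripe count of -1 stripes new files across all available OSTs. A minimal sketch, with a hypothetical directory path:
```
# Stripe all new files created in this directory across every OST (-c -1);
# existing files keep the striping they were created with.
lfs setstripe -c -1 /lus/gecko/projects/PHASTA_aesp_CNDA/checkpoints
```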

My second battery of jobs is running and still so far so good, with no read or write failures. Since we didn't really change code and only changed the Lustre...
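Since striping was the only change, a quick way to confirm what striping a file actually received (a standard Lustre command; the file path here is hypothetical):
```
# Report the stripe count, stripe size, and OST layout of a restart file
lfs getstripe /lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2/restart.cgns
```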

With help from Tim Williams, the mystery of why my 1536-node jobs were not running is resolved:
```
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ> /home/zippy/bin/pu_nodeStat EarlyAppAccess
PARTITION: LustreApps (EarlyAppAccess)
------------------------
Nodes Status
----- ------
...
```

WOOHOOO. We are running on 1124 nodes, 13488 tiles, and thus have finally broken the 10k GPU barrier (previously CGNS+HDF5+Lustre were erroring out on the read of our inputs). No...
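(Each Aurora node carries six Intel Data Center GPU Max devices with two tiles each, i.e. 12 tiles per node, so 1124 nodes × 12 = 13488 tiles.)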

```
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep VecView JZ1536Nodes1215_240108.o621462
VecView  2 1.0 2.2446e+01 1.0 9.87e+05 2.0 5.2e+05 1.6e+04 3.0e+01  2 0 0 0 0  2 0 0 0 0  574  7316723234  0 0.00e+00 0
```
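If I am reading the PETSc `-log_view` columns correctly, the fields after the event name are the call count and its max/min ratio followed by the max time in seconds, so the two VecView calls took about 2.2446e+01 ≈ 22.4 s combined.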