Kenneth E. Jansen

Results: 142 comments by Kenneth E. Jansen

tls_get_addr has `#2 0x0000150109daceda in MPIR_Allreduce`, as do MPIDI_POSIX_eager_recv_begin and MPIR_Progress_hook_exec_all, so perhaps that means one of the `grep -v` terms is the villain. I don't know these functions, so...
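For concreteness, a hypothetical sketch of the kind of `grep -v` filter being referred to; the actual exclusion terms are not shown in this excerpt, so the terms and file name below are placeholders:
```
# Hypothetical sketch: exclude frames we assumed were benign from the
# collected backtraces. If one of the excluded terms is the real culprit,
# this filter hides it, which is the concern raised above.
grep -v -e 'placeholder_term_1' -e 'placeholder_term_2' all_backtraces.txt
```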

96 nodes were able to read that same file, run 1000 steps, and write a new one (297 GB in 12.4 seconds according to the PETSc VecView timers). Currently running a second...
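For reference, that works out to roughly 297 GB / 12.4 s ≈ 24 GB/s of aggregate write bandwidth.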

But we won't get write performance numbers from that run because: `ping failed on x4217c6s6b0n0: No reply from x4309c7s1b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov after 97s`

> The `shmem` and `pthreads` parts in the backtraces stick out to me right now. That seems like some weird race condition to me, but within some kind of multithreading...

Second attempt at 192 nodes with the input written by the 96-node run revives our original error for this thread: `[939]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read...`

I am hopefully not jinxing it, but so far larger process counts are more successful with the minus-one striping choice. The 1536-node case has not run yet, but the 768...
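For anyone unfamiliar with the minus-one choice: a Lustre stripe count of -1 stripes new files across all available OSTs. A minimal sketch, with a hypothetical directory path:
```
# Stripe all new files created in this directory across every OST (-c -1);
# existing files keep the striping they were created with.
lfs setstripe -c -1 /lus/gecko/projects/PHASTA_aesp_CNDA/checkpoints
```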

My second battery of jobs is running and still so far so good, with no read or write failures. Since we didn't really change code and only changed the Lustre...
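Since striping was the only change, a quick way to confirm what striping a file actually received (a standard Lustre command; the file path here is hypothetical):
```
# Report the stripe count, stripe size, and OST layout of a restart file
lfs getstripe /lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2/restart.cgns
```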

With help from Tim Williams, the mystery of why my 1536-node jobs were not running is resolved:
```
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ> /home/zippy/bin/pu_nodeStat EarlyAppAccess
PARTITION: LustreApps (EarlyAppAccess)
------------------------
Nodes Status
----- ------
...
```

WOOHOOO. We are running on 1124 nodes, 13488 tiles, and thus have finally broken the 10k GPU barrier (previously CGNS+HDF5+Lustre were erroring out on the read of our inputs). No...
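(Each Aurora node carries six Intel Data Center GPU Max devices with two tiles each, i.e. 12 tiles per node, so 1124 nodes × 12 = 13488 tiles.)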

```
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep VecView JZ1536Nodes1215_240108.o621462
VecView  2 1.0 2.2446e+01 1.0 9.87e+05 2.0 5.2e+05 1.6e+04 3.0e+01  2 0 0 0 0  2 0 0 0 0  574  7316723234  0 0.00e+00 0
```
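If I am reading the PETSc `-log_view` columns correctly, the fields after the event name are the call count and its max/min ratio followed by the max time in seconds, so the two VecView calls took about 2.2446e+01 ≈ 22.4 s combined.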