
transient sets aggregated with push=onchange not leaving L1 after deletion on L0

Open • baallan opened this issue 1 year ago • 14 comments

The pid sampler (linux_proc_sampler) creates sets for the pids it is notified of and deletes those sets when the monitored pid goes away. It appears that roughly a third of these sets (388/1200) never get cleaned up on the L1 aggregator, even after 1300 seconds.
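For the record, this is roughly how I'm counting the leftover sets on the L1; the aggregator host, port, transport, and set-name pattern below are placeholders for this cluster, not exact production values:

# count pid-related sets still advertised by the L1 aggregator
# (admin1, 411, sock, and the grep pattern are assumptions for this setup)
ldms_ls -h admin1 -p 411 -x sock | grep -c linux_proc_sampler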

baallan avatar Mar 16 '23 05:03 baallan

Will run more diagnostics to see if this looks a lot like #771 at the L1.

baallan avatar Mar 16 '23 06:03 baallan

The symptom is repeatable, but somewhat arbitrary:

  • run the pid sampler on 20 nodes.
  • launch 36 ranks/node of mpiGraph using mpiexec or srun.
  • the job completes; the tracked pids are all gone.
  • on 2 arbitrary nodes (which nodes varies from run to run), the 36 local pid-related sets are deleted by the sampler but hang in the deleting state according to set_stats.
  • also, for those same nodes, the messages announcing that the pids are gone are never delivered to the L1, even though all such messages are published at L0 before the corresponding set_deletes are performed. Stranger still, all subsequent message deliveries (e.g., new pid appearances) from the same 2 nodes also fail to reach the L1.

A look with gdb shows that all threads are sitting in epoll_wait at what looks like a 'normal' place in their respective task loops.
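For reference, the thread stacks were collected roughly like this (attaching to the running compute-node ldmsd; the pid lookup is only illustrative):

# dump a backtrace of every ldmsd thread, then detach
gdb -p "$(pidof ldmsd)" -batch -ex 'thread apply all bt'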

baallan avatar Mar 19 '23 22:03 baallan

It appears this issue has been resolved by #1153.

baallan avatar Apr 05 '23 01:04 baallan

Actually, it still shows up in further testing. @nichamon @tom95858 what's the new incantation to enable debug messages for just the set management code?

baallan avatar Apr 05 '23 15:04 baallan

@nichamon my launch of mpiGraph is:

#! /bin/bash
#SBATCH --time=60:00
#SBATCH -N 20
srun --mpi=pmi2 -N 20 -n $((36*20)) mpiGraph 16384 100 50

where each of the 20 nodes has dual 18-core processors, hence 36 * 20 tasks. The daemon setup is:

  • a spank plugin delivers notifications of new pids to the ldmsd on each compute node.
  • the compute-node ldmsds are all aggregated by a single aggregator ldmsd on an admin node.
  • the compute nodes create and delete sets as slurm-driven PIDs come and go (linux_proc_sampler); a hedged sketch of the corresponding L1 config appears below.
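For completeness, the L1 side is configured roughly as sketched below; the producer/updater names, hosts, ports, and intervals are placeholders rather than the exact production config, and the push=onchange updater is the part relevant to this issue (in practice there is one prdcr_add/prdcr_start pair per compute node):

# hedged sketch of the admin-node (L1) aggregator setup, shown for one
# compute node c1x8; values here are illustrative, not the real config
cat <<'EOF' | ldmsd_controller --host admin1 --port 411 --xprt sock
prdcr_add name=c1x8 host=c1x8 port=411 xprt=sock type=active interval=1000000
prdcr_start name=c1x8
updtr_add name=push_all interval=1000000 push=onchange
updtr_prdcr_add name=push_all regex=.*
updtr_start name=push_all
EOF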

For some subset of the nodes, something happens such that deleting_count stays permanently elevated, e.g. on c1x8:

c1x8: Name                 Count
c1x8: -------------------- ----------------
c1x8: active_count                       15
c1x8: deleting_count                    109
c1x8: mem_total_kb                    16384
c1x8: mem_free_kb                     15452
c1x8: mem_used_kb                       932
c1x8: set_load                            0
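That table is the set_stats output from the compute-node ldmsd; a minimal sketch of pulling it, assuming sock transport on port 411 (both placeholders):

# query set accounting on one compute-node ldmsd (host/port/xprt are placeholders)
echo set_stats | ldmsd_controller --host c1x8 --port 411 --xprt sock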

baallan avatar Apr 05 '23 20:04 baallan

@baallan Thanks for the info!

I assume there is no L2 aggregating from the L1.

nichamon avatar Apr 05 '23 21:04 nichamon

@baallan Thanks for the info!

I assume there is no L2 aggregating from the L1.

Correct

baallan avatar Apr 05 '23 21:04 baallan

@tom95858

Also, for those same nodes, the messages announcing that the pids are gone are never delivered to the L1, even though all such messages are published at L0 before the corresponding set_deletes are performed. Stranger still, all subsequent message deliveries (e.g., new pid appearances) from the same 2 nodes also fail to reach the L1.

That observation makes me think the root cause may not be in the delete-and-push path but rather in the transport path. Did you and Ben reproduce this today? What do you think?

nichamon avatar Apr 06 '23 04:04 nichamon

@nichamon with the new logging infrastructure, what's the syntax to turn on logging for just the transport (and maybe the set management) code, without any sampler plugin logging? Or do we still need a recompile with extra flags for the transport?

baallan avatar Apr 06 '23 14:04 baallan

@nichamon with the new logging infrastructure, what's the syntax to turn on logging for just the transport (and maybe the set management) code, without any sampler plugin logging? Or do we still need a recompile with extra flags for the transport?

With the current top of OVIS-4, everything is still the same regarding changing log levels. I'm incrementally creating patches from the refactored code.

When all of the refactored code is in the tree, you will be able to run 'loglevel regex=xprt.* level=DEBUG' to turn on the DEBUG messages of all transport layers, i.e., ldmsd, ldms, and Zap.
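Until then, the global form still applies; a minimal sketch of both invocations, assuming the current 'loglevel level=<LEVEL>' syntax and placeholder connection details (the regex form only works once the refactor is merged):

# today: raise the log level for everything (verbose)
echo 'loglevel level=DEBUG' | ldmsd_controller --host c1x8 --port 411 --xprt sock
# after the refactor: restrict DEBUG to the transport layers only
echo 'loglevel regex=xprt.* level=DEBUG' | ldmsd_controller --host c1x8 --port 411 --xprt sock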

nichamon avatar Apr 06 '23 15:04 nichamon

@baallan I'm making a patch and will create a pull request within 1-2 hours from now. I'll tag you when I have it for you to test.

nichamon avatar Apr 06 '23 15:04 nichamon

@nichamon @tom95858 This problem also reproduces with qlogic InfiniPath_QLE7340 adapters on toss3, the hardware on our tlcc2 systems, so the sets stuck in deleting mode are not specific to omnipath RDMA. The dev system here with that hardware is called btaco. The version running is near the top of the tree (sum 7dfaa).

baallan avatar May 05 '23 05:05 baallan

@nichamon @tom95858 This problem also reproduces with qlogic InfiniPath_QLE7340 adapters on toss3, the hardware on our tlcc2 systems, so the sets stuck in deleting mode are not specific to omnipath RDMA. The dev system here with that hardware is called btaco. The version running is near the top of the tree (sum 7dfaa).

@baallan Thanks for the info. I'll try to send you a patch to get more diagnostic messages, or we can set up a session to work on it together, hopefully next week.

nichamon avatar May 05 '23 16:05 nichamon

@nichamon if you have a branch with more debugging stuff to try (possibly including additional -D flags at compile time), let me know.

baallan avatar May 08 '23 15:05 baallan