ovis icon indicating copy to clipboard operation
ovis copied to clipboard

prdcr_stop/prdcr_del issuing messages

Open baallan opened this issue 2 years ago • 9 comments

When i do a prdcr_stop/prdcr_del on a producer, i get a message:

The producer is in use.

It appears __ldmsd_prdcr_stop does:

        if (!prdcr->xprt)
                prdcr->conn_state = LDMSD_PRDCR_STATE_STOPPED;

but afaict none of the code before this does

prdcr->xprt = NULL;

Should the reassignment of conn_state be unconditional?

baallan avatar Jun 15 '22 21:06 baallan

@narategithub any thoughts on this?

baallan avatar Jun 15 '22 21:06 baallan

@nick-enoent I expect that if maestro is managing producers, then this ldms issue will affect maestro in unhappy ways.

baallan avatar Jun 16 '22 16:06 baallan

Don't know if there's any protocol relation that would affect #771 also.

baallan avatar Jun 16 '22 17:06 baallan

@baallan are you doing ldmsd_controller commands while maestro is running?

tom95858 avatar Sep 14 '22 05:09 tom95858

@tom95858 no, i'm seeing this with configurations that don't involve maestro in any way. Basically, if i configure an aggregator and then later send the statement list that inverts the configuration to stop/remove everything, I get the message mentioned above. As it happens, the teardown script is piped through ldmsctl rather than ldmsd_controller.

reproducer is (assuming configured without --disable-scripts):

ldms-static-test.sh many

then inspect output: ldmstest/many/logs/teardown.1.txt and inputs:

  • the setup ldmstest/many/run/conf.1

  • the teardown ldmstest/many/run/revconf.1

baallan avatar Sep 14 '22 19:09 baallan

@baallan I cannot reproduce the error. All daemons exited gracefully. I looked at ldmstest/many/run/revconf.1 prdcr_del is sent right after prdcr_stop to the aggregator. A possibility that resulted in the "prdcr is in use" is that ldmsd processed prdcr_del before it processed the DISCONNECTED event. However, I have never run into this situation, even when I tested with ldms-static-test.sh many.

By the way, I could not run ldms-static-test.sh many successfully without https://github.com/ovis-hpc/ovis/pull/1019.

$ grep -rn 'in use' ldmstest/* returned nothing.

Below is a tail output of 1.txt. It shows the producer localhost2 and localhost3 were stopped properly.

Thu Sep 15 22:03:46 2022: INFO      : prdcr_connect_cb:578 Producer localhost3 (sock localhost:61079) conn_state: 4 LDMSD_PRDCR_STATE_STOPPING event type: DISCONNECTED
Thu Sep 15 22:03:46 2022: INFO      : Producer localhost3 is disconnected (sock localhost:61079)
Thu Sep 15 22:03:46 2022: INFO      : Deleting producer set localhost3/clock from producer localhost3
Thu Sep 15 22:03:46 2022: INFO      : Deleting producer set localhost3/job_info from producer localhost3
Thu Sep 15 22:03:46 2022: INFO      : Deleting producer set localhost3/meminfo from producer localhost3
Thu Sep 15 22:03:46 2022: INFO      : Deleting producer set localhost3/procstat from producer localhost3
Thu Sep 15 22:03:46 2022: INFO      : Deleting producer set localhost3/vmstat from producer localhost3
Thu Sep 15 22:03:46 2022: INFO      : prdcr_connect_cb:676 Producer (after reset) localhost3 (sock localhost:61079) conn_state: 0 LDMSD_PRDCR_STATE_STOPPED
Thu Sep 15 22:03:46 2022: INFO      : prdcr_connect_cb:578 Producer localhost2 (sock localhost:61078) conn_state: 4 LDMSD_PRDCR_STATE_STOPPING event type: DISCONNECTED
Thu Sep 15 22:03:46 2022: INFO      : Producer localhost2 is disconnected (sock localhost:61078)
Thu Sep 15 22:03:46 2022: INFO      : Deleting producer set localhost2/clock from producer localhost2
Thu Sep 15 22:03:46 2022: INFO      : Deleting producer set localhost2/job_info from producer localhost2
Thu Sep 15 22:03:46 2022: INFO      : Deleting producer set localhost2/meminfo from producer localhost2
Thu Sep 15 22:03:46 2022: INFO      : Deleting producer set localhost2/procstat from producer localhost2
Thu Sep 15 22:03:46 2022: INFO      : Deleting producer set localhost2/vmstat from producer localhost2
Thu Sep 15 22:03:46 2022: INFO      : prdcr_connect_cb:676 Producer (after reset) localhost2 (sock localhost:61078) conn_state: 0 LDMSD_PRDCR_STATE_STOPPED
Thu Sep 15 22:03:46 2022: INFO      : User requested exit.
Thu Sep 15 22:03:46 2022: DEBUG     : mmap use at exit: mm_stat: size=524288 grain=1024 chunks_free=1 grains_free=512 grains_largest=512 grains_smallest=512 bytes_free=524288 bytes_largest=524288 bytes_smallest=524288 bytes_used+holes=0
Thu Sep 15 22:03:46 2022: LDMSD_ LDMS Daemon exiting...status 0, 
Thu Sep 15 22:03:46 2022: LDMSD_ cleanup end.

nichamon avatar Sep 16 '22 03:09 nichamon

@nichamon With procstat fixed, i'm still sometimes seeing this producer message. much more frequently with valgrind in action than without, which suggests it is indeed some timing issue. What i have found the hard way is that any timing thing that valgrind provokes will also occur at cluster scale due to delays there. a patch to the 'many' script something like below is likely to aid in reproducing. (enabling valgrind).

diff --git a/ldms/scripts/examples/many b/ldms/scripts/examples/many
index 90d096a..504a0f2 100644
--- a/ldms/scripts/examples/many
+++ b/ldms/scripts/examples/many
@@ -1,11 +1,18 @@
 portbase=61076
 MESSAGE starting agg and two collectors
 JOBDATA $TESTDIR/job.data 1 2 3
-#vgon
-VGARGS="--track-origins=yes --leak-check=full --show-leak-kinds=all"
-LDMSD -p prolog.jobidinfo `seq 3`
-#vgoff
+VGARGS="--tool=drd"
+VGARGS="--track-origins=yes --leak-check=full --show-leak-kinds=definite"
+DAEMONS $(seq 3)
+vgon
+LDMSD -p prolog.jobidinfo 1
+LDMSD -p prolog.jobidinfo `seq 2 3`
+vgoff
 LDMS_LS 1

baallan avatar Sep 16 '22 06:09 baallan

@baallan The procstat fix is unrelated to the in-use producer message. However, without the fix, your test cannot run successfully.

As I mentioned, it is possible that users will see the "the producer is in use" message if the producer is not ready to be deleted. It will be ready when there are no updaters to refer to it; the producer has been stopped, and it is aware that the connection to the server has been disconnected. In your case, the first and second conditions are met, but the last condition is not. prdcr_del will succeed later after LDMSD has received and processed the DISCONNECTED event. It is not that you cannot delete the producer at all.

What scenario do you try to test from this particular section of the test logic?

nichamon avatar Sep 16 '22 15:09 nichamon

The scenario being tested is that the aggregator is sent the revconf.1 script to tear it down, then the 2 samplers are deconfigured by being sent (e.g. many/run/revconf.2). Are you saying i need a sleep for processing somewhere in the following script? Or that samplers must all be stopped first to meet the cited conditions?

revconf.1

strgp_stop name=store_csv_jobi
strgp_prdcr_del name=store_csv_jobi regex=.*
strgp_del name=store_csv_jobi

strgp_stop name=store_csv_clock
strgp_prdcr_del name=store_csv_clock regex=.*
strgp_del name=store_csv_clock

strgp_stop name=store_csv_vmstat
strgp_prdcr_del name=store_csv_vmstat regex=.*
strgp_del name=store_csv_vmstat

strgp_stop name=store_csv_procstat
strgp_prdcr_del name=store_csv_procstat regex=.*
strgp_del name=store_csv_procstat

strgp_stop name=store_csv_meminfo
strgp_prdcr_del name=store_csv_meminfo regex=.*
strgp_del name=store_csv_meminfo

term name=store_csv

updtr_stop name=allhosts
updtr_prdcr_del name=allhosts regex=.*
updtr_del name=allhosts

prdcr_stop name=localhost3
prdcr_del name=localhost3

prdcr_stop name=localhost2
prdcr_del name=localhost2
stop name=jobid
term name=jobid
daemon_exit

baallan avatar Sep 19 '22 15:09 baallan