mdsplus icon indicating copy to clipboard operation
mdsplus copied to clipboard

Thread-related problems in distributed client configuration

Open GabrieleManduchi opened this issue 4 years ago • 20 comments

I am facing a new problem wit the latest alpha in distributed client configuration. The problem is easily reproducible using jTraverser by defining the tree name in the command line, i.e.

jTraverser

and then performing any operation (e.g explode a subtree). The following error message is issued:

Error in SendArg: mode = 6, status = 65554

If jTraverser command is issued without arguments and the tree is open using the GUI everything works.

I have tracked the problem, and this happens because in the former case the tree is open in one thread and the other tree operations are performed on a separate thread.

More in detail, it happens that mdsipshr SendArg() routine calls FindConnectionWithLock() which in turn via macro MDSIPTHREADSTATIC_INIT and a sequence of nested macros (honestly this usage of macro makes the code very hardly readable, at least for me!!) ends in calling MDSplusThreadStatic(null). Since the thread is now different a new MDSplusThreadStatic_t instance is created because pthread_setspecific() returns null in this case (first time called for this thread).

However in the newly created instance, the list of existing connections (only one in our case) is lost and therefore FindConnectionWithLock fails and returns null, this causing the error.

This is a blocking issue, so we need to solve it in order to make the new alpha usable.

GabrieleManduchi avatar Jul 29 '21 13:07 GabrieleManduchi

I see that github removed the angle brackets.....when I say calling jTraverser with tree in command line I mean calling, for example:

jTraverser my_tree shot

GabrieleManduchi avatar Jul 29 '21 13:07 GabrieleManduchi

hm looks like we would need to teach jTraverser to do th einitial treeopen as event so it would use teh same thread.. not sure how though.. there is an java method to invoce jobs though maybe taht would help.. so far i would suggest to use jTraverser2? does that one work? it should

zack-vii avatar Jul 29 '21 13:07 zack-vii

The java mdsobjects should be thread safe, so it should work even in different threads. I will check.

GabrieleManduchi avatar Jul 29 '21 13:07 GabrieleManduchi

jTraverser uses java mdsobjects

GabrieleManduchi avatar Jul 29 '21 13:07 GabrieleManduchi

As a fix it would make sense to open a connection in mdsobjects in case findconnection fails.

This is a new requirement but would fix the issue on a lower level.

Background:

We had to change the mdsip behavior to have one connection per thread to solve a deeper problem (race condition through non atomic file access).

jTaverser2 does not and is therefor is not affected..

zack-vii avatar Jul 29 '21 13:07 zack-vii

Ok, things are now clearer (jTraverser and java mdsobjects correctly passes tree context among threads). This requires however a change in mdsipshr

GabrieleManduchi avatar Jul 29 '21 13:07 GabrieleManduchi

In any case this must be fixed because it breaks thread safe tree operations. There are many applications relying on it.

GabrieleManduchi avatar Jul 29 '21 13:07 GabrieleManduchi

the problem is the connection that goes with the context.. who is resolving the connection. mdsobjects? if you migrate a context to a different thread you need to reopen it.. or it is not threadsafe.. same as in python where you do need to Tree.copy() the tree

zack-vii avatar Jul 29 '21 14:07 zack-vii

For jTraverser this is not a problem, but I fear that there may be other applications relying on it. When was this change introduced?

GabrieleManduchi avatar Jul 29 '21 14:07 GabrieleManduchi

a month ago or so.. with #2288

zack-vii avatar Jul 29 '21 14:07 zack-vii

oh april

zack-vii avatar Jul 29 '21 14:07 zack-vii

ok, jTraverser has been fixed, before reporting to alpha I need to verify the other java applications

GabrieleManduchi avatar Jul 29 '21 14:07 GabrieleManduchi

Fixed the java libraries. However there is an impact also on the devices handling data streams carried out by separate threads. We have several devices this way. Ported just a couple, as soon as everything works again I will issue a PR

GabrieleManduchi avatar Aug 03 '21 09:08 GabrieleManduchi

Is there a final PR for this, so we can close it ?

joshStillerman avatar Nov 08 '21 20:11 joshStillerman

Gabriele, Timo , as Josh wrote above, is there a PR on this issue? We are now seeing the same problem when using streaming devices (which create their own threads, like acq2106_423st.py) when using the distributed client configuration.

santorofer avatar Nov 16 '21 16:11 santorofer

safest way would be to have each thread open their own connection otherwise you run into the same problems that caused the change in the first place. there is a chance for a deadlock, when threaded access to the same tree occurs through the same connection. Only way to fix this otherwise would be to make the mdsip interface asynchronous.

zack-vii avatar Nov 17 '21 09:11 zack-vii

The problem has been fixed for C++ mdsobjects in PR #2394. In the fix, Tree objects keeps track of the current thread and if required re-open the tree (creating a new context). In this case Tree and TreeNode instances can be safely exchanged among threads.

GabrieleManduchi avatar Nov 17 '21 10:11 GabrieleManduchi

However this is valid for the C++ interface, not sure about the python interface.

GabrieleManduchi avatar Nov 17 '21 10:11 GabrieleManduchi

Forgot to say, fixed also for Java interface

GabrieleManduchi avatar Nov 17 '21 10:11 GabrieleManduchi

Timo -

  I guess we should change the python tree copy so that it knows that the tree is not open?  I have this feeling that this should actually be 'fixed' someplace in treeshr, In the current state, the thread thinks it has the tree open, and thinks there is a connection but there is not.

What do you think?

-Josh

On 11/17/21 4:16 AM, Timo Schroeder wrote:

safest way would be to have each thread open their own connection otherwise you run into the same problems that caused the change in the first place. there is a chance for a deadlock, when threaded access to the same tree occurs through the same connection. Only way to fix this otherwise would be to make the mdsip interface asynchronous.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/MDSplus/mdsplus/issues/2360#issuecomment-971384892, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABY5AZMDZXYH54XOCLV635TUMNXGBANCNFSM5BGNRVEQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

-- Joshua Stillerman Research Engineer MIT Plasma Science and Fusion Center 617.253.8176 @.***

joshStillerman avatar Nov 17 '21 13:11 joshStillerman

This does not actually work. The tree name and shot number needs to be cached in the init (which runs in the parent thread) and then instantiated in the run (which is the new thread)

joshStillerman avatar Mar 30 '23 19:03 joshStillerman