mdsplus
mdsplus copied to clipboard
IDL - mdsisclient can't check against socket=0 causing further issues when set_database claims !mdsdb_socket=0
Affiliation
DIII-D
Version(s) Affected
Current / Alpha Branch
Platform
Linux (RHEL8), IDL API (original bug), Python API (introduced bug)
Describe the bug
If set_database
is called with the intention to proxy through a mdshost, then multiple issues occur if set_database
receives the first socket id inside the IDL session (i.e. !MDSDB_SOCKET=0
).
This appears to occur because:
- MDSplus sockets allow for a socket id=0.
- Multiple IDL functions use
mdsisclient
to check a socket passed via keyword. - The
mdsisclient
function then relies onkeyword_set(socket)
. - The IDL
keyword_set
function returnsfalse
if defined and the value is zero. - The
mdsisclient
function does fall through whenkeyword_set(socket)
returnsfalse
, but... only checks against!MDS_SOCKET
.
Thus weird failure states happen when set_database
claims socket id=0, resulting in mdsisclient
being unable to confirm the socket via keyword_set
, and then ultimately dsql
and mdsvalue
don't know how and/or where to execute SQL calls through TDI.
Examples
Note that my .sybase_login file for these examples define the mdshost='atlas-el8.gat.com', so we know whether a successful dsql call gets evaluated via the intended proxy (i.e. atlas-el8,gat.com), or via localhost (i.e. omega01.gat.com), or not at all.
- The mdsip connection to the mdshost gets opened but the dsql still executes from the localhost:
[flanagan@omega01 ~]$ idl
IDL> set_database,'d3drdb'
IDL> help,!MDS_SOCKET,!MDSDB_SOCKET
<Expression> INT = -1
<Expression> INT = 0
IDL> n = dsql('select host_name()',val) & print,val
omega01.gat.com
- A subsequent mdsconnect to any server will prevent dsql from being able to execute:
[flanagan@omega01 ~]$ idl
IDL> set_database,'d3drdb'
IDL> mdsconnect,'atlas.gat.com'
IDL> help,!MDS_SOCKET,!MDSDB_SOCKET
<Expression> INT = 1
<Expression> INT = 0
IDL> n = dsql('select host_name()',val) & print,val
% DSQL: SET_DATABASE must preceed any DSQL calls
% PRINT: Variable is undefined: VAL.
% Execution halted at: $MAIN$
- Until another set_database call claims a different non-zero socket id:
[flanagan@omega01 ~]$ idl
IDL> set_database,'d3drdb'
IDL> mdsconnect,'atlas.gat.com'
IDL> set_database,'d3drdb'
IDL> help,!MDS_SOCKET,!MDSDB_SOCKET
<Expression> INT = 1
<Expression> INT = 2
IDL> n = dsql('select host_name()',val) & print,val
atlas-el8.gat.com
Discussion
I tried fixing the logic in mdsisclient but it exposed other technical debt that I'm not eager to delve into.
Instead, I'm wondering if this scenario can be triaged by disallowing mdsplus sockets with id=0?
Would starting at socket id=1 be a simpler solution?
Yes, the suggested fix is apt to be the simplest solution to avoid this IDL issue with keyword arguments. I anticipate that the fix will be easy, and that most of the effort will be setting up a test environment to confirm that the fix is indeed robust. Stephen has just assigned this issue to me, so I will start working on it immediately.
Correction: the preceding post was too optimistic.
This bug is related to Issue #2580 and PR #2581 (i.e., that fix might have inadvertently caused this new bug). The investigation will thus be a bit more involved than initially imagined. The goal of course is to come up with a solution that fixes both IDL issues (and also doesn't break mdsip for non-IDL applications).
I believe this has already been fixed previously. I will try to track down what happened to the fix This was addressed by 6eec092f4d5fac80367ce5f31cbfd6e5604a06d5
Josh, Stephen and I have discussed both issues: #2580 (PR #2581) and this #2625. Although related, we realize that they are different issues. Investigation continues.
The PR #2626 fix (start connection ids at 1) works fine if all servers can be upgraded to that software. However, that is impractical for most sites, because they will still have many severs with old software that issue connection ids starting at zero. Thus PR #2626 had to be rolled back.
The fix for this issue must thus involve only changes to the IDL code, and also must accept zero as a valid connection ID.
The approach taken for the second version of the fix is to replace IDL's problematic keyword_set(keyword_arg)
statements with arg_present(keyword_arg)
. However this approach only works if the optional keyword argument is bound to a regular variable. If it is instead bound to a system variable, such as !MDSDB_SOCKET
then the arg_present()
fails to detect the optional keyword argument. Thus, the fix involves assigning system variables to regular variables, and then passing the regular variables via the optional keyword arguments.
This fix passes the entire MDSplus test suite. And also passed manual testing of the IDL code, as shown below.
Testing #2625
$ idl
IDL 8.6.0 (linux x86_64 m64).
(c) 2016, Exelis Visual Information Solutions, Inc., a subsidiary of Harris Corporation.
IDL> ; Testing issue #2625 with version 2.0 of the fix
IDL> set_database, 'logbook'
% Compiled module: SET_DATABASE.
% Compiled module: MDSVALUE.
% Compiled module: MDSCHECKARG.
% Compiled module: MDSISCLIENT.
% Compiled module: DSQL.
IDL> print, !MDSDB_SOCKET
0
IDL> n = dsql('select host_name()', var) & print, var
% Compiled module: EVALUATE.
some_server
IDL>
IDL>
IDL> mdsconnect, 'some_server'
% Compiled module: MDSDISCONNECT.
IDL> print, !MDS_SOCKET
1
IDL> n = dsql('select host_name()', var) & print, var
some_server
IDL>
IDL>
IDL> set_database, 'logbook'
IDL> print, !MDSDB_SOCKET
2
IDL> n = dsql('select host_name()', var) & print, var
some_server
IDL>
Testing #2580
$ idl
IDL 8.6.0 (linux x86_64 m64).
(c) 2016, Exelis Visual Information Solutions, Inc., a subsidiary of Harris Corporation.
IDL> ; Testing issue #2580
IDL> mdsconnect, 'some_server'
% Compiled module: MDSCONNECT.
% Compiled module: MDSDISCONNECT.
IDL> print, !MDS_SOCKET
0
IDL>
This fix also has a minor change to Connections.c
to run an integrity check of the connection data structures when adding a new connection.
Regarding the ongoing attempts to fix the original issue in this thread...
As of build alpha_release-7-139-52 ...
The IDL API cannot open trees:
IDL> mdsconnect,'atlas.gat.com'
% Compiled module: MDSCONNECT.
% Compiled module: MDSDISCONNECT.
IDL> mdsopen,'nb',190000
% Compiled module: MDSOPEN.
% Compiled module: MDSVALUE.
% Compiled module: MDSCHECKARG.
% Compiled module: MDSISCLIENT.
% Variable is undefined: SOCKET_VAR.
% Execution halted at: MDSISCLIENT 7 /home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/idl/mdsisclient.pro
% MDSVALUE 63 /home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/idl/mdsvalue.pro
% MDSOPEN 49 /home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/idl/mdsopen.pro
% $MAIN$
IDL>
The Py3 API cannot retrieve data:
>>> import MDSplus
>>> c = MDSplus.Connection('atlas.gat.com')
>>> c.openTree('nb',190000)
>>> print,c.get('TCL("show db",_output),_output')
(<built-in function print>, '000 NB shot: 190000 [\\NB::TOP] \n\n')
>>> data = c.get('\PINJ')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/connection.py", line 323, in get
return self.conn.get(exp, *args, **kwargs)
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/connection.py", line 217, in get
return retSerialized.deserialize()
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/mdsarray.py", line 339, in deserialize
return _dat.Data.deserialize(self)
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/mdsdata.py", line 852, in deserialize
return xd.value
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/descriptor.py", line 212, in value
return self.descriptor.value
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/descriptor.py", line 181, in value
return dtypeToClass[self.dtype].fromDescriptor(self)._setTree(self.tree)
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/compound.py", line 187, in fromDescriptor
for i in _ver.xrange(d.ndesc)]
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/compound.py", line 187, in <listcomp>
for i in _ver.xrange(d.ndesc)]
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/descriptor.py", line 50, in pointerToObject
return Descriptor(pointer)._setTree(tree).value
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/descriptor.py", line 181, in value
return dtypeToClass[self.dtype].fromDescriptor(self)._setTree(self.tree)
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/tree.py", line 1912, in fromDescriptor
return cls(_C.cast(d.pointer, _C.POINTER(_C.c_int32)).contents.value, d.tree)
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/tree.py", line 1203, in __new__
if str(node.usage) == "DEVICE":
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/tree.py", line 1567, in usage
return _scr.String(str(self.usage_str)[10:])
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/tree.py", line 332, in getter
return self._getNci(info)
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/tree.py", line 1310, in _getNci
self.ctx, self._nid, _C.byref(item)))
File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/mdsExceptions.py", line 94, in checkStatus
raise exception
MDSplus.mdsExceptions.TreeNOT_OPEN: %TREE-W-NOT_OPEN, Tree not currently open
Mention #2631
The v3 fix implements Sean's suggestion to open an extraneous connection. Which means that when the set_database
call is made, it will be assigned a non-zero socket, thus bypassing the flaw in the way IDL implements the keyword_set()
function.
NOTE: The full IDL API was NOT tested for this fix. However, the fix has a small footprint, thus it is expected (hoped) that the rest of the API will be unaffected. Furthermore, manual testing included additional API commands.
The attached file shows the manual testing that was done for IDL. (Long boring story, but the Python API was not tested.) 2625_v3_testing.txt
After reverting PR #2620's change to connection.py
, we tested the Python API as per Sean's example above. Rolling back the change appears to have fixed the Python API bug.
Python reverts are looking good:
>>> import MDSplus
>>> c = MDSplus.Connection('atlas.gat.com')
>>> c.openTree('nb',190000)
>>> data = c.get('\PINJ')
>>> print,data.data()
(<built-in function print>, array([0., 0., 0., ..., 0., 0., 0.], dtype=float32))
>>>
Now we are getting closer to the starting point, and this time...
A set_database does indeed go through the intended mdshost proxy (i.e. atlas-el8.gat.com for my tests):
IDL> set_database,'d3drdb'
IDL> help,!MDSDB_SOCKET,!MDS_SOCKET
<Expression> INT = 1
<Expression> INT = 0
IDL> n = dsql('select host_name()',val) & print,val
atlas-el8.gat.com
IDL> n = dsql('select shot from shots where shot=190000',val) & print,val
190000
IDL>
But there's still the original fail state where a subsequent mdsconnect breaks that existing mdshost proxied database connection:
IDL> set_database,'d3drdb'
% Compiled module: SET_DATABASE.
% Compiled module: MDSDISCONNECT.
% Compiled module: MDSVALUE.
% Compiled module: MDSCHECKARG.
% Compiled module: MDSISCLIENT.
% Compiled module: DSQL.
IDL> help,!MDSDB_SOCKET,!MDS_SOCKET
<Expression> INT = 1
<Expression> INT = 0
IDL> n = dsql('select host_name()',val) & print,val
% Compiled module: EVALUATE.
atlas-el8.gat.com
IDL> n = dsql('select shot from shots where shot=190000',val) & print,val
190000
IDL> mdsconnect,'atlas.gat.com'
IDL> help,!MDSDB_SOCKET,!MDS_SOCKET
<Expression> INT = 1
<Expression> INT = 2
IDL> n = dsql('select host_name()',val) & print,val
% MDSVALUE: Error evaluating expression
0
IDL> n = dsql('select shot from shots where shot=190000',val) & print,val
% MDSVALUE: Error evaluating expression
0
IDL>
~Hmm... why did both !MDSDB_SOCKET and !MDS_SOCKET increment off of that subsequent mdsconnect call?~ EDIT: Ignore that question. My mistake by incorrectly reading the output
Hi @sflanagan -- I am very embarrassed by yet another failure. This time, it was a failure of communication on my part to explain what is causing the database proxy to break.
The specific sequence you described was tested (but was not included in the 2625_v3_testing.txt
file from the ~11:20 p.m. EDT post on 3-Oct-2023). My experiments earlier this week with the sequence you use showed that IDL is flaky (or at least not behaving as I expect).
The issue is that using print
or help
to examine the system variables, !MDS_SOCKET
and !MDSDB_SOCKET
, can occasionally break the database proxy. Why IDL behaves that way is a mystery to me.
The only way I could consistently preserve the database proxy was to omit the print
and help
statements.
Please repeat your tests, but do not use print
or help
to examine !MDS_SOCKET
and !MDSDB_SOCKET
. And let Stephen and me know if the database proxy remains intact after executing the mdsconnect
.
If the proxy does work in that scenario, and if you are satisfied with the fix, then let Stephen know and he will merge it into the alpha branch.
This evening, I will repeat testing with your sequence of commands and will upload the test results in a new post.
Note: I am offline all day on Thursday because I am flying back to Colorado. (I was at MIT PSFC the first part of this week for meetings.). However, I will be working my usual hours on Friday.
I thank you for your patience.
Mention: @WhoBrokeTheBuild
EDIT / CLARIFICATION -- The odd side effects associated with print
and help
were seen with Fix v2. It might be that the arg_present()
function used in that fix is sensitive to the side effects. If so, then the other fixes won't be affected as they don't use arg_present()
.
@mwinkel-dev
I reread the .txt
file and now recognize the difference between our tests:
- Sean's test:
set_database
,mdsconnect
- Mark's test:
set_database
,mdsconnect
,set_database
There are dsql
calls in there, but I assume you get the drift...
Presumably the rdb connection is broken in both tests once that mdsconnect
call is made.
By issuing a second set_database
call, you're establishing a new connection to the rdb which appears to work.... as one would hope and expect, so that's good.
Users will, however, expect both use cases to work... which is reasonable. In fact, my test case is what we want users to actually do in their codebase(s) as a matter of best practice... connect once to the rdb, connect once to the mdsplus db, and leave both sockets open for reuse.
Please repeat your tests, but do not use print or help to examine !MDS_SOCKET and !MDSDB_SOCKET. And let Stephen and me know if the database proxy remains intact after executing the mdsconnect.
This idea has me curious.
Unfortunately, I am still able to reproduce the error.
See below...
Test 1 - revoke the help
call to examine the system variables:
IDL> set_database,'d3drdb'
% Compiled module: SET_DATABASE.
% Compiled module: MDSDISCONNECT.
% Compiled module: MDSVALUE.
% Compiled module: MDSCHECKARG.
% Compiled module: MDSISCLIENT.
% Compiled module: DSQL.
IDL> n = dsql('select host_name()',val) & print,val
% Compiled module: EVALUATE.
atlas-el8.gat.com
IDL> n = dsql('select shot from shots where shot=190000',val) & print,val
190000
IDL> mdsconnect,'atlas.gat.com'
IDL> n = dsql('select host_name()',val) & print,val
% MDSVALUE: Error evaluating expression
0
IDL> n = dsql('select shot from shots where shot=190000',val) & print,val
% MDSVALUE: Error evaluating expression
0
Test 2 - revoke the print
call to show rdb result set:
IDL> set_database,'d3drdb'
% Compiled module: SET_DATABASE.
% Compiled module: MDSDISCONNECT.
% Compiled module: MDSVALUE.
% Compiled module: MDSCHECKARG.
% Compiled module: MDSISCLIENT.
% Compiled module: DSQL.
IDL> n = dsql('select shot from shots where shot=190000',val)
% Compiled module: EVALUATE.
IDL> mdsconnect,'atlas.gat.com'
IDL> n = dsql('select shot from shots where shot=190000',val)
% MDSVALUE: Error evaluating expression
EDIT...
Test 3 - revoke the first dsql
call... i.e. do a back-to-back set_database
and mdsconnect
before issuing any dsql
queries:
IDL> set_database,'d3drdb'
% Compiled module: SET_DATABASE.
% Compiled module: MDSDISCONNECT.
% Compiled module: MDSVALUE.
% Compiled module: MDSCHECKARG.
% Compiled module: MDSISCLIENT.
% Compiled module: DSQL.
IDL> mdsconnect,'atlas'
IDL> n = dsql('select shot from shots where shot=190000',val)
% MDSVALUE: Error evaluating expression
% Compiled module: EVALUATE.
Hi @sflanagan -- And unfortunately, I am also able to reproduce the mdsvalue
error. This is likely a latent bug that has been present in the IDL API for years. Am investigating it this evening and hope to have a fix for it later tonight. (And will also gather test results of the weird print
and help
issue that I saw a few days ago.)
Anyway, thanks for removing the print
and help
and posting the results. Much appreciated.
There are two different failure modes: one is likely easy to fix, one is more complicated.
Simple
The mdsvalue
error can be triggered immediately by using the following sequence.
set_database, my_db
n=dsql('select host_name()', val) & print, val
mdsconnect, my_server
n=dsql('select host_name()', val) & print, val
Repeating the above with mdsconnect, my_server, socket=0
also immediately triggers the error.
Puzzling
An easy way to avoid the "Simple" issue is to do the following.
set_database, my_db
n=dsql('select host_name()', val) & print, val
mdsconnect, my_server, socket=-1
n=dsql('select host_name()', val) & print, val
Unfortunately, repeating the "connect and query" a few times (usually around five times), eventually causes a mdsvalue
error. Understanding and fixing this failure mode will take some time (i.e., I will investigate it while at the airport on Thursday waiting for my flight).
Surprisingly, the mdsconnect
routine was also doing a mdsdisconnect
in one common scenario. (Apparently that was being done to create the !MDS_SOCKET
system variable?). Commented out that mdsdisconnect
and used defsysv
to define !MDS_SOCKET
. This change allows the database proxy to work even after some additional mdsconnect
statements are issued (i.e. the "simple" case described in the previous post).
However, if continue issuing mdsconnect
statements, the database proxy eventually breaks (i.e., the "puzzling" case described in the previous post).
NOTE - I have to catch a flight and will be offline for the rest of Thursday 5-Oct-2023. So did not have time to thoroughly investigate this issue. And make no guarantees regarding the new change.
Fix v3 (PR 2632) was not robust. It failed to preserve the database proxy established by set_database
. The root issue again was simply that IDL's keyword_set()
has slightly different behavior than expected. So Fix v3 has been dropped. And now we have switched to Fix v4, which solves the problem by replacing the keyword_set()
function.
Before diving into the details, first a brief overview of the fixes.
- Fix v1: Used Sean's idea of starting connection counting at 1 (instead of zero). Smallest footprint. Does not work for customers that have servers with new software connecting to servers with old software.
- Fix v2: Josh's idea of using IDL's
arg_present()
function insidemdsisclient.pro
instead ofkeyword_set()
. Did not handle allmdsvalue
scenarios, plus was overly complicated. - Fix v3: Used Sean's idea of opening an extra connection so that the initial
set_database
call would not have a non-zero socket, and thus wouldn't trigger thekeyword_set()
failure mode. Small footprint. Handled most cases A-OK. But failed because the database proxy would break if executemdsconnect
afterset_database
. Plus, the workaround for that issue triggered an error inmdsvalue
. - Fix v4: Josh's idea of tackling the
keyword_set()
issue head on by replacing it with a function that does what the rest of the IDL API code expects. The bulk of this fix was developed on 4-Oct-2023 early in the morning (same time as Fix 3 was developed).
Now for the details on Fix v4 . . .
IDL's keyword_set()
feature was designed to detect whether optional keyword arguments are "set" (non-zero) or not. This function is useful for dealing with keywords that enable / disable features.
When the MDSplus IDL API was written, the developers incorrectly assumed that keyword_set()
was detecting the presence of the optional keyword. Which unfortunately it does not do when the optional keyword is present but has a zero value.
Fix v4 thus replaces all calls to keyword_set()
with a new function, mds_keyword_set()
that reliably detects the presence of an optional keyword argument, regardless of its value. Doing so meets the expectations of the rest of the code in the IDL API.
In addition to fixing the primary issue of Issue 2625, this fix also includes the following improvements:
- fixes the
mdsdisconnect
routine so it correctly passes the socket (if specified) to the underlying C library - includes an integrity check of data structures when adding a new connection
- does a partial rollback of PR 2620 so that the Python connections are no longer broken
However, Fix v4 has a maximum of 64 concurrent connections. A new issue will be filed to remove that limit, or at least understand what is causing it.
Hi @sflanagan and @ModestMC -- Fix v4 is now available as PR 2635. Josh, Stephen and I all tested the fix. However, our testing was only on Ubuntu. (We do not presently have IDL running on a RHEL8 server.)
@mwinkel-dev If you do not have access to our latest RHEL8 cluster, let's get you access.
Hi @smithsp -- Thanks for suggesting that I obtain access to GA's RHEL8 server, in case it is needed for testing future PRs.
My understanding (perhaps incorrect) is that the situation is as follows:
-
After today's holiday, MIT PSFC can create a RHEL8 virtual machine that has IDL. (And perhaps already has such a virtual machine that I am just unaware of.). During my business trip to MIT last week, I only had access to a RHEL8 Docker image (which does not contain IDL because of licensing issues). Last Friday, I used the Docker image to confirm that Fix v4 can build on RHEL8 without issue.
-
Regarding access at GA, I presently have access to the
cybele
bastion host and thus can also accessiris
andomega
. Is that sufficient? If not, let me know what other server(s) I should request access to. (And of course, I also have access tobaldur
andtokzonemit
. I usebaldur
to remotely shutdown the MIT PSFC computers prior to DIII-D power outages, such as the one that is scheduled for this Thursday 10/12 at 3:00 p.m. PDT. In the very unlikely event that the shutdown is rescheduled for the morning of Thursday 10/12, please let me know.)
Hi @sflanagan and @ModestMC -- I just noticed that Josh reviewed and approved PR #2635 (aka Fix v4). So I have just merged it into the alpha
branch.
Although today is a holiday at MIT PSFC, I will be checking my email occasionally throughout the day. Let me know if you encounter any problems with Fix v4.
In the process of fixing Issue 2625, the investigation / testing uncovered these additional issues:
- The data integrity check of the connection (socket) linked list was not being executed because of a typo in PR 2288. (The typo was fixed as part of the PR 2635 fix for this issue.)
- The IDL API's
mdsdisconnect
routine was ignoring the optionalsocket
keyword, when specified. (This too was fixed as part of PR 2635.) - Issue 2637 was filed regarding references to obsolete operating systems in the
mdsconnect.pro
file. - Issue 2638 describes a crash that occurs when attempt to create more than 64 concurrent connections (sockets).
- Issue 2639 illustrates that
mdsvalue
can fail in unusual circumstances and that the bug has been present in the IDL API for many years. - Issue 2640 demonstrates that
mdsdisconnect
does not correctly return the optional status. And also notes that similar problems exist with status returned from othercall_external
statements in the IDL API.
This issue has also generated the following . . .
PR #2643, Issue #2650 and PR #2656.