microscope icon indicating copy to clipboard operation
microscope copied to clipboard

LinkamCMS segfaults in Linux

Open carandraug opened this issue 2 years ago • 21 comments

As reported in https://forum.image.sc/t/driving-a-linkam-cms196-with-python-microscope/78773/2 LinkamCMS fails to initialize in Linux:

Using microscope we do not succeed in completing init_sdk() – it manages to load the DLL (although we hard to write the path in hard)

__class__._lib = ctypes.CDLL("libLinkamSDK.so")

We’re then able to read the license correctly, however when we get to:

   cfunc = ctypes.CFUNCTYPE(_uint32_t, _CommsHandle, _ControllerStatus)(
       __class__._on_new_value
   )

We get this output:

*** stack smashing detected ***: terminated
Aborted

I've tried in Linux, even without hardware, and can reproduce the issue so the problem is not on the SDK.

carandraug avatar Mar 20 '23 18:03 carandraug

Here's a minimal code example showing the issue:

$ cat reproduce-273.py 
import ctypes

class _ControllerStatusFlags(ctypes.Structure):
    """ControllerStatus.flags struct from C headers"""

    _fields_ = [
        ("controllerError", ctypes.c_uint, 1),
        ("heater1RampSetPoint", ctypes.c_uint, 1),
        ("heater1Started", ctypes.c_uint, 1),
        ("heater2RampSetPoint", ctypes.c_uint, 1),
        ("heater2Started", ctypes.c_uint, 1),
        ("vacuumRampSetPoint", ctypes.c_uint, 1),
        ("vacuumCtrlStarted", ctypes.c_uint, 1),
        ("vacuumValveClosed", ctypes.c_uint, 1),
        ("vacuumValveOpen", ctypes.c_uint, 1),
        ("humidityRampSetPoint", ctypes.c_uint, 1),
        ("humidityCtrlStarted", ctypes.c_uint, 1),
        ("lnpCoolingPumpOn", ctypes.c_uint, 1),
        ("lnpCoolingPumpAuto", ctypes.c_uint, 1),
    ]

cfunc = ctypes.CFUNCTYPE(ctypes.c_uint64, _ControllerStatusFlags)
cfunc(lambda x: 1)
$ python3 reproduce-273.py 
*** stack smashing detected ***: <unknown> terminated
Aborted

Looking at ctypes docs I see this warning:

Warning ctypes does not support passing unions or structures with bit-fields to functions by value. While this may work on 32-bit x86, it’s not guaranteed by the library to work in the general case. Unions and structures with bit-fields should always be passed to functions by pointer.

And indeed that's we are doing in:

        cfunc = ctypes.CFUNCTYPE(_uint32_t, _CommsHandle, _ControllerStatus)(
            __class__._on_new_value
        )

_ControllerStatus is a union and one of its fields is the _ControllerStatusFlags bit-field struct. This only crashes if the _ControllerStatusFlags struct has more than 12 fields.

carandraug avatar Mar 20 '23 20:03 carandraug

This issue is happening when setting the callback for status events:

typedef void (*EventNewValueCallback)(CommsHandle hDevice, LinkamSDK::ControllerStatus status);

We can't pass unions and structures and bit-fields by value. But that's the interface for the callback so we can't just change it pointer.

But I think we can drop the bit-fields ourselves. ControllerStatus is a union where one member is a struct with access to each bit and a 64bit int that holds all the flags:

    union ControllerStatus
    {
        struct
        {
            unsigned    controllerError                 : 1;
            // ... 63 other bit-fields
        }               flags;                                  ///< Accessor to the flags.
        uint64_t        value;                                  ///< Flags as a single value;
    };

If the problem is the bit-field, we should be able to remove the flags struct from the ControllerStatus union and simply pass a uint64_t and manually access the bits we care about ourselves.

Unfortunately, the Linkam API does this a lot so we probably need to do the same in a bunch of other places

I do not have access the hardware to do this though.

carandraug avatar Mar 21 '23 00:03 carandraug

Test fix in my 273-controller-status-value. Needs testing in real hardware and probably the same hack needs to be applied to all other unions in LinkamCMS. A nicer fix is likely to be possible but without hardware access I won't be making any deep changes.

carandraug avatar Mar 21 '23 01:03 carandraug

Any real hardware tests will need either Tom or Jingyu to contribute. However I'm not sure either of them runs Linux anywhere. Do you think this is limited to Linux or will it also be an issue on Windows?

iandobbie avatar Mar 21 '23 13:03 iandobbie

I will try it as soon as I get access to the stage again and will keep you posted

edwardando avatar Mar 21 '23 22:03 edwardando

Hi there, this is a little complex but:

  1. On line 430... _ControllerStatusValue = _uint64_t should probably be _ControllerStatusValue = _uint64_t

  2. We have an error when making this correction, but it's due to recent developments and not the commit in itself (i.e., we don't have this error with the 0.6 version from pip, but we do have it with the latest code from master):

TypeError: __init__() missing 1 required positional argument: 'index'
Exception ignored in: <function _LinkamBase.__del__ at 0x7f098ebfef70>
Traceback (most recent call last):
  File "/home/lnd/LaserMeltingMicroscopeControl/python-microscope-carandraug-venv/lib/python3.8/site-packages/microscope/stages/linkam.py", line 1107, in __del__
    self._process_msg(Msg.CloseComms)
  File "/home/lnd/LaserMeltingMicroscopeControl/python-microscope-carandraug-venv/lib/python3.8/site-packages/microscope/stages/linkam.py", line 1119, in _process_msg
    if not self._lib.linkamProcessMessage(
AttributeError: 'NoneType' object has no attribute 'linkamProcessMessage'

edwardando avatar Mar 24 '23 15:03 edwardando

Starting again from the 0.6.0 tag, and applying your changes we definitely get further, but we're getting a segfault in open_comms():

    self._process_msg(
        Msg.OpenComms,
        byref(self._commsinfo),
        byref(self._h),
        result=self._connectionstatus,
    )

...with both versions of the .so file. Any thoughts? Thanks for your help!

edwardando avatar Mar 24 '23 16:03 edwardando

Before fixing that, I'd prefer we fix the reason why you can't use the current development version (otherwise I will need to fix the segfault in 0.6.0 and then "forwardport" it to the development). From the error message, I think the error comes because you don't specify the index argument when you construct the stage.

Can you try to construct it with LinkamCMS(uid="", index=0) and see if that moves it forward?

carandraug avatar Mar 29 '23 12:03 carandraug

Yes, will do as soon as I can access the linkam stage again!

edwardando avatar Apr 06 '23 07:04 edwardando

OK the index=0 seems to improve things, we are now able to work from your branch with the fix to line 430 mentioned above.

We're hacking the paths to the .so file and the .lsk file but this is fine for testing.

Now we're able to get further, and run your script, but the status we get back from the stage seems to say "connected=False", what could that be a symptom of?

edwardando avatar Apr 11 '23 16:04 edwardando

guys?

edwardando avatar Apr 24 '23 20:04 edwardando

Can we help?

iandobbie avatar Apr 24 '23 20:04 iandobbie

  • Is the stage running the correct firmware version for the library you are using?
  • Take a look at the other flags in the _connectionstatus on the stage object to see which error, if any, they are indicating.

mickp avatar Apr 24 '23 21:04 mickp

Adding to what @mickp said: we're redirect the logs to devnull, can you change it to some filepath (writable) and check if there's some hints there?

carandraug avatar Apr 24 '23 23:04 carandraug

Are we sure this isn't a license issue? The SDK does require a license file to work.

iandobbie avatar Apr 25 '23 00:04 iandobbie

It's not a license issue, this we've checked carefully. OK we'll look into the firmware version...

edwardando avatar Apr 25 '23 06:04 edwardando

Is there any way to borrow a windows machine and test on that. The system is usb so a loan of a laptop might help. I suggest this as the current code works well on the Diamond system and was definitely working on the Oxford one when I left 18 months ago. These both use windows and that dll. I know this may not suit your purposes but should so that the firmware etc are genarlaly working.

iandobbie avatar Apr 25 '23 13:04 iandobbie

I have borrowed a windows machine and confirmed that I can drive the stage. The state of affairs is as follows:

  • On windows: microscope works as expected (both reading values, and moving stage), but uid needs to be passed
  • On Ubuntu 20.04: reading values mostly works (uid not required), stage move gives the following error:

stack smashing detected ***: terminated Aborted

Is this normal?

edwardando avatar May 19 '23 17:05 edwardando

The UID being required is normal. in the case of Ubuntu we don't know whether that is normal since we never tested in Linux:

https://github.com/python-microscope/microscope/blob/a921847ae80d51d451a2e73af3b68070479c42b2/microscope/stages/linkam.py#L979

However, it is certainly not the desired outcome. I do not have access to a Linkam stage to fix that. Now do you know it should work in Windows, if you could fix it to work in Linux too that would be great.

carandraug avatar May 19 '23 18:05 carandraug

sure, but debugging a segfault like this is quite hard, I don't know where to start...

edwardando avatar May 19 '23 18:05 edwardando

I recommend creating a minimal example using ctypes only and without Python-Microscope. The script shouldn't be too long since there's a lot of stuff you won't need (such as setting all the callbacks and adding device settings) and you can mostly copy and paste from Python-Microscope. I would work from there to identify what it is that we're doing wrong.

carandraug avatar May 22 '23 12:05 carandraug