ophyd icon indicating copy to clipboard operation
ophyd copied to clipboard

resolve EpicsMotor stalls sometimes at end of move

Open prjemian opened this issue 6 years ago • 20 comments

fixes #782

prjemian avatar Dec 13 '19 21:12 prjemian

Much of this logic is already in https://github.com/bluesky/ophyd/blob/3cbf73b8f71b9c2494ffb097ac1a5e9b5c0f1e23/ophyd/epics_motor.py#L224-L271

I suspect that this is going to interact badly with that logic.


It is subscribed a bit differently on the 1.3.x branch

https://github.com/bluesky/ophyd/blob/9f21ca2690a4a8a1db86d2eeb9a5628e9fc03123/ophyd/epics_motor.py#L81 and https://github.com/bluesky/ophyd/blob/9f21ca2690a4a8a1db86d2eeb9a5628e9fc03123/ophyd/epics_motor.py#L222-L267

tacaswell avatar Dec 13 '19 22:12 tacaswell

Haha, I see three of us pounced in parallel.Thanks for the contribution, @prjemian, and good luck sifting through our feedback. :upside_down_face:

danielballan avatar Dec 13 '19 22:12 danielballan

Much of this logic is already in

https://github.com/bluesky/ophyd/blob/3cbf73b8f71b9c2494ffb097ac1a5e9b5c0f1e23/ophyd/epics_motor.py#L224-L271

I suspect that this is going to interact badly with that logic.

It is subscribed a bit differently on the 1.3.x branch

https://github.com/bluesky/ophyd/blob/9f21ca2690a4a8a1db86d2eeb9a5628e9fc03123/ophyd/epics_motor.py#L81 and

https://github.com/bluesky/ophyd/blob/9f21ca2690a4a8a1db86d2eeb9a5628e9fc03123/ophyd/epics_motor.py#L222-L267

This might be the better place for the logic in the dmov_callback(). This logic:

        was_moving = self._moving
        self._moving = (value != 1)

should change to:

        was_moving = self._moving
        self._moving = not self.motor_done_move.value

This logic is problematic:

        if was_moving and not self._moving:

since it requires the transition to declare the move complete. Suggest relocating the done_moving()` call thus:

        if self._started_moving and not self._moving:
            self._done_moving(success=success, timestamp=timestamp,
                              value=value)

prjemian avatar Dec 14 '19 04:12 prjemian

So, removing the dmov_callback() from move() and adding to the _move_changed() method as described results in the motor stall re-appearing. Where is the status message declared done? It's not directly in positioner._done_moving(). Very tangly code. Is it one of the st._finished() lines in signal.py?

prjemian avatar Dec 14 '19 05:12 prjemian

In the end, I restored the dmov_callback() to ensure the status message is finished. There is no problem with calling this more than once, the code has an early bailout for

        if self.done:
            return

Is there a better way to be certain that st._finished() gets called?

prjemian avatar Dec 14 '19 05:12 prjemian

since it requires the transition to declare the move complete

My understanding is that the motor record will always flash DMOV low and then high to indicate that it is moving (even if it already at it's target destination).

I am still :-1: on this as-is because it is short-circuiting a whole bunch logic in _move_changed which I assume is there for a reason. If there are cases where that logic is currently correctly not reporting the motion done, this extra callback will flip the status and erroneously report that motion is done when it is not.

Do we have a clear understanding of the source of the problem? @prjemian do you have any timing diagrams of what the various PVs are doing when it does and does not work correctly?

My current guess is that we are hitting race-conditions on some of the internal state of EpicsMotor between the code running on the monitor thread and the code running on the main thread that is causing the next m.move() execution to overlap with the previous _move_changed execution. Thus, we miss the falling edge of DMOV to indicate that we started moving for the move the user requested and then ignore the rising edge because we think that does not belong to us.


In https://github.com/bluesky/ophyd/blob/3cbf73b8f71b9c2494ffb097ac1a5e9b5c0f1e23/ophyd/positioner.py#L195-L198

we subscribe the status object to self._SUB_REQ_DONE on the ophyd object and then in https://github.com/bluesky/ophyd/blob/3cbf73b8f71b9c2494ffb097ac1a5e9b5c0f1e23/ophyd/positioner.py#L200-L208 process those callbacks which is in turn called from https://github.com/bluesky/ophyd/blob/3cbf73b8f71b9c2494ffb097ac1a5e9b5c0f1e23/ophyd/epics_motor.py#L270-L271

tacaswell avatar Dec 16 '19 13:12 tacaswell

@tacaswell : The .DMOV transitions from 1 to 0 at the start of a move and transitions from 0 to 1 at the end, after the completion of the primary move plus any additional move complexities (such as backlash corrections or retries) controlled by the motor record. See this documentation for the details: https://epics.anl.gov/bcda/synApps/motor/motorRecord.html#Fields_status

About the .DMOV field, the documentation states:

This field is set to 0 when the motor record begins a motion, and remains 0 through any retries and backlash corrections that may be required until the motor record has completely finished that motion, whereupon the field is set to 1. DMOV is guaranteed to execute and post a 1/0/1 pulse when the motor is commanded to move--even if no motion actually occurs because the motor was commanded to move to its current position.

prjemian avatar Dec 16 '19 14:12 prjemian

It is clear from the structure of the plan used to test this problem (when executed with logging level set to DEBUG):

def move(motor, label, dest, delay_s):
    yield from bps.checkpoint()
    yield from bps.mv(m1, dest)
    print(f"{label}:   {dest} {motor.position}")
    yield from bps.sleep(delay_s)

that the next move has not yet begun. If the next move command had been issued, the RunEngine logging would indicate a checkpoint message had been processed since the most recent move activity. What makes this obvious is that no checkpoint message is logged at the time the motor stalls. The logging indicates the motor has finished its most recent move yet the RunEngine is waiting.

prjemian avatar Dec 16 '19 14:12 prjemian

Is there a better way to be certain that st._finished() gets called?

prjemian avatar Dec 16 '19 14:12 prjemian

I agree that this change has implications of change for additional methods such as _move_changed() and/or _done_moving(). What tools are there to diagnose further the handling of the status objects?

prjemian avatar Dec 16 '19 14:12 prjemian

What is the behavior of DMOV is the motor is already moving and then we instruct it to move again?

The monitor subscriptions are run on a background thread (one sourced via libca and then trampolined to python).

The conflict is between the current move and the previous move's clean up code.

tacaswell avatar Dec 16 '19 17:12 tacaswell

If the .VAL field is changed while the motor is moving (.DMOV = 0), the .DMOV field will remain at zero until the new move is finished.

prjemian avatar Dec 16 '19 20:12 prjemian

The monitor subscriptions are run on a background thread (one sourced via libca and then trampolined to python).

What is trampolined?

The conflict is between the current move and the previous move's clean up code.

How can I diagnose the code in the current master at this level?

prjemian avatar Dec 16 '19 22:12 prjemian

Here's an example of the timing of changes in motor VAL, MOVN, and DMOV fields using this command: camonitor sky:m1.{VAL,MOVN,DMOV}

before the move

sky:m1.VAL                     2019-12-14 06:23:54.200602 -0.1  
sky:m1.MOVN                    2019-12-14 06:23:54.200602 0  
sky:m1.DMOV                    2019-12-14 06:23:54.200602 1  

step by 1.0

sky:m1.DMOV                    2019-12-14 06:23:54.200602 0  
sky:m1.VAL                     2019-12-16 17:07:04.776139 0.9  
sky:m1.MOVN                    2019-12-16 17:07:04.776439 1  
sky:m1.MOVN                    2019-12-16 17:07:06.048334 0  
sky:m1.DMOV                    2019-12-16 17:07:06.048334 1

step by 1.0, then step by 1.0 during the first move

sky:m1.DMOV                    2019-12-16 17:07:06.048334 0  
sky:m1.VAL                     2019-12-16 17:07:23.618395 1.9  
sky:m1.MOVN                    2019-12-16 17:07:23.620621 1  
sky:m1.VAL                     2019-12-16 17:07:24.163468 2.9  
sky:m1.VAL                     2019-12-16 17:07:24.840445 2.9  
sky:m1.MOVN                    2019-12-16 17:07:24.840445 0  
sky:m1.MOVN                    2019-12-16 17:07:24.840492 1  
sky:m1.MOVN                    2019-12-16 17:07:26.143724 0  
sky:m1.DMOV                    2019-12-16 17:07:26.143724 1

step by -1.0 (with backlash correction)

sky:m1.DMOV                    2019-12-16 17:11:47.487712 0  
sky:m1.VAL                     2019-12-16 17:12:13.615191 1.9  
sky:m1.MOVN                    2019-12-16 17:12:13.615351 1  
sky:m1.MOVN                    2019-12-16 17:12:14.883530 0  
sky:m1.MOVN                    2019-12-16 17:12:14.883580 1  
sky:m1.MOVN                    2019-12-16 17:12:15.184805 0  
sky:m1.DMOV                    2019-12-16 17:12:15.184805 1  

step by -1.0 (with backlash correction), then step by -1.0 during the first move

sky:m1.DMOV                    2019-12-16 17:12:15.184805 0  
sky:m1.VAL                     2019-12-16 17:12:26.397074 0.9  
sky:m1.MOVN                    2019-12-16 17:12:26.397250 1  
sky:m1.VAL                     2019-12-16 17:12:26.997775 -0.1  
sky:m1.VAL                     2019-12-16 17:12:27.622810 -0.1  
sky:m1.MOVN                    2019-12-16 17:12:27.622810 0  
sky:m1.DMOV                    2019-12-16 17:12:27.622810 0  
sky:m1.MOVN                    2019-12-16 17:12:27.622865 1  
sky:m1.MOVN                    2019-12-16 17:12:28.925954 0  
sky:m1.MOVN                    2019-12-16 17:12:28.926168 1  
sky:m1.MOVN                    2019-12-16 17:12:29.226441 0  
sky:m1.DMOV                    2019-12-16 17:12:29.226441 1  

prjemian avatar Dec 16 '19 23:12 prjemian

What is trampolined?

The code for this now lives in https://github.com/bluesky/ophyd/blob/master/ophyd/_dispatch.py . Due to how contexts work in libca/pyepics (this is the c-level structure that holds the various sockets / threads etc required to handle the asynchronous part of the epics protocol and manages the socket <-> virtual circuit <-> channel mapping) running python code in response to a monitor that does further caput/caget will fail. I do not remember exactly why this is off the top of my head (maybe @klauer can remind me ;) ). Instead the function we actually register with pyepics just "bounces" the values to a second queue that we process in a python thread which gets around this problem.


I think we just have to start putting in logging statements and sort out the timing between the things happening on different threads and at the IOC level. I don't know any shortcut to tracking down race conditions, but at least we have a way to reliably reproduce this one!

tacaswell avatar Dec 17 '19 14:12 tacaswell

Thanks for the explanation, @tacaswell ! The next thing to do is to return to the master branch and begin diagnosis with that code, to determine what needs to be changed so the motor stall condition does not recur. I'll pause further work on this branch until that diagnosis is fruitful.

prjemian avatar Dec 17 '19 15:12 prjemian

Will make a VM image available to the DAMA team.

prjemian avatar Dec 17 '19 23:12 prjemian

Do we have a clear understanding of the source of the problem? @prjemian do you have any timing diagrams of what the various PVs are doing when it does and does not work correctly?

Virtual Circuit Unresponsive from EPICS libca: See the three adjacent comments starting with: https://github.com/bluesky/ophyd/pull/787#issuecomment-568296988

prjemian avatar Dec 22 '19 20:12 prjemian

Possibly related to #788? Will test new ophyd release 1.4.0 to see if this remains a problem.

prjemian avatar Apr 04 '20 14:04 prjemian

Need to stress test this with current release to see if still a problem.

prjemian avatar Jul 22 '20 18:07 prjemian

dropping this branch & PR

prjemian avatar Mar 29 '23 21:03 prjemian