micromdm icon indicating copy to clipboard operation
micromdm copied to clipboard

Race condition in command queue

Open jessepeterson opened this issue 5 years ago • 2 comments

What version of micromdm are you using?

HEAD on master

What micromdm command did you run?

Extensive use of API sending commands, etc.

What did you expect to see?

Correct commands sent to device.

What did you see instead?

Commands failing to get queued for the device (and subsequently sent out).


When a device's DeviceCommand struct gets too big — usually with too many completed commands — the time to load and save that data gets longer and longer. This raises a race condition between queuing commands (which is a load and save operation) and de-queueing/popping commands for sending to the device (which is also a load & save op). Essentially the two load operations happen before a single save operation and you get overwritten & incorrect DeviceCommand blocks per device.

Removing completed commands restores quick load & writes restores normal operations for me. The foreshadowing of these line came true:

https://github.com/micromdm/micromdm/blob/91c236c8c3a2fa39d02f3e0932ee009e31ad8a48/platform/queue/device_command.go#L30-L33

jessepeterson avatar Dec 27 '18 23:12 jessepeterson

As a temporary workaround until a longer term solution is come up with, this little patch will clear the completed command queue every time the command queue is popped:

--- a/platform/queue/queue.go
+++ b/platform/queue/queue.go
@@ -118,6 +118,8 @@ func (db *Store) nextCommand(ctx context.Context, resp mdm.Response) (*Command,
                }
        }
 
+       dc.Completed = []Command{}
+
        if err := db.Save(dc); err != nil {
                return nil, err
        }

With the device struct being a much more manageable size (by not having every completed command saved) the save and load operations from BoltDB happen quick enough that pushing/popping commands no longer race.

jessepeterson avatar Feb 06 '19 21:02 jessepeterson

Could this be the case for my endless loop with the enrollment profile getting removed as soon as the deviceComplete is sent? Why do one want to save the completed que when web hooks are there to answer every que event? At least there should be an cleaning operation on a schedule or by command.

tomaswallentinus avatar Nov 11 '19 08:11 tomaswallentinus