Misbehaving driver can cause Fluid to hang on container open

Open zagriswo opened this issue 2 years ago • 7 comments

Describe the bug

We found a bug in our driver that caused Fluid to effectively busy-loop and hang the app. We can fix the driver bug, but it would be good for the container loading code to be more defensive as well.

Our driver returned all the messages the service had via the IDocumentDeltaConnection.initialMessages property, but this set of messages erroneously had a gap in the middle. DeltaManager would go through its fetchMissingDeltas path to try to retrieve the messages in the gap, but our implementation of IDocumentDeltaStorageService.fetchMessages would successfully return an empty stream (that is, no messages and done: true) when asked about that gap. As a result, DeltaManager kept trying to fetch the gap over and over without making forward progress, because it got back an empty stream on every attempt.
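To make the failure mode concrete, here is a minimal sketch of a contiguity check a driver could run over the messages it is about to return. The `SequencedMessage` shape, the `Gap` type, and the `findGaps` helper are hypothetical names used for illustration only; they are not Fluid Framework APIs. The only thing taken from the description above is that sequence numbers are expected to be contiguous.

```typescript
// Hypothetical, simplified message shape: only the sequence number matters here.
interface SequencedMessage {
    sequenceNumber: number;
}

// Half-open range [from, to) of missing sequence numbers.
interface Gap {
    from: number;
    to: number;
}

// Returns every gap between `lastKnown` and the last message in `messages`.
// Assumes `messages` is sorted by ascending sequenceNumber.
function findGaps(lastKnown: number, messages: SequencedMessage[]): Gap[] {
    const gaps: Gap[] = [];
    let expected = lastKnown + 1;
    for (const message of messages) {
        if (message.sequenceNumber > expected) {
            gaps.push({ from: expected, to: message.sequenceNumber });
        }
        // Duplicates or older messages never move `expected` backwards.
        expected = Math.max(expected, message.sequenceNumber + 1);
    }
    return gaps;
}

// Messages 4 and 5 are missing, mirroring the hole described above.
const initialMessages: SequencedMessage[] = [1, 2, 3, 6, 7].map(
    (sequenceNumber) => ({ sequenceNumber }),
);
const gaps = findGaps(0, initialMessages);
console.log(gaps);
```

A driver that ran a check like this before handing messages to the loader could fail fast instead of passing the gap downstream.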

zagriswo avatar Nov 21 '23 20:11 zagriswo

FYI, @rajatch-ff

scarlettjlee avatar Nov 21 '23 22:11 scarlettjlee

There is no way for the container to progress until it fills the gap with the ops because the state would not be consistent then. We cannot skip ops and just proceed.

jatgarg avatar Nov 21 '23 23:11 jatgarg

Agreed on not "just proceed" but there should be some sort of additional check that results in load erroring out. If the driver returned success but no messages, should that be an immediate error? Should there be a limit on how many attempts are made for the same gap?
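As a rough sketch of the second option (a limit on attempts for the same gap), something along these lines is what I have in mind. `FetchRange`, `fillGap`, and `maxEmptyAttempts` are hypothetical names, the fetch callback is synchronous only to keep the example short, and none of this reflects how DeltaManager is actually structured.

```typescript
// Hypothetical fetch callback: returns the messages found in [from, to).
type FetchRange = (from: number, to: number) => { sequenceNumber: number }[];

// Fills [from, to) by calling `fetch` repeatedly. Throws if `maxEmptyAttempts`
// consecutive calls fail to deliver the next expected sequence number.
function fillGap(
    from: number,
    to: number,
    fetch: FetchRange,
    maxEmptyAttempts: number = 3,
): number[] {
    const filled: number[] = [];
    let next = from;
    let emptyAttempts = 0;
    while (next < to) {
        let progressed = false;
        for (const message of fetch(next, to)) {
            // Only accept messages that extend the contiguous prefix.
            if (message.sequenceNumber !== next || next >= to) {
                break;
            }
            filled.push(message.sequenceNumber);
            next++;
            progressed = true;
        }
        if (progressed) {
            emptyAttempts = 0;
        } else if (++emptyAttempts >= maxEmptyAttempts) {
            throw new Error(
                `No progress filling gap at ${next} after ${emptyAttempts} attempts`,
            );
        }
    }
    return filled;
}

// A storage service that "succeeds" with an empty result, as in this bug.
let emptyCalls = 0;
const alwaysEmpty: FetchRange = () => {
    emptyCalls++;
    return [];
};

let loadFailed = false;
try {
    fillGap(4, 6, alwaysEmpty);
} catch (error) {
    loadFailed = true; // load errors out instead of looping forever
}
```

With a check like this the load fails after a bounded number of empty responses, while a service that returns even one message per call still completes normally.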

zagriswo avatar Nov 22 '23 00:11 zagriswo

@zagriswo we'll improve the checks here; the work is backlogged.

kashms avatar Mar 29 '24 19:03 kashms

> Agreed on not "just proceed" but there should be some sort of additional check that results in load erroring out. If the driver returned success but no messages, should that be an immediate error? Should there be a limit on how many attempts are made for the same gap?

Question: In your original description, why did the service keep returning 0 messages, leaving the container stuck? Did it ever proceed, or had the service lost the messages somehow?

jatgarg avatar Apr 03 '24 21:04 jatgarg

@jatgarg it was a bug uncovered by fuzzing. Basically, a hole was made in the op stream, so our driver returned 0 messages in perpetuity because those messages just didn't exist anymore.

zagriswo avatar Apr 03 '24 21:04 zagriswo

In the ODSP driver, we already handle this case: if fetching ops through the delta storage service makes no progress, we give up after 30 seconds and the container closes. We use the public utility API requestOps() for that: https://github.com/microsoft/FluidFramework/blob/e9ba4de75adac95337ef57d7955c9c4975345eb8/packages/loader/driver-utils/src/parallelRequests.ts#L545

You can see the usage of it in ODSP driver here: https://github.com/microsoft/FluidFramework/blob/e9ba4de75adac95337ef57d7955c9c4975345eb8/packages/drivers/odsp-driver/src/odspDeltaStorageService.ts#L225
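For readers who only want the general idea, here is a minimal sketch of a "give up when no progress is made for a while" guard. This is not the implementation of requestOps(); `NoProgressWatchdog` and its methods are hypothetical names, and the clock is injected purely so the example can be exercised without real waiting. Only the 30-second window comes from the behaviour described above.

```typescript
// Hypothetical watchdog: remembers when progress was last made and throws
// once the allowed no-progress window has elapsed.
class NoProgressWatchdog {
    private lastProgressMs: number;

    constructor(
        private readonly timeoutMs: number,
        private readonly now: () => number = Date.now,
    ) {
        this.lastProgressMs = now();
    }

    // Call whenever a fetch returned at least one new op.
    public recordProgress(): void {
        this.lastProgressMs = this.now();
    }

    // Call before each retry; throws if nothing new arrived for `timeoutMs`.
    public check(): void {
        const idleMs = this.now() - this.lastProgressMs;
        if (idleMs >= this.timeoutMs) {
            throw new Error(`No ops fetched for ${idleMs} ms; giving up`);
        }
    }
}

// Simulated timeline using a fake clock (milliseconds).
let fakeTimeMs = 0;
const watchdog = new NoProgressWatchdog(30000, () => fakeTimeMs);

fakeTimeMs = 10000;
watchdog.check(); // 10 s idle: still fine
watchdog.recordProgress(); // some ops arrived

fakeTimeMs = 39000;
watchdog.check(); // 29 s since last progress: still fine

fakeTimeMs = 40000;
let timedOut = false;
try {
    watchdog.check(); // 30 s since last progress: throws
} catch (error) {
    timedOut = true;
}
```

A fetch loop would call `recordProgress()` after every non-empty batch and `check()` before every retry, so an endlessly empty stream ends in an error rather than a hang.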

Let me know if you have more questions. You should be able to use it with your driver easily.

In the future, we will consider whether to move this logic higher up the stack, into the loader/deltastream layer.

jatgarg avatar Apr 13 '24 01:04 jatgarg

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!