Misbehaving driver can cause Fluid to hang on container open
Describe the bug
We found a bug in our driver that resulted in Fluid effectively busy-looping and hanging the app. We can fix the driver bug, but it would be good for the container loading code to be more defensive as well.
Our driver returned all the messages the service had via the IDocumentDeltaConnection.initialMessages property, but this set of messages erroneously had a gap in the middle. DeltaManager would go through its fetchMissingDeltas path to try to retrieve the messages in the gap, but our implementation of IDocumentDeltaStorageService.fetchMessages would successfully return an empty stream (that is, no messages and done: true) when asked about that gap. As a result, DeltaManager kept trying to fetch the gap over and over without making forward progress, because it got back an empty stream each time.
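For illustration, here is a minimal sketch of how a driver could check a message list for holes before handing it over as initialMessages. The `SequencedMessage` shape, the `Gap` type, and `findGaps` are hypothetical names made up for this example, not Fluid APIs, and the sketch assumes the messages are already sorted by sequence number.

```typescript
// Hypothetical minimal message shape; real Fluid messages carry many more fields.
interface SequencedMessage {
    sequenceNumber: number;
}

// A run of missing sequence numbers, both ends inclusive.
interface Gap {
    from: number;
    to: number;
}

// Returns every hole in a list of messages sorted by sequence number.
function findGaps(messages: readonly SequencedMessage[]): Gap[] {
    const gaps: Gap[] = [];
    for (let i = 1; i < messages.length; i++) {
        const prev = messages[i - 1].sequenceNumber;
        const curr = messages[i].sequenceNumber;
        if (curr > prev + 1) {
            gaps.push({ from: prev + 1, to: curr - 1 });
        }
    }
    return gaps;
}
```

A driver could log or fail fast on a non-empty result instead of returning the broken list.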
FYI, @rajatch-ff
The container cannot progress until the gap is filled with the missing ops; otherwise its state would be inconsistent. We cannot skip ops and just proceed.
Agreed on not "just proceed" but there should be some sort of additional check that results in load erroring out. If the driver returned success but no messages, should that be an immediate error? Should there be a limit on how many attempts are made for the same gap?
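To make the second question concrete, here is a rough sketch of an attempt limit on a single gap. Everything in it is hypothetical (`FetchRange`, `fillGap`, `maxEmptyAttempts`, and the simplified message shape); it is not how DeltaManager is implemented, only one way such a check could look. Setting `maxEmptyAttempts` to 1 gives the "immediate error" behavior from the first question.

```typescript
// Hypothetical minimal message shape; real Fluid messages carry many more fields.
interface SequencedMessage {
    sequenceNumber: number;
}

// Hypothetical fetch signature: resolves with the messages in [from, to], inclusive.
type FetchRange = (from: number, to: number) => Promise<SequencedMessage[]>;

// Fills the gap [from, to], erroring out after maxEmptyAttempts consecutive
// fetches that succeed but do not advance the gap.
async function fillGap(
    fetchRange: FetchRange,
    from: number,
    to: number,
    maxEmptyAttempts: number = 3,
): Promise<SequencedMessage[]> {
    const result: SequencedMessage[] = [];
    let next = from;
    let emptyAttempts = 0;
    while (next <= to) {
        const batch = [...(await fetchRange(next, to))].sort(
            (a, b) => a.sequenceNumber - b.sequenceNumber,
        );
        const before = next;
        for (const message of batch) {
            if (message.sequenceNumber < next) continue; // already have it
            if (message.sequenceNumber !== next || next > to) break; // hole or out of range
            result.push(message);
            next++;
        }
        if (next > before) {
            emptyAttempts = 0; // forward progress resets the counter
            continue;
        }
        emptyAttempts++;
        if (emptyAttempts >= maxEmptyAttempts) {
            throw new Error(
                `No progress filling gap [${next}, ${to}] after ${emptyAttempts} attempts`,
            );
        }
    }
    return result;
}
```

With a check along these lines, a fetch that keeps succeeding with an empty stream would surface as a load error rather than an endless loop.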
@zagriswo we'll improve the checks here, work backlogged.
Question: in your original description, why did the service keep returning 0 messages and leave the container stuck? Did it ever proceed, or did the service lose the messages somehow?
@jatgarg it was a bug uncovered by fuzzing. Basically, a hole was made in the op stream, so our driver returned 0 messages in perpetuity because those messages just didn't exist anymore.
In the ODSP driver, we already handle this issue: if we don't make progress fetching ops through the delta storage service, we give up after 30 seconds and the container closes. We use this public utility API for that: requestOps() https://github.com/microsoft/FluidFramework/blob/e9ba4de75adac95337ef57d7955c9c4975345eb8/packages/loader/driver-utils/src/parallelRequests.ts#L545
You can see the usage of it in ODSP driver here: https://github.com/microsoft/FluidFramework/blob/e9ba4de75adac95337ef57d7955c9c4975345eb8/packages/drivers/odsp-driver/src/odspDeltaStorageService.ts#L225
Let me know if you have more questions. You should be able to use it with your driver easily.
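As a rough illustration of the "give up after 30 seconds without progress" idea (this is not the requestOps() implementation, and `NoProgressDeadline` is a hypothetical helper invented for this sketch), a driver could track the time since ops last arrived and throw once the budget is spent. The clock is injectable so the behavior can be tested without waiting.

```typescript
// Hypothetical helper: throws once timeoutMs passes without any new ops.
class NoProgressDeadline {
    private readonly timeoutMs: number;
    private readonly now: () => number;
    private lastProgressAt: number;

    constructor(timeoutMs: number, now: () => number = () => Date.now()) {
        this.timeoutMs = timeoutMs;
        this.now = now;
        this.lastProgressAt = now();
    }

    // Call after each fetch with the number of new ops it returned.
    public report(newOps: number): void {
        const current = this.now();
        if (newOps > 0) {
            this.lastProgressAt = current; // progress resets the timer
            return;
        }
        const idleMs = current - this.lastProgressAt;
        if (idleMs >= this.timeoutMs) {
            throw new Error(`No ops fetched for ${idleMs} ms; giving up`);
        }
    }
}
```

A fetch loop would call `report(batch.length)` after every request and let the thrown error close the container.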
In the future, we will consider whether to move this logic higher up the stack, into the loader/deltastream layer.
This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!