FluidFramework icon indicating copy to clipboard operation
FluidFramework copied to clipboard

Infinite waits while resolving data stores / requests

Open vladsud opened this issue 4 years ago • 13 comments

Problem Statement

ContainerRuntime.getDataStore() and ContainerRuntime.request() result in infinite wait when request hits non-existing data store. It will be resolved if/when container eventually processes attach op for such data store. However if request is made for wrong (non-existing) data store, such request deadlocks. This also shows up in end-to-end workflows where container is created paused, and URI resolution with waits will result in deadlock. Thus host may need to implement multi-stage process of doing initial request within waits, and on 404, resume container and retry request with infinite wait.

Proposed solutions

Use waitContainerToCatchUp() mechanism

Use it as a preparation step before resolving URI (or as a step in above mentioned algorithm after getting 404) as a wait to wait for container to catch up. This will fundamentally not change how resolution is made, but will allow us to remove waits from getDataStore() and request resolution.

Incorporate sequence number into URIs

When URI is generated, client can add (as an optional argument) current reference sequence number. This would allow (or rather - require) client resolving URI to get up to this sequence number before finishing resolving request. The nice pro of this design is that it has better user experience (or rather allows to provide better user experience) - we can decide to show progress UI or some other indication while we catch up, for user to better understand why it takes long to get to what this URI represents, and now show partial or previous state of document / component. The potential con of this approach is that URIs can be generated only for content that actually is in the file, i.e. was acked by server. While it looks like con, it will also not allow a scenario where user is offline, creates URI and sends it over Teams or email, where URI reaches consumers without referenced fluid content eve making it to a file (note that background sync of email will happen on some systems when laptop/phone looks off, so browser may be suspended and fluid content not synching, but URI may make it through just fine). As result user may never be able to resolve such URI. Pushing this problem to producer has certain pros and cons, but is in line with work in other areas, in particular sequential IDs to represent ranges of text in Scriptor and URIs being generated only for online content.

vladsud avatar Dec 02 '20 17:12 vladsud

@leeviana , @montselozanod , @anthony-murphy , @curtisman

I think we should go with second approach. Would be great to add more people to discussion from app side to provide their feedback.

vladsud avatar Dec 02 '20 17:12 vladsud

I like the second approach. :) I think offline mode was always going to be a "con" for whatever solution we ended up here/with similar challenges and we'll need to work out our vision/support there in a separate larger effort/architecture design. Not sure who on the app side cares too much about this right now (mostly just affects future offline thinking), but lets start with @AndreiZe

leeviana avatar Dec 02 '20 18:12 leeviana

Personally, I like them both. I view sequence number as a hint, but for scenarios listed above, and because it's user modifiable we can never fully trust it, but if it's there and reasonable we can leverage it. However, we'll likely always want a fallback, and waiting for catch up seems appropriate

anthony-murphy avatar Dec 02 '20 18:12 anthony-murphy

URL is the identity of the reference, so having the minimum sequence number that the URL Is valid make sense. It is very likely that changing a URL will render it useless, so I don't think it necessary a hint.

curtisman avatar Dec 02 '20 21:12 curtisman

I'll use this uber task for all related issues (and dedup the rest).

Here is what we need to do:

  1. Remove waits from getRootDataStore / ContainerRuntime.request() APIs (former #3875)
  2. Add UT coverage for waitContainerToCatchUp() (former #3874)
  3. Requests against paused container may result in infinite wait if component requests exists in future (ops only), not snapshot (former #3085)

Not really related, but occasionally mentioned in similar topics:

  1. request for a URL with no handler throws: #2859
  2. IContainerRuntime*.get*DataStoreRuntime: Supply optional AbortSignal for cases where we wait: #1192

vladsud avatar Feb 26 '21 23:02 vladsud

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

ghost avatar Dec 01 '22 03:12 ghost

not stale

leeviana avatar Dec 01 '22 21:12 leeviana

This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

not stale

leeviana avatar Feb 01 '23 18:02 leeviana

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

not stale

leeviana avatar Aug 01 '23 21:08 leeviana

not stale

leeviana avatar Jan 29 '24 20:01 leeviana

not stale

leeviana avatar Jul 29 '24 17:07 leeviana