Use IntrusivePointer to manage TilesetContentManager and raster overlay lifetime
Before this PR, if a Tileset was destroyed while async requests/loading were in progress, the destructor would block until all the async work wrapped up and then actually do the destruction. This was necessary because parts of the async pipeline would write to e.g. Tile, and so we can't have that Tile go away while the operation is in progress or we'd get undefined behavior.
There were two problems with this approach, though:
- It would cause a freeze, potentially for hundreds or thousands of milliseconds. So removing a Tileset mid-game, for example, wouldn't go very well.
- It relied on HTTP requests and background tasks continuing to make progress even while blocking in the destructor. This creates some weird reentrancy in the best case, and in some environments in might not really even be possible.
So this PR switches to a different approach, originally suggested by @baothientran. Tilesets can now be destroyed at any time, and the destructor will never block. It will also unload a) all renderer resources associated with every tile, b) all tile content that is not currently being used by an async operation. Later, when the async operations complete, the remaining tile content will be unloaded.
This is implemented by making TilesetContentManager reference counted, rather than strictly owned by the Tileset. When the load pipelines are idle, only the Tileset's single reference to the TilesetContentManager exists. But each async operation also counts as a reference to the TilesetContentManager. So while the Tileset is destroyed immediately, the TilesetContentManager will stick around until the last async operation finishes, and then it will be destroyed.
For Unreal (and probably others), it is essential to unload the renderer resources at Tileset destruction time, rather than at the arbitrarily-later TilesetContentManager destruction. The reason is that when we switch levels, all UObjects are garbage collected and destroyed, including the UCesiumGltfComponents created to render tiles, which are the "renderer resources" given to cesium-native. If we defer the freeing of these renderer resources until later, the level will have already unloaded, and so the pointers to the UCesiumGltfComponents will be invalid. When we try to use them - even just to release them - we'll crash. We can't extend their lifetime, either. Unreal Engine will complain loudly (i.e. raise a fatal error that causes the process to die) if any UObjects outlive the level. So the only real option is to ensure that they are freed before the Tileset destructor returns.
So that is why the Tileset destructor unloads all tiles even though the TilesetContentManager lives on. However, we still have the problem of the two types of tiles that cannot be unloaded:
- Tiles that are in the
ContentLoadingstate. An async pipeline is currently working on these, and that pipeline culminates in the content being set on the tile, the worker thread renderer resources being created, and the tile state changing to ContentLoaded. We assume that these worker thread renderer resources are not problematic to exist or create after Tileset destruction, which is definitely true in Unreal, at least (because it is illegal to create/destroy/use UObjects anywhere other than the game thread). - Tiles that are in the
Donestate but that have children that are being upsampled from this tile's data. These tiles are more problematic because they, being in the Done state, have main thread renderer resources created already. Luckily we don't need these renderer resources for upsampling, we only need the cesium-native-side tile content.
Ok, so, we want to unload the main thread renderer resources for tiles in category (2), but not the normal tile content (because it's still in use). That's pretty easy to do in principle, but makes things complicated from a tile state machine perspective, because there's no such state. The states are as follows:
- Unloaded: no content, no renderer resources
- ContentLoaded: has content, has worker thread renderer resources
- Done: has content, has main thread renderer resources
- This funny new state: has content, has no renderer resources
We can't derive worker thread renderer resources from main thread renderer resources (we can only go the other way), so once we free the main thread renderer resources, the Tile is stuck in a broken state. There's no way to get it renderable again other than by unloading the whole thing and starting over.
Now, one could argue that the state of the Tiles doesn't make much difference, because we're shutting down the whole Tileset anyway. Which is true, but this weird implicit state is still an easy source of bugs. Plus it's nice if Tileset destruction can just do a normal Tile unload rather than a special, might-leave-the-tile-broken unload.
So that's why this PR introduces a new TileLoadState: Unloading. Tiles in this state can transition back to Unloaded over multiple calls to unloadTileContent, and this method is called automatically by updateTileContent for tiles in this state. Once the tile is fully unloaded, it can be reloaded as normal. In addition to fixing the Tileset shutdown problem, this means that even tiles that are being used as a source for upsampling can be partially unloaded, which is nice.
I should mention that Bao had the interesting idea to store tile content using a thread-safe copy-on-write mechanism. That way an upsample operation can "view" the parent tile content, but if it changes (e.g. is unloaded!), the original is still available for use by the upsample thread. I like it, but worry the cost (in terms of heap allocations and atomic operations) might be a bit high. And also, in the general case, when a parent tile's glTF is modified (not just unloaded), we probably need to propagate that change to upsampled children, not just stick with the original snapshot. Still, it's worth investigating. I've just chosen not to do so here because I think it's a much bigger and riskier change than is implemented in this PR.
Raster Overlays
Raster overlays have a similar situation. While an async RasterOverlayTile load is in progress, we must avoid deleting:
- That RasterOverlayTile
- The RasterOverlayTileProvider that created the tile.
- The RasterOverlay that created the tile provider.
However, we would like to be able to remove a RasterOverlay and have it disappear immediately, even if async loads are in progress. This already worked before this PR, but it was complicated. Now it's... still complicated. But hopefully a bit less so.
The RasterOverlayCollection is owned by the TilesetContentManager, so a RasterOverlayCollection can be deleted at any time, even while an overlay is creating a tile provider or a tile provider is loading a tile. The lifetime relationships are as follows:
-
RasterOverlayCollectionholds an intrusive pointer to all of its overlays, their corresponding placeholders, and their actual tile providers (once they're created). -
RasterOverlayCollectionholds all of the above information in a privateOverlayListstruct, which is reference counted. The RasterOverlayCollection holds an intrusive pointer to this, and async tile provider creation operations also hold a reference to it so that they can store the created tiled provider. TheOverlayListmay outlive theRasterOverlayCollectionthat created it. - While a
RasterOverlayTileis loading, the async load operation holds an intrusive pointer to both the tile being loaded and the tile provider doing the loading. -
RasterOverlayTileProviderholds an intrusive pointer to theRasterOverlaythat created it.
When the RasterOverlayCollection is destroyed, all overlays are removed. When an overlay is removed, the corresponding overlay tiles are immediately removed from all geometry tiles, and then the intrusive pointers to the overlay, tile provider, and placeholder are all dropped. This will usually result in them all being destroyed, though they may stick around a bit longer if there are any async operations (tile provider creation or tile load) in progress.
Other smaller changes:
-
RasterOverlayknows how to create aRasterOverlayTileProvider(or a placeholder), but it no longer maintains a reference to it. It's just a factory. -
RasterOverlay::createTileProvideris now const. Fixes #478
I'm marking this ready to review, and it is, but probably best not to merge it until after the October Cesium for Unreal release.
I really like this change @kring! Thanks for the detailed write-up as well, it made it much easier to understand this change.
I couldn't find any obvious leaks or issues in actual testing, but I did have one concern about this change - particularly regarding this part of the write-up:
...two types of tiles that cannot be unloaded:
- Tiles that are in the ContentLoading state. An async pipeline is currently working on these, and that pipeline culminates in the content being set on the tile, the worker thread renderer resources being created, and the tile state changing to ContentLoaded. We assume that these worker thread renderer resources are not problematic to exist or create after Tileset destruction, which is definitely true in Unreal, at least (because it is illegal to create/destroy/use UObjects anywhere other than the game thread). ...
This may no longer be true in Unreal after we added asynchronous RHI texture creation right? Suppose Unreal's shutdown procedure is something like:
1. Stop ticking
2. GC / destroy all UObjects
3. Flush render thread tasks (resource release tasks queued up by destructing UObjects)
4. Shutdown render thread
5. Shutdown RHI (maybe a separate flushing has to happen first to sync render thread and RHI)
This may be completely made up - it is only from my partial understanding of the resource dependencies across the threads and some guessing... But if this is the case, the only chance we have to safely destroy render resources is during step 1), by waiting for worker thread tasks to resolve. Previously we did not actually create proper render resources from the worker thread, so it was safe to shutdown the renderer before background tile loading completed. Now I think we need to wait for async render resource creation to finish right?
Again, don't know if this is a real concern. But if so, then RHI resource creation on the worker thread might hypothetically fail on renderer-shutdown, we would just get a crash-on-exit. If we are just refreshing the tileset however, then:
- The resource creation will be successful (since the renderer will be alive)
- So all pending tiles at the time of Tileset destruction will eventually get to ContentLoaded
- At which point the TileContentManager destructor will run.
- Which will unload all those pending tiles from the ContentLoaded state
- This will release all render resources, which should again be safe since the renderer is alive in this case.
It's very possible I missed a detail though, fwiw I haven't actually seen any such crash on exit so far.
That's a good point @nithinp7! To test it, I tried adding FPlatformProcess::Sleep(5.0f); to the top of loadTextureAnyThreadPart, started up the Editor, let some tiles load a bit, and then exited the Editor. This pretty much guaranteed that the load threads would try to do load work after the Editor shutdown process was well underway.
And it did indeed cause a crash, but not where I expected. It crashed (consistently) because PhysX had been uninitialized by the time we were trying to bake physics meshes. So, I think you're right: we need to do a bit more to ensure an orderly shutdown.
After thinking about this for a bit, I eventually remembered that Unreal already has a built-in process for asynchronously destroying UObjects: IsReadyForFinishDestroy. So, I extended cesium-native to raise an event (in the form of a SharedFuture) when a Tileset is destroyed, all async work is completed, and all tiles are unloaded. And I used this on the Unreal side to make IsReadyForFinishDestroy only return true after everything has cleaned up completely. This avoids the crash on exit (because Unreal waits for the async work before shutting everything down), but it still avoids blocking the main thread when we delete Tilesets.
I added it in #563 and https://github.com/CesiumGS/cesium-unreal/pull/958, not in this branch.
I think we still need a similar system for RasterOverlays, but I'm curious what you think of this first.
I like the idea and I'm glad we can still keep tileset destruction free of blocking! Just checking my understanding (let me know if I'm misunderstanding any part of this):
So in Unreal IsReadyForFinishDestroy returning false should keep the game systems running - so loading should continue to be safe. Unreal polls IsReadyForFinishDestroy which dispatches main thread tasks in the async system and checks the "tilesetsBeingDestroyed" counter. The dispatched main thread tasks include progressing finished tile loads into ContentLoaded, decrementing the TilesetContentManager reference count, and destroying the TilesetContentManager if the reference count is 0. Lastly, a main thread task is needed to check whether the in-progress-loads-finished promise has been resolved, and decrement the "tilesetsBeingDestroyed" counter in the Cesium for Unreal tileset if so. Once that counter is 0, IsReadyForFinishDestroy will return true, letting the Unreal tileset get destroyed - and letting Unreal shutdown its game systems if wanted without disrupting pending loads.
This looks perfect in Unreal and seems to work as expected! So no objections from me to continue this approach for raster overlays as well!
That's exactly right @nithinp7, good explanation! And thanks for reviewing it. I'll implement it for raster overlays now, too.
@nithinp7 I've fixed the two problems we talked about offline in this branch:
- Cesium ion token refresh for quantized mesh tiles
- The token troubleshooting panel wasn't popping up for tilesets (only raster overlays). This also exists in Cesium for Unreal v1.18.0.
Thanks @kring this looks solid to me! And thanks for fixing those two bugs as well!