cesium-unity icon indicating copy to clipboard operation
cesium-unity copied to clipboard

More reliable AppDomain reloads

Open kring opened this issue 3 weeks ago • 4 comments

Whether via recent changes to cesium-native or to Unity (or both), Cesium for Unity has become increasingly unstable during AppDomain reloads. Crashes are fairly frequent when enter play mode or editing scripts while tilesets are in the process of loading.

#628 should help on this front, but it's still pretty easy to cause crashes.

@david-lively added some code in the unity-domain-reload that should help with this. The idea is that we detect when the AppDomain is about to unload (via AssemblyReloadEvents.beforeAssemblyReload), and then we block, waiting for any pending asynchronous work to complete before continuing. This ensures that work started in one AppDomain won't finish in a different AppDomain, which is disastrous because managed references become invalidated (and there's no reliable way to even check if a reference is valid, AFAICT).

One downside to this is that the blocking makes AppDomain reload slower, which isn't great. But that's definitely better than crashing!

The bigger problem, though, is that it doesn't quite work. For most async work, such as a thenInWorkerThread continuation, we just need to wait a bit, while making sure we process main thread tasks, and eventually, the work will finish. But unfortunately this is not true for network requests. We use UnityWebRequest to download tiles (etc) from the web, and this class it tied up with Unity's game loop. If the game loop doesn't run (because we're blocked in beforeAssemblyReload), then the network requests will never complete. Our "wait for async work to complete" will actually end up being "wait forever."

I think the best way to solve this will be to cancel any pending network requests in beforeAssemblyReload for waiting for the async work to wrap up. If we can make that work, the AppDomain reload should be faster (because we don't need to wait for pending network requests), and we won't ever block forever. It's a little tricky though.

The first problem is that we don't have any way to enumerate the network requests so that we can cancel them. To solve that, I suggest we modify UnityAssetAccessor to keep track of all requests. We can do that efficiently by storing them using the intrusive DoublyLinkedList class. The UnityAssetAccessor::get and request methods should create the UnityAssetRequest instance earlier. Currently, they only do it in the completed callback. That will require changes to UnityAssetRequest to allow it to be constructed before it has a valid UnityAssetResponse. Then, the request object should be added to the DoublyLinkedList. All modifications of the linked list should be done under a std::mutex.

In the UnityAssetRequest destructor, it should remove itself from the linked list (again under std::mutex lock).

Now, in beforeAssemblyReload, we can walk through the linked list of requests and cancel all of them. This can be done by calling Abort on the UnityWebRequest (UnityAssetRequest will need to be modified to hold a reference to this instance), but most likely that won't work. An aborted request probably won't call completed until we get back to the game loop.

So, we should do our own cancelation. Mark the UnityAssetRequest canceled (a flag will do), and then also reject the promise directly. If and when Unity raises the completed callback, we should ignore it for a request that has been canceled.

Be careful about aborting a request that has already completed. Checking isDone is probably not sufficient - we want to know if the completed callback has already been raised. Another flag, I guess. Or a state enumeration.

kring avatar Nov 14 '25 10:11 kring

To add to this topic: My team has been working with Cesium for a considerable amount of time. This issue of AppDomain reload crashes use to be the bane of our existence. This problem has been going on for years. Before we decided to take a nuclear option in dealing with this, our dev team counted 7 crashes per day per developer when working on our product.

However, currently we managed to fix this issue with a fork we maintain internally. We now experience a crash from the system about once every month or so :D. Because of our customers use cases, we cant use Cesium Ion. As such, we removed it entirely from the Cesium Native implementations. This fixed everything. Specifically the changes you can see here: CesiumIonSessionImp.cpp

Obviously this is not an actual solution, but hopefully it can provide some data to narrow down the instability issues. I would love to move off of this fork we maintain and back to the main repo :)

Reag avatar Nov 16 '25 09:11 Reag

Hi @Reag, thanks for the information. What version of Cesium for Unity are you forked off of? I think that we used to have some AppDomain related stability problems that were triggered by the Cesium ion connection. These days, though, those problems are (mostly? entirely?) fixed, and instead we have (much worse?) AppDomain problems related to tile loading. The new problems are a result of changes in cesium-native or perhaps Unity 6. Those changes aren't wrong in and of themselves, but they conspire to cause problems in the very tricky process of reloading the AppDomain while native code is running. We're working on it!

kring avatar Nov 24 '25 09:11 kring

Currently, we have version 1.17 forked, which seems pretty stable. I think we've been maintain this since around 1.14 or so? Roughly? Right now we have some pretty good stability. We do occasionally see a crash related to Cesium (from something asset database related), but these crashes occur very infrequently. I can try using the latest branch for a bit in my dev environment to see if the problem is solved on our end. Ill write back if I find out anything interesting.

Reag avatar Nov 24 '25 09:11 Reag

Thanks, I appreciate you investigating!

kring avatar Nov 24 '25 11:11 kring

I was getting a lot of editor crashes while testing. I noticed at least some of the crash stacktraces were from a corrupt pointer returned from ActivatedRasterOverlay::getTileProvider().

I added code to null the pointers in the ~ActivatedRasterOverlay destructor,

  this->_pOverlay = nullptr;
  this->_pPlaceholderTileProvider = nullptr;
  this->_pTileProvider = nullptr;
  this->_readyPromise = nullptr;
  this->_readyEvent = nullptr;

and it seems like it's been a lot more stable since I did that. My gut feeling is that ActivatedRasterOverlay got destructed elsewhere but is still being used, and at least by nulling the pointer members, when the tile provider is gotten it will be null and the downstream code will properly null check that. Of course it's a band-aid "solution", the fundamental issue is that ActivatedRasterOverlay is being used after death.

(I fully expect it to start crashing on me again after I write this, that happens a lot)

brendan-duncan avatar Nov 26 '25 15:11 brendan-duncan

That wasn't quite the right answer, just got a crash in ~RasterOverlayTile, from the assigning of nullptr to the IntrusivePointers. I'll switch to using reset().

brendan-duncan avatar Nov 26 '25 21:11 brendan-duncan