3D Gaussian splatting 3D Tiles LOD render error
What happened?
bug3:
demo - 副本 - 副本.html:1 Access to XMLHttpRequest at 'http://ecn.t1.tiles.virtualearth.net/tiles/a1321032332031203311.jpeg?n=z&g=15420' from origin 'http://127.0.0.1:5503' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.
Cesium.js:7476 An error occurred in "op": Failed to obtain image tile X: 437455 Y: 212652 Level: 18.
Cesium.js:75 GET http://ecn.t1.tiles.virtualearth.net/tiles/a1321032332031203311.jpeg?n=z&g=15420 net::ERR_FAILED 504 (Gateway Time-out)
How can I resolve these problems?
Reproduction steps
...
Sandcastle example
No response
Environment
Browser:
CesiumJS Version:
Operating System:
Hi @cuijianzhu,
Thanks for this report. I think we will need more information to help us understand the issue. Are you able to provide us a working example with our sandcastle tool that reproduces the issue?
It may also help us to know:
- The environment (Browser, CesiumJS version, OS version) you are using
- Since the title of your post indicates you are trying to load a 3D Tileset with Gaussian splats, can you tell us if you produced this tileset with Cesium ion or by other means?
Thanks, and I hope we can resolve this soon once we have more information.
Hi @lukemckinstry, you can see my 3D Gaussian splatting model in the Sandcastle tool now; please wait a few minutes. https://sandcastle.cesium.com/#c=bZBNawIxEIb/StiTFkkQDwVdpWB7K/Sg7SmXmB11aj6WTHatlv73Zr/AYk/DzLzPOx9oSx8ie2CK2BoIK8v2wVsmM91mMltIJ532jiKrEc4Q2JI5OPdq/tHWRoN+7V1U6CDIbLwYuIgGCGIC1VlhHNAuzJ63XZs3g9+DGUnH0vxjjCXNhXAHdF/8UJ3U1SoefSlmRWv4OBW9Mf8k72Q2acBvVlycsqg3OgC4Tak0vITgw5ztlSFgP0nWrNYdw0mDA14GtBixBuKqKEa977i9vRdevbdbf9vKJllO8WJg1Qx+wu6VVbqAcxHBlkZFILGr9CntqIkaKBcDkhdYMyyW/7yOaaOIUmdfGbPBK8hslYuk/4MZrwp0h7caglGXRnKcrl67Iuc8Fym9p6L3ZqfCjeMv
Hi @lukemckinstry, you can also see my 3D Gaussian splatting model on Google Drive. https://drive.google.com/file/d/18WlCtQeEqUIJjWDYa2kdrlHpuF_zvGOK/view?usp=sharing
(The CORS error seems to be unrelated, or, if in doubt, should be tracked as a separate issue.)
The hint to "wait a few minutes" refers to the fact that the server is pretty slow, and the GLB files are pretty large.
There are several things coming together here. There is currently no full support for real "Level of Detail" of splat data in CesiumJS yet. On top of that, the number of splats that should be displayed here is probably just too large: the first level of detail (below the root) consists of roughly 50 MB of data, and a rough(!) ballpark estimate is that this contains maybe 5 million splats. For comparison, the whole data set from the demo at https://sandcastle.cesium.com/index.html?id=3d-tiles-gaussian-splatting contains about 550,000 splats. It's not clear what the limit for the number of splats is. We've seen similar error messages for ~"slightly more than one million splats", and in any case, I don't think that there is any mechanism that prevents GL errors when there are "too many" splats. So... there are several things to be investigated and taken into account in the ongoing development.
I created some dummy test data to reproduce the issue:
The archive contains three tilesets with splats. Each of them is a quadtree with 3 levels. In the first one, each tile has 16³ splats. In the second one, each tile has 32³ splats. In the third one, each tile has 64³ splats. So in the third one, when the leaf tiles should be displayed, there are 4194304 splats. The archive also contains a sandcastle to load these tilesets. When zooming into the third one, one can see where things wreak havoc.
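As a quick sanity check of those numbers (derived only from the description above, not from the archive itself):

// Splat counts at the leaf level for the three dummy tilesets described above.
// A 3-level quadtree has 1 root, 4 children, and 16 leaf tiles.
const leafTiles = 16;
for (const edge of [16, 32, 64]) {
  const splatsPerTile = edge ** 3; // 4096, 32768, 262144
  console.log(`${edge}^3 splats per tile -> ${leafTiles * splatsPerTile} splats at the leaf level`);
}
// prints 65536, 524288, and 4194304, respectively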
The smaller ones still work, so it's unlikely that this is a matter of ~"invalid files" or something like that. (Although I did have to apply some tweaks and fixes to bring the data into a shape that CesiumJS likes...).
The performance for the 4 million splats is dismal. It claims 30 FPS, but ... while I'm typing this, the typing is laggy - i.e. that open window (without any interaction) slows down the browser and the whole OS. At least it's warm here: The GPU usage of my RTX2070 is pinned to 100%, and the cooling fan goes brrrrr. To what extent this performance issue is a result of that rendering error may have to be investigated, but without any interaction, I'd expect it not to keep the GPU at 100%.
@keyboardspecialist @weegeekps Do either of you have any rough estimate on what the upper-bound of number of splats the sorting module is expected to handle?
I'd expect that to also depend (to some extent, probably, at least) on the spherical harmonics degree. For the given data, the degree is 0. With a higher number of SHs to be stored, the limit may be lower.
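For context (a general property of spherical harmonics, not anything specific to the CesiumJS implementation), the number of SH coefficients grows quadratically with the degree, so the per-splat storage grows accordingly:

// (degree + 1)^2 SH coefficients per color channel:
// degree 0 -> 1, degree 1 -> 4, degree 2 -> 9, degree 3 -> 16
const shCoefficientCount = (degree) => (degree + 1) ** 2;
console.log([0, 1, 2, 3].map(shCoefficientCount)); // [1, 4, 9, 16]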
A quick test with a single (root) tile with 4194304 splats does not seem to generate this error. The performance is still dismal, but the error does not seem to appear. As mentioned elsewhere, I couldn't imagine a reason for that
Vertex buffer is not big enough for the draw call.
given that all this should be far away from any limits that the GL implementations should support. So from the symptoms, the next (still somewhat wild) guess is that this is a race condition where it "selects" the finer tiles with their additional splats, and tries to squeeze them into a previously allocated (smaller) buffer. But I have not yet spent time investigating the cause (that involves reading code). I just created the test data (that only involves writing code 🙂 ).
For what it's worth: the plot thickens in favor of this being a race condition. After a few tries, I could reproduce the error with the official sandcastle at https://sandcastle.cesium.com/index.html?id=3d-tiles-gaussian-splatting , by rotating the view while the tiles were still loading. Something about that sorter or texture generation, probably...
I tried to zoom into that a little bit, but will stop here. The latest state that I've tracked this down to is that a draw command is generated here that uses inconsistent instance counts and buffer sizes, i.e. a debug log
// size of the attribute's vertex buffer, in elements (assuming 4 bytes per element)
const expected = vertexArrayCache._attributes[1].vertexBuffer._sizeInBytes / 4;
// number of instances the draw command will render
const actual = renderResources.instanceCount;
console.log(`Draw command with ${expected} and ${actual}`);
inserted there prints
Draw command with 262144 and 262144 (printed many times)
Draw command with 262144 and 3407872
There it is
[.WebGL-0x3bd40e140000] GL_INVALID_OPERATION: ... Vertex buffer is not big enough for the draw call
Draw command with 4194304 and 4194304 (printed many times)
@keyboardspecialist @weegeekps Do either of you have any rough estimate on what the upper-bound of number of splats the sorting module is expected to handle?
High. The stress test that I checked in sorts about 2.2 million splats; according to my notes, I tested with up to 5 million.
The GPU usage of my RTX2070 is pinned to 100%, and the cooling fan goes brrrrr.
To me, this indicates that it's not a sorting issue, but in fact something else. My guess is that writing/updating the texture, or something else, is causing blocking for a brief moment.
There are several interconnected issues here.
Figuring out what the upper limit is for the sorting procedure alone could be worthwhile. After all, once this number is exceeded, "nothing will work", so it has to be imposed as a hard limit in the traversal/selection/textureGeneration process. Right now, there is no such check. The texture generation has a limit as well: a tileset with (4 leaf nodes with) 16384000 total splats still works, while a tileset with 28311552 total splats fails with a hard crash when trying to allocate the texture. (Details may be machine-dependent.)
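Purely as an illustration of what such a hard limit could look like (there is no such check in CesiumJS today; maximumSplatCount, selectedTiles, and tile.content.splatCount are made-up names, not actual API):

// Sketch: cap the total number of splats during tile selection, assuming coarser
// tiles are visited first so that dropping the tail keeps at least a coarse LOD.
const maximumSplatCount = 4_000_000; // would have to be derived from sorter/texture limits

function limitSplatBudget(selectedTiles) {
  let total = 0;
  const withinBudget = [];
  for (const tile of selectedTiles) {
    const count = tile.content.splatCount;
    if (total + count > maximumSplatCount) {
      break;
    }
    total += count;
    withinBudget.push(tile);
  }
  return withinBudget;
}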
The sorting is slow. One could say that there's not much we can do about this, but there is one thing we can do, namely sort as rarely as possible. This is currently not the case: a debug log in this line indicates that it is sorting in every frame, even without interaction(!) (and note that there's another place where the sorting is triggered, a few lines below that).
There is almost certainly a race condition.
Maybe my first comments here have been misleading. It first appeared to be an issue with the number of splats. But apparently, a draw command is created with an inconsistent state, and the reason is (handwavingly) a race condition that is likely related to the sorter promise handling. (It's hard to be sure about that. I'm skipping some parts of a "rant" that I already posted elsewhere, and for now will just say that the state changes in that update function should be reviewed and documented...). Roughly: it has to be made sure that the vertexArray and the renderResources refer to the same state. This will likely involve time-consuming reverse engineering to figure out what the code is currently doing and which parts of the primitive (and of the things that are stored in the draw command) change "asynchronously".
sorting in every frame, even without interaction
Yeah, that's a major issue then. Outside of using WebGPU to sort the splats, there's not much we can do to speed it up further, but I welcome anyone who has ideas on how to further optimize the sort.
Regarding the sort, one thing that I thought about the other day is to verify if we are passing in the previously sorted splats or if we're resorting from the original state. I expect given we're using a radix sort that if we pass in the previously sorted splats, it should be faster than sorting from the original list every time we need to sort.
This will likely involve time-consuming reverse engineering
@keyboardspecialist is there anything you can provide here to clarify the state changes in update?
sorting in every frame, even without interaction
Yeah, that's a major issue then.
I had another short look. The check at https://github.com/CesiumGS/cesium/blob/bcc5ea383e294c698940dbf9df76ac5ba6961a73/packages/engine/Source/Scene/GaussianSplatPrimitive.js#L839 should probably be done with some epsilon. Right now, some debug log shows
not equal
current view matrix Matrix4 {0: 0.9666075230030049, 1: 0.022568315075562506, 2: 0.25526568047555004, 3: 0, 4: 0.2562613831110639, 5: -0.08512676732123603, 6: -0.9628517731257709, 7: 0, 8: -2.7294139171019077e-14, 9: 0.996114503779604, 10: -0.0880675613373827, 11: 0, 12: 1.2235839463353564e-7, 13: -4488998.596192326, 14: -4519963.746482325, 15: 1}
previous view matrix Matrix4 {0: 0.9666075230030049, 1: 0.022568315075562423, 2: 0.25526568047555004, 3: 0, 4: 0.2562613831110639, 5: -0.08512676732123603, 6: -0.9628517731257709, 7: 0, 8: -2.7214341891124143e-14, 9: 0.996114503779604, 10: -0.0880675613373827, 11: 0, 12: 1.220333408587832e-7, 13: -4488998.596192326, 14: -4519963.746482325, 15: 1}
Yeah. They are not equal. Changing that line to something like
Matrix4.equalsEpsilon(frameState.camera.viewMatrix, this._prevViewMatrix, 1e-6)
already prevents the initial repeated sorting - details TBD.
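In context, the check would look roughly like this (a sketch only; the surrounding code in GaussianSplatPrimitive.js is not reproduced here, and the re-sort scheduling is abbreviated):

// Only treat the view as changed (and hence schedule a re-sort) when the difference
// exceeds a small epsilon, so that floating-point jitter like the one shown above
// does not trigger a sort every frame.
const viewUnchanged = Matrix4.equalsEpsilon(
  frameState.camera.viewMatrix,
  this._prevViewMatrix,
  1e-6,
);
if (!viewUnchanged) {
  Matrix4.clone(frameState.camera.viewMatrix, this._prevViewMatrix);
  // ...schedule a new sort here...
}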
(Why is the view matrix changing without interaction? Nobody knows. See https://github.com/CesiumGS/cesium/issues/11877 for a related issue).
Regarding the sort, one thing that I thought about the other day is to verify if we are passing in the previously sorted splats or if we're resorting from the original state
To my understanding (which is still rather shallow from just browsing over the code), there is no sorting of the positions at all. The sorted order is computed, but that order is never "applied" to some float positions[n * 3] array. The sorted indices are passed to the shader, for accessing the right pixel for the texture.
clarify the state changes
Depending on who is going to spend how much time on what, this may not be applicable. Right now, there are 38 (thirty-eight) assignments of the pattern this\..* = (i.e. modifications of the state of this object) in the update function alone, and it's not easy to figure out which of them may assign a wrong value to a variable, under which conditions, and at which point in time (!), in a way that causes the apparent inconsistency in the draw command (where the creation of the draw command itself does further modifications).
there is no sorting of the positions at all
Sorry, I wasn't very clear. Yes, we don't sort the positions but rather the indices and then use those to quickly access the actual position data. I would expect the sorted indices to be somewhat stable between two frames (especially if there isn't much movement between the frames) so if we are resorting the indices each and every time from a beginning state that could be a performance issue too. It's been a while since I looked closely at this code so this could be irrelevant.
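To illustrate the idea in isolation (this is not the actual CesiumJS sorter, which runs a radix sort in a worker; all names here are made up): sort an index array by view-space depth, seeded with the previous frame's order. A comparison sort benefits from near-sorted input; whether the actual radix sort can benefit in the same way is exactly the open question above.

// positions: Float32Array of length 3 * count; viewMatrix: column-major Matrix4 values.
function sortSplatIndices(positions, viewMatrix, previousIndices) {
  const count = positions.length / 3;
  const indices =
    previousIndices ?? Uint32Array.from({ length: count }, (_, i) => i);
  const depths = new Float32Array(count);
  for (let i = 0; i < count; i++) {
    const x = positions[3 * i];
    const y = positions[3 * i + 1];
    const z = positions[3 * i + 2];
    // view-space z (third row of the column-major view matrix)
    depths[i] = viewMatrix[2] * x + viewMatrix[6] * y + viewMatrix[10] * z + viewMatrix[14];
  }
  // ascending view-space z = farthest first, i.e. back-to-front for blending
  return indices.sort((a, b) => depths[a] - depths[b]);
}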
A quick test with a single (root) tile with 4194304 splats does not seem to generate this error. The performance is still dismal, but the error does not seem to appear. As mentioned elsewhere, I couldn't imagine a reason for that
Vertex buffer is not big enough for the draw call.
given that all this should be far away from any limits that the GL implementations should support. So from the symptoms, the next (still somewhat wild) guess is that this is a race condition where it "selects" the finer tiles with their additional splats, and tries to squeeze them into a previously allocated (smaller) buffer. But I have not yet spent time investigating the cause (that involves reading code). I just created the test data (that only involves writing code 🙂 ).
This problem is unrelated to the number of Gaussian points; simply adding this one line of code will eliminate the error.
The problem itself did receive some attention. And we already noticed that (contrary to what I originally suspected) the reason is not the number of splats, but a race condition.
The other issue ( https://github.com/CesiumGS/cesium/issues/12965 ) may have been perceived as a duplicate of this one. It contains a ~"suggested fix", but ... it is still only an issue, and not a pull request. The suggested fix is not so clear either: the only statement was that "a certain line has to be added", with no indication of where or why. (And as an aside: I'm certainly not trying to do pattern-matching based on some screenshot of code. If in doubt, rather post a permalink to the actual code on GitHub where that line has to be inserted.)
Maybe @keyboardspecialist can/wants to check whether that line, inserted at the right place, is suitable for resolving the issue. I'd say that the state handling of the update function should be reviewed, probably refactored, and certainly documented (extensively), but that might be considered to be some ~"follow-up issue" for the people of the future.
The problem itself did receive some attention. And we already noticed that (contrary to what I originally suspected) the reason is not the number of splats, but a race condition.
The other issue ( #12965 ) may have been perceived as a duplicate of this one. It contains a ~"suggested fix", but ... it is still only an issue, and not a pull request. The suggested fix is not so clear either: the only statement was that "a certain line has to be added", with no indication of where or why. (And as an aside: I'm certainly not trying to do pattern-matching based on some screenshot of code. If in doubt, rather post a permalink to the actual code on GitHub where that line has to be inserted.)
Maybe @keyboardspecialist can/wants to check whether that line, inserted at the right place, is suitable for resolving the issue. I'd say that the state handling of the update function should be reviewed, probably refactored, and certainly documented (extensively), but that might be considered to be some ~"follow-up issue" for the people of the future.
Sorry, this was my oversight. Recently I’ve been refactoring GaussianSplatPrimitive to improve sorting performance. Because my modifications differ quite a lot from the original code, I don’t plan to submit a PR for this issue at the moment. I only wanted to point out the problem in case someone else is able to submit a PR.
The problematic line is here: https://github.com/CesiumGS/cesium/blob/3d78834590b2416c7bb369c245be4b172b5cf4f7/packages/engine/Source/Scene/GaussianSplatPrimitive.js#L1021 (the line reads: this._sorterPromise.then((sortedData) => {). It is missing a state assignment. You can compare it with this line: https://github.com/CesiumGS/cesium/blob/3d78834590b2416c7bb369c245be4b172b5cf4f7/packages/engine/Source/Scene/GaussianSplatPrimitive.js#L1049 (the line reads: this._sorterState = GaussianSplatSortingState.SORTING; // set state to sorting).
Next, I’ll explain in detail the cause of the issue and why adding this line fixes the problem. My English is not very good, so the explanation may not be perfectly clear—thanks for your understanding.
Every time Cesium updates the scene, it calls the update function of GaussianSplatPrimitive. Inside update, when the view changes but the selected tiles do not, the material will not be regenerated, but sorting will still run, which triggers this line:
https://github.com/CesiumGS/cesium/blob/3d78834590b2416c7bb369c245be4b172b5cf4f7/packages/engine/Source/Scene/GaussianSplatPrimitive.js#L1021
Sorting takes some time. If the view changes again before sorting completes, the update function will still execute this line. This effectively becomes:
this._sorterPromise.then(() => setIndexes).then(() => setIndexes)
When the view changes rapidly, because no state is being set, the update function adds multiple then chains to _sorterPromise during an ongoing sort. As we know, in JavaScript, all added then callbacks will be executed. This means that before the next correct sorting result is applied, the previous sorting process may trigger several stale then callbacks, causing the previous sorting results to “pollute” the new ones. Since the length of sorting results can differ each time, using an outdated sorting result can lead to indexes being too long or too short, which produces the error.
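A tiny standalone illustration of the JavaScript behavior described here (not CesiumJS code): every callback attached to the same pending promise runs once it settles, so each extra update() during an in-flight sort queues yet another "apply indexes" callback.

const sorterPromise = new Promise((resolve) => setTimeout(() => resolve("sorted"), 100));
sorterPromise.then((result) => console.log("apply #1:", result)); // attached by one update() call
sorterPromise.then((result) => console.log("apply #2:", result)); // attached by the next update() call
// both callbacks fire when the single sort completes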
Why does adding this line fix the issue?
Because once the line is added, the update function will not execute this._sorterPromise.then while sorting is still in progress. This prevents additional then chains from being attached, and therefore prevents the sorting results from being polluted.
I’ve already tested this. After setting the state, the error no longer occurs.
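For readers following along, a hedged sketch of the guard being described (GaussianSplatSortingState.SORTING is the value used at the linked line 1049; the reset value and applySortedIndexes below are placeholders, not the actual CesiumJS names):

if (this._sorterState !== GaussianSplatSortingState.SORTING) {
  this._sorterState = GaussianSplatSortingState.SORTING; // the missing state assignment
  this._sorterPromise.then((sortedData) => {
    applySortedIndexes(sortedData); // placeholder for the existing callback body
    this._sorterState = GaussianSplatSortingState.SORTED; // placeholder reset so the next sort can be scheduled
  });
}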
It’s worth noting that due to the current performance issues in GaussianSplatPrimitive (mainly the large amount of data being copied and passed to the worker), sorting may not keep up in large scenes, which can result in flickering.
The fix I mentioned above only resolves the “Vertex buffer is not big enough for the draw call.” error and prevents abnormal frames. It does not solve the flickering issue caused by slow sorting in large scenes.
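One commonly used mitigation for the worker copy overhead mentioned above (a sketch under assumptions, not the current CesiumJS code; sortWorker, positions, and prevIndices are illustrative names): pass the typed arrays' underlying ArrayBuffers as transferables, so postMessage moves them instead of structured-cloning them. The trade-off is that a transferred buffer is detached on the sending side, so the main thread cannot read it again until it is transferred back.

sortWorker.postMessage(
  {
    positions: positions.buffer,
    indices: prevIndices.buffer,
    viewMatrix: Matrix4.toArray(viewMatrix), // small, copied normally
  },
  [positions.buffer, prevIndices.buffer], // transfer list: ownership moves to the worker
);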
The reasoning sounds convincing, and the proposed fix is in line with my (subjective, shallow) understanding of what the update function is doing, and what the reason for the error is.
If adding this single line can prevent the error, then this could be added as a quick PR (and maybe become part of the next release).
Considerations for further cleanups, refactorings, and documentation of the update function may be tracked in a dedicated issue.