Importing a large glb file (778MB) that contains 800 models crashes the editor.

Open AllenDang opened this issue 1 year ago • 20 comments

Tested versions

4.2 stable

System information

macOS 14.5 - forward+ - godot 4.2 stable

Issue description

Importing a large glb file (778MB) that contains 800 models crashes the editor.

Steps to reproduce

  1. Create a new project.
  2. Drag and drop the large glb file into editor.

Minimal reproduction project (MRP)

Here is the glb file https://drive.google.com/file/d/1f74-29422AmZQJohng74ySdELGJptgSA/view?usp=sharing

AllenDang avatar Jun 25 '24 07:06 AllenDang

Can you check 4.3? The CowData size limit was increased to a larger number.

fire avatar Jun 25 '24 11:06 fire

Tried on latest from GitHub (from 4 or 5 days ago). It hangs on import. Restarting the editor automatically restarts the import, which hangs again. For some reason my Attach to Process is being disconnected, and reattaching it doesn't show me the Call Stack. (Mind currently blown.) Just pulled latest and am recompiling.

Sluggernot avatar Jun 25 '24 12:06 Sluggernot

I tried this on 4.3.beta2.official and although it was very slow, it did eventually load after about 6 minutes (the whole time it appeared stuck at 0%). image

Opening the scene took a couple more minutes: image This was on Ubuntu 24.04. Edit: Godot uses about 9 GB of RAM with this scene open.

lvcivs avatar Jun 25 '24 13:06 lvcivs

@lvcivs I created this file just for testing purposes; I wanted to see how Godot would handle it :P

AllenDang avatar Jun 25 '24 14:06 AllenDang

After transferring the model to Godot 4.3 beta2, it still didn't load for me; I waited 28 minutes, then closed it. I also tested this in Blender 3.6.2: after 3 minutes Blender closed itself, which didn't happen with Godot.

Godot v4.3.beta2 - Windows 10.0.19045 - Vulkan (Mobile) - dedicated Radeon RX 560 Series (Advanced Micro Devices, Inc.; 31.0.14001.45012) - Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 Threads)

JekSun97 avatar Jun 25 '24 19:06 JekSun97

The next step is to get profiles for the load.

My recommendation is to use either https://github.com/mstange/samply or https://superluminal.eu/

fire avatar Jun 26 '24 01:06 fire

Yes. I have been able to load the file. I did some quick benchmarking with Visual Studio and have a couple of very small efficiencies made locally. I need to benchmark the before and after when I get some really good changes made to this. Main finding is that _parse_meshes is the main function loading this file. My changes are to GenerateSharedVerticesIndexList and one small one to static SVec3 GetPosition().

Sluggernot avatar Jun 26 '24 01:06 Sluggernot

I will try to review any pull requests that can improve load times on the 778MB glb with nothing broken.

fire avatar Jun 26 '24 01:06 fire

Oh... Nothing broken? Ah, nevermind then. Really, yes my first challenge is proving that it is faster. Thanks!

Sluggernot avatar Jun 26 '24 01:06 Sluggernot

Ok, I didn't know GitHub would add these comments from my own fork because I referenced the issue in the description. I will be avoiding that in the future.

Sluggernot avatar Jun 26 '24 04:06 Sluggernot

Since I ended up looking into this a little bit, I'll share my findings in hopes that it will help.

Measured by clicking "Reimport" on the scene in an otherwise empty project, --verbose says import took 276 seconds (that's a little under 5 minutes). Note that the scene has ~800 meshes that add up to ~39.3M triangles (~50k each, looks reasonably uniformly distributed). Overall I would have expected one mesh per scene here, but I'm not familiar with how Godot workflows work, and it's a good stress test regardless.

perf profile on Linux / editor build with default settings with -fno-omit-frame-pointer -- please note that timings add up to 45% (perf doesn't normalize them):

image

Renormalizing the percentages by dividing by 0.45, and focusing on significant underlying components, we get:

  • 5% scene save
  • 14% tangent space generation
  • 25% normal reprojection after LOD generation (raycasts)
  • 29% simplification (meshopt_simplify)
  • 24% the rest of generate_lods (it's inlined here so hard to see from the profile exactly)

In aggregate, LOD generation takes ~78% here, so definitely good to focus on that. When looking at something like a 5-minute import though, my expectations are usually that small gains are not terribly exciting, so something more significant needs to happen.
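As a quick check of the arithmetic above, here is a minimal sketch that sums the renormalized shares and converts them to approximate wall-clock seconds. It assumes the 276-second reimport and the percentages listed above; the dictionary keys are labels chosen here for illustration, not profiler symbols.

```python
# Renormalized shares from the perf profile above (percent of import time).
shares = {
    "scene save": 5,
    "tangent space generation": 14,
    "normal reprojection (raycasts)": 25,
    "simplification (meshopt_simplify)": 29,
    "rest of generate_lods": 24,
}

IMPORT_SECONDS = 276  # reported by --verbose for the full reimport

# LOD generation = reprojection + simplification + the rest of generate_lods.
lod_keys = [
    "normal reprojection (raycasts)",
    "simplification (meshopt_simplify)",
    "rest of generate_lods",
]
lod_share = sum(shares[key] for key in lod_keys)


def seconds_for(percent, total=IMPORT_SECONDS):
    """Approximate wall-clock seconds for a given percentage share."""
    return total * percent / 100


if __name__ == "__main__":
    print(f"LOD generation: {lod_share}% ~ {seconds_for(lod_share):.0f} s")
    for name, pct in shares.items():
        print(f"{name}: {pct}% ~ {seconds_for(pct):.0f} s")
```

The listed shares do not sum to exactly 100% because only the significant components are included.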

A note on the scale here: each mesh gets approximately 6 LOD levels generated. The work for meshopt_simplify scales with that; the work for normal reprojection scales with the total number of rays, which scales with the total number of triangles in all LODs, times the area factor - looks like we cast 16..64 rays which is a lot of rays :)
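To get a feel for the ray counts involved, here is a rough sketch. It uses the ~50k triangles per mesh, ~6 LODs, ~800 meshes, and 16..64 rays per triangle mentioned above. The ratio between consecutive LODs is not stated in this thread, so `LOD_RATIO` is a guess, and the helper names are made up for illustration.

```python
TRIANGLES_PER_MESH = 50_000   # ~50k per mesh, from the measurements above
LOD_COUNT = 6                 # approximately 6 LOD levels per mesh
MESH_COUNT = 800
LOD_RATIO = 0.5               # ASSUMPTION: each LOD halves the triangle count
MIN_RAYS, MAX_RAYS = 16, 64   # rays cast per LOD triangle


def lod_triangle_counts(base, count, ratio):
    """Triangle count of each generated LOD, largest first."""
    counts = []
    tris = base
    for _ in range(count):
        tris = int(tris * ratio)
        counts.append(tris)
    return counts


def ray_bounds(base, count, ratio, meshes, min_rays, max_rays):
    """Lower and upper bound on the total rays cast for the whole scene."""
    lod_triangles = sum(lod_triangle_counts(base, count, ratio))
    return (lod_triangles * min_rays * meshes,
            lod_triangles * max_rays * meshes)


if __name__ == "__main__":
    low, high = ray_bounds(TRIANGLES_PER_MESH, LOD_COUNT, LOD_RATIO,
                           MESH_COUNT, MIN_RAYS, MAX_RAYS)
    print(f"Total rays: between {low:,} and {high:,}")
```

Under these assumptions the total lands somewhere between hundreds of millions and a few billion rays, which is consistent with reprojection taking a large share of the import.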

If I were tackling this problem, I would entertain the following projects:

  1. For scenes with many large meshes like this, my first goal would be to process meshes in parallel. I'm not familiar with the details of ImporterMesh code but superficially nothing should prevent fully generating each mesh in parallel. Maybe that requires refactoring some of this code to actually be thread-safe. It would also require making sure that the dependent code is thread-safe internally - meshopt definitely is, I assume so is Embree, but some care would be required. That alone would probably get this to be under a minute on an 8-core system if we discount tangent space generation.

  2. I'm skeptical that tangent space generation is efficient here. For a sense of scale, meshopt_simplify does a fair bit more work per call, and it's called ~6 times per mesh here and still only takes twice as much time. I would assume tangent space generation has internal algorithmic inefficiencies and could be improved, but I haven't looked at that code myself.
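The "under a minute on an 8-core system" figure from the parallel-processing suggestion above can be sanity-checked with an Amdahl-style estimate. This is only a sketch: it assumes perfect scaling across cores and takes the 276-second import and the 14% tangent space share from the profile above. The function name is made up for illustration.

```python
def parallel_estimate(total_seconds, parallel_fraction, cores):
    """Amdahl-style estimate: only the parallel fraction scales with cores."""
    serial = total_seconds * (1 - parallel_fraction)
    parallel = total_seconds * parallel_fraction / cores
    return serial + parallel


IMPORT_SECONDS = 276
TANGENT_SHARE = 0.14   # tangent space generation, from the profile above
CORES = 8

# Case A: ignore tangent space generation entirely, parallelize the rest.
without_tangents = parallel_estimate(
    IMPORT_SECONDS * (1 - TANGENT_SHARE), 1.0, CORES)

# Case B: tangent space generation stays serial, the rest is parallel.
with_serial_tangents = parallel_estimate(
    IMPORT_SECONDS, 1 - TANGENT_SHARE, CORES)

if __name__ == "__main__":
    print(f"Discounting tangents:      {without_tangents:.1f} s")
    print(f"Tangents left serial:      {with_serial_tangents:.1f} s")
```

In this model the serial tangent space step alone costs more than the whole parallelized remainder, which is why it is worth looking at separately.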

I would not advise trying to optimize the internals of meshopt_simplify (trust me...). Some small future performance improvements are planned here in meshoptimizer but largely speaking unless this runs into some edge case, which it doesn't look like it does to me, it should be very well tuned already. Same for Embree - I would assume it's impractical to optimize that to the degree that is relevant here. However:

  3. I would certainly think of, at the minimum, reducing the amount of requested work from both meshopt_simplify and Embree here. Notably, meshopt_simplify is called approximately 6 times per mesh here and is asked to generate larger and larger meshes. Because of this, it does more or less the same amount of work: simplifying the mesh 2x is almost the same effort as simplifying the mesh 10x (... well, not quite, but it gets there quickly). However, in LOD chain generations you can usually generate the LODs in the opposite direction: start by requesting a ~1.5x smaller mesh, if that target is reached, ask for ~1.5x smaller mesh again, etc. I don't recall why the order here is reversed but I would consider flipping it and simplifying from the last LOD. I don't think that's going to reduce the work here 6x, but I would expect something like 3-4x improvement in cost to call simplify.

  4. In a similar vein, casting 16-64 rays per triangle is a lot, especially for higher levels of detail. I would probably reduce this in general or at least scale this as the LOD levels get closer to original mesh: in the limit, we're casting at least 16 rays per triangle here for something that only has 1.5x fewer triangles than original mesh, and that just feels wasteful. This has a risk of reducing the quality of the resulting normals because there's a higher chance of missing the mesh or hitting a wrong triangle. Maybe ray casts here aren't the right fit and averaging triangle normals from triangles that are in a bounding sphere of the generated triangle is better, but this brings me to my final point:

  5. We've already discussed this at some point in another issue, but overall I'm not 100% sure the current normal processing in the importer for LODs is generally beneficial. With the normal aware simplifier with the recent fixes, generally speaking I'd expect decent normals to come out of the simplifier itself. Sometimes that's not the case, but I'm not sure the ray cast logic is perfect either, and it's just a lot of complexity to always keep in mind. I do think the reindexing that happens in this code is beneficial for some faceted meshes though. So a good use of time would be to perhaps introduce an option for normal reprojection that would disable the ray cast based normal recreation (I'd expect that alone cuts half of the overhead of LOD generation here), test the option in a release, then maybe default it to skip the normal recreation and see if this comes up.
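The suggestion to flip the LOD chain order can be illustrated with a toy cost model. This is a sketch under a strong assumption, not meshoptimizer's actual cost: each simplify call is charged in proportion to the triangle count of its input mesh, regardless of the target. The function names and the ratios tried are illustrative only.

```python
def cost_from_original(n_triangles, lod_count):
    """Every LOD is simplified from the full-resolution mesh."""
    return float(n_triangles * lod_count)


def cost_chained(n_triangles, lod_count, ratio):
    """Each LOD is simplified from the previous, already smaller LOD."""
    cost = 0.0
    current = float(n_triangles)
    for _ in range(lod_count):
        cost += current      # pay for the mesh we simplify from
        current /= ratio     # the next input is the LOD we just produced
    return cost


def chained_speedup(n_triangles, lod_count, ratio):
    """How much cheaper the chained order is under this toy model."""
    return (cost_from_original(n_triangles, lod_count)
            / cost_chained(n_triangles, lod_count, ratio))


if __name__ == "__main__":
    for ratio in (1.5, 2.0):
        speedup = chained_speedup(50_000, 6, ratio)
        print(f"ratio {ratio}: chained order is ~{speedup:.2f}x cheaper")
```

Under this model the gain depends on the step ratio: roughly 2x for ~1.5x steps and roughly 3x for 2x steps. That is well short of 6x, in line with the caveat above, though the real numbers would have to come from profiling.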

Hopefully this is helpful :) I would be happy to discuss (3)/(5) further and/or maybe contribute a patch or two as I'm generally interested in making sure simplification integration is working well for Godot; I'll leave 1/2/4 to others if they are motivated to work on this.

zeux avatar Jun 28 '24 03:06 zeux

On "I'm not 100% sure the current normal processing in the importer for LODs is generally beneficial", I decided to do a quick comparison on the scene from this file. It looks like it's easy to disable normal override, basically just need to disable the ray caster creation (as mentioned earlier, I believe current splitting logic to be generally beneficial for faceted meshes). I then look at a few low LODs (where the risk of picking a bad normal due to ray casts is maximized), by tuning the LOD bias to be a very small value.

On the left (yes, left, I double checked!) is the import without using the raycaster. On the right is current master (raycaster enabled). Both levels are at ~2200 triangles. I see somewhat similar issues on a few other models - this is not universal, this happened to be the first model I checked, and some models from this scene look about the same with or without the raycaster enabled. But this to me is strong evidence that raycaster should be optional, and probably opt-in.

image image

I've switched to using a smaller version of the scene from the original post (that one has 800 meshes but each mesh is duplicated 8 times, I've switched to a deduplicated version where there's only 100 meshes, easier to work with and faster to reimport). Reimport takes 37 seconds on master and 22 seconds without raycaster enabled.
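For reference, the two reimport timings above work out as follows. This is plain arithmetic on the 37-second and 22-second measurements for the 100-mesh scene; nothing else is assumed.

```python
MASTER_SECONDS = 37       # reimport on master (raycaster enabled)
NO_RAYCAST_SECONDS = 22   # reimport with the raycaster disabled

saved_seconds = MASTER_SECONDS - NO_RAYCAST_SECONDS
saved_fraction = saved_seconds / MASTER_SECONDS
speedup = MASTER_SECONDS / NO_RAYCAST_SECONDS

if __name__ == "__main__":
    print(f"Saved {saved_seconds} s ({saved_fraction:.0%} of the reimport)")
    print(f"Speedup: {speedup:.2f}x")
```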

zeux avatar Jun 28 '24 23:06 zeux

Wow, well that is surprising. Are there any examples where the raycaster was better in visual fidelity? (I understand that's somewhat subjective, but your above screenshot feels fairly objective as to which is "better.") I've been diving further into this section of code throughout the day today, attempting to rally myself before trying multithreading. I really appreciate your write-up. This is absolutely great to see!

Sluggernot avatar Jun 28 '24 23:06 Sluggernot

As someone who works on this, I support changes that improve quality and performance. I can review and help test.

fire avatar Jun 28 '24 23:06 fire

After trying to import this glb file on an S23+ mobile, it ends up crashing after some time, so this does not look to be CowData's fault. I was using https://github.com/godotengine/godot/pull/93064, as it is the fastest when loading big projects, along with the other PR, and it still crashes on reimport.

Saul2022 avatar Aug 20 '24 16:08 Saul2022

@Saul2022 does it also crash on your pc?

Edited:

I would expect like 10-20 gigabytes of cpu ram to be used too.

fire avatar Aug 20 '24 16:08 fire

@Saul2022 does it also crash on your pc?

Can't test on PC, sorry, it's dead: only a black screen even though the light is working, so it's probably a screen issue.

Edit: I also tried without LODs or shadow mesh and with lightbake enabled, by adjusting the import defaults, and it still crashes, so it's not the LODs.

Saul2022 avatar Aug 20 '24 16:08 Saul2022

I've tried this with v4.3.stable.official [77dcf97d8] and this is the resulting memory dump from the Godot crash:

godot.exe.14296.zip

My specs:

specs

anderlli0053 avatar Aug 26 '24 22:08 anderlli0053

I suspect that developers loading that 3D asset need more than 16 GB of RAM.

We can check how big the difference is. If the requirement is closer to 32 GB, that is a lot harder to meet than something like 18 GB.

Godot Engine 4.3-stable

Edited:

I'll try to get a cpu usage chart via samply or https://superluminal.eu/ using a custom build of 4.3-stable

Edited:

  1. Download https://drive.google.com/file/d/1f74-29422AmZQJohng74ySdELGJptgSA/view?usp=sharing
  2. Apple M2 Pro with 32GB of ram.
  3. curl --proto '=https' --tlsv1.2 -LsSf https://github.com/mstange/samply/releases/download/samply-v0.12.0/samply-installer.sh | sh
  4. scons production=yes debug_symbols=yes @ https://github.com/godotengine/godot/releases/tag/4.3-stable
  5. ./bin/godot.macos.editor.arm64 #create a new-game-project
  6. rm -rf ~/Documents/new-game-project/.godot
  7. samply record ./bin/godot.macos.editor.arm64 -e --path ~/Documents/new-game-project/
  8. Drag asset gltf file.
  9. Open asset gltf file as a scene.
  10. Firefox Profiler with stack traces! https://share.firefox.dev/3AH8zLh
  11. I saw around 19 GB of max usage, but I don't have logging.

Godot Engine master

Edited:

  1. Download https://drive.google.com/file/d/1f74-29422AmZQJohng74ySdELGJptgSA/view?usp=sharing
  2. Apple M2 Pro with 32GB of ram.
  3. curl --proto '=https' --tlsv1.2 -LsSf https://github.com/mstange/samply/releases/download/samply-v0.12.0/samply-installer.sh | sh
  4. scons production=yes debug_symbols=yes @ https://github.com/godotengine/godot/commit/db76de5de8a415b29be4c7dd84b99bd0fe260822
  5. ./bin/godot.macos.editor.arm64 #create a new-game-project
  6. rm -rf ~/Documents/new-game-project/.godot
  7. samply record ./bin/godot.macos.editor.arm64 -e --path ~/Documents/new-game-project/
  8. Drag asset gltf file.
  9. Open asset gltf file as a scene.
  10. Around 18 GB of max ram usage during import
  11. Around 9GB when using the internal godot engine formats and loaded the 3d asset in the editor. image
  12. Firefox Profiler with stack traces! https://share.firefox.dev/3X4zE2k

Notes

  1. https://github.com/godotengine/godot/pull/93727 is expected to reduce memory usage.
  2. May be able to optimize import runtime by making GLTFDocument::_parse_image_save_image parallel @ https://github.com/godotengine/godot/commit/db76de5de8a415b29be4c7dd84b99bd0fe260822 image

fire avatar Aug 26 '24 22:08 fire

What is the expected behaviour if we exceed the RAM on the system (e.g. 19 GB usage on a 16 GB machine with 14 GB free)?

Edited:

Personally I think requiring more ram and crashing is expected on large datasets.

We can attempt to use less memory, but there will always be a dataset that exceeds a limit.

fire avatar Aug 26 '24 23:08 fire

We can attempt to use less memory, but there will always be a dataset that exceeds a limit.

Yeah, I guess. I tried with multithreaded import off, vsync disabled and continuous update, but it still crashes. The image files did import, though; only the glb scene did not. Maybe, instead of crashing the engine, the import process could quit before a crash happens and print an error message saying there is not enough RAM to import the scene.
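The idea of aborting with an error instead of crashing could look something like the sketch below. It is purely hypothetical: nothing in this thread shows Godot having such a check, the names `check_import_memory` and `InsufficientMemoryError` are made up, and estimating the peak usage before importing is the hard part that this sketch simply takes as an input.

```python
GB = 10 ** 9


class InsufficientMemoryError(RuntimeError):
    """Raised when an import is predicted not to fit in available RAM."""


def check_import_memory(estimated_peak_bytes, available_bytes, headroom=0.10):
    """Return the required bytes, or raise if the import would not fit.

    `headroom` is an ASSUMED safety margin on top of the estimate.
    """
    required = int(estimated_peak_bytes * (1 + headroom))
    if required > available_bytes:
        raise InsufficientMemoryError(
            f"Not enough RAM to import the scene: need about "
            f"{required / GB:.1f} GB, but only "
            f"{available_bytes / GB:.1f} GB is free."
        )
    return required


if __name__ == "__main__":
    # Figures from this thread: ~19 GB peak on a machine with 14 GB free.
    try:
        check_import_memory(19 * GB, 14 * GB)
    except InsufficientMemoryError as err:
        print(f"Import aborted: {err}")
```

With the figures reported in this thread, the check fails early and prints a message rather than letting the process run out of memory.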

Saul2022 avatar Aug 27 '24 07:08 Saul2022