Significant Performance Drop and High CPU Usage with BatchedMesh

Open lanvada opened this issue 1 year ago • 50 comments

Description

Hello,

I exported a building model from Revit in glTF format and merged meshes with the same materials to manage their visibility in Three.js using the BatchedMesh class. However, I've encountered a significant performance issue when rendering these merged meshes with BatchedMesh compared to using Mesh.

Performance Comparison:

  • BatchedMesh Rendering:
    • CPU Usage: ~40%
    • GPU Usage: ~30%
    • Frame Rate: ~20 FPS
  • Mesh Rendering:
    • CPU Usage: ~15%
    • GPU Usage: ~90%
    • Frame Rate: ~60 FPS

This drastic difference in performance is concerning, especially the high CPU load and low frame rate when using BatchedMesh. I've already set .perObjectFrustumCulled and .sortObjects to false in BatchedMesh, which, if set to true, leads to an even more severe frame rate drop.

Additionally, I'm using three-csm and postprocessing frameworks alongside Three.js.

System Configuration:

  • CPU: Intel i7-10700
  • GPU: NVIDIA RTX 2080 Super

Could someone help me understand why BatchedMesh increases the CPU overhead so significantly and suggest any possible optimizations or solutions to improve the frame rate?

Thank you!

Reproduction steps

Use BatchedMesh to render more than 10 million triangles and vertices spread across about 100,000 different geometries.

Code

Code in the project batched-mesh-performance-test

Live example

Code in the project batched-mesh-performance-test

Screenshots

No response

Version

r165

Device

No response

Browser

No response

OS

No response

lanvada avatar Jul 01 '24 08:07 lanvada

In the past few days, I’ve tried to find a solution, but without success. I’ve uploaded the relevant code and models to GitHub. The models consist of 12 million triangles and 16 million vertices. Is such a high CPU performance cost necessary for BatchedMesh? I don’t think it should be. When rendering Batched3DModel in Cesium, I didn’t encounter such issues. I believe the mesh batching in Cesium and BatchedMesh should be quite similar, right?

Additionally, I’d like to mention that after the update to version 166, the performance consumption of BatchedMesh has worsened, and the frame rate has dropped further in the same scene.

Here is the link to the code and models: batched-mesh-performance-test

lanvada avatar Jul 01 '24 08:07 lanvada

For the sake of easily understanding the issue please provide a live example that doesn't require pulling and running a separate Github project. You can host a demo page with Github pages, for example. Recordings of the Chrome performance monitor would be helpful, as well.

gkjohnson avatar Jul 01 '24 08:07 gkjohnson

Here is the demo link: https://batched-mesh-performance-test.vercel.app

The model is compressed using Draco and is approximately 44MB in size, with a total of 7.6 million triangles and 9.6 million vertices. It takes about 10 seconds to load the model. Initially, the page does not use BatchedMesh, and the frame rate on my computer is 60 FPS. You can switch to BatchedMesh by clicking the button on the bottom left, after which the frame rate drops to about 17 FPS.

lanvada avatar Jul 02 '24 04:07 lanvada

I need to provide some additional details. When exporting the glTF model from Revit, I grouped meshes with the same materials. I added three extensions: EXT_instance_features, EXT_mesh_features, and EXT_mesh_gpu_instancing. I also assigned a _FEATURE_ID_0 attribute to each vertex to differentiate between different batches, and this attribute is parsed during loading. The related code can be found in two TypeScript files in the project I previously provided: MeshFeatures.ts and GltfToolkit.ts. If you need to load the model for debugging, you might need to use the relevant code to parse the different batches. Since I have already used "_FEATURE_ID_0" to differentiate vertices of different batches, creating BatchedMesh could potentially be implemented by directly assigning values to internal properties (perhaps by renaming "_FEATURE_ID_0" to "_batchId"). However, I have not studied the BatchedMesh code in detail and have only used the BatchedMesh API to add geometries in a straightforward manner. This approach involves iterating over vertices and face indices and results in considerable additional memory allocation and copying overhead, making it inefficient.

lanvada avatar Jul 02 '24 04:07 lanvada

Thanks for producing a live link. I think this demo is too complicated to dig into, though. There are over 800 individual meshes and a mix of batched and instanced meshes as well as a lot of custom GLTF user code that make it difficult to understand what's going on. I think it would be best if we had an example that used a single batched mesh compared to a merged mesh to show any performance differences. Ideally without any external geometry file dependencies.

gkjohnson avatar Jul 02 '24 09:07 gkjohnson

Thanks for producing a live link. I think this demo is too complicated to dig into, though. There are over 800 individual meshes and a mix of batched and instanced meshes as well as a lot of custom GLTF user code that make it difficult to understand what's going on. I think it would be best if we had an example that used a single batched mesh compared to a merged mesh to show any performance differences. Ideally without any external geometry file dependencies.

Replicating this issue with a single BatchedMesh is actually quite "simple." You just need to increase the MAX_GEOMETRY_COUNT in the webgl_mesh_batch.html example to ten times its previous value. On my computer, when the geometryCount is 20,000, the CPU usage is around 20%. When the geometryCount is increased to 200,000, CPU usage rises to between 50% and 60%, yet GPU usage remains unchanged.

lanvada avatar Jul 02 '24 10:07 lanvada

Thanks for producing a live link. I think this demo is too complicated to dig into, though. There are over 800 individual meshes and a mix of batched and instanced meshes as well as a lot of custom GLTF user code that make it difficult to understand what's going on. I think it would be best if we had an example that used a single batched mesh compared to a merged mesh to show any performance differences. Ideally without any external geometry file dependencies.

Replicating this issue with a single BatchedMesh is actually quite "simple." You just need to increase the MAX_GEOMETRY_COUNT in the webgl_mesh_batch.html example to ten times its previous value. On my computer, when the geometryCount is 20,000, the CPU usage is around 20%. When the geometryCount is increased to 200,000, CPU usage rises to between 50% and 60%, yet GPU usage remains unchanged.

Turning off the sortObjects, perObjectFrustumCulled, and useCustomSort options can reduce CPU usage by about 5%.

Additionally, I've noticed that enabling only the sortObjects option decreases the frame rate from 30 to 9. Is this a normal phenomenon?

lanvada avatar Jul 02 '24 10:07 lanvada

Replicating this issue with a single BatchedMesh is actually quite "simple."

I understand, but I'm asking for a minimal reproduction case to be provided. I think it's a more than reasonable ask for a simple demonstration case, separate from user code, to be made when reporting an issue and asking maintainers to spend time investigating. I can take a closer look once this minimal repro is available.

Additionally, I've noticed that enabling only the sortObjects option decreases the frame rate from 30 to 9. Is this a normal phenomenon?

It depends on how many objects there are and where the bottleneck is. Frustum culling and sorting share a lot of the same logic, though, so enabling one or the other will have a larger apparent impact than if one is already enabled and you enable the other. If you provide a simple reproduction case it will be easier to understand what you're describing.

gkjohnson avatar Jul 02 '24 10:07 gkjohnson

I can take a closer look once this minimal repro is available.

It depends on how many objects there are and where the bottleneck is. Frustum culling and sorting share a lot of the same logic, though, so enabling one or the other will have a larger apparent impact than if one is already enabled and you enable the other. If you provide a simple reproduction case it will be easier to understand what you're describing.

Ah, the case I mentioned above is actually based on the examples/webgl_mesh_batch.html. All I did was change the MAX_GEOMETRY_COUNT to 200,000 directly in the HTML, and then set it to this number in the browser. Give me a moment to fork this project and make the change, then I'll deploy it on Vercel. Alternatively, if it's convenient for you, you could just tweak the batch count limit in this example to replicate the issues I've mentioned.

lanvada avatar Jul 02 '24 10:07 lanvada

Sorry about this—I'm not very good at English, so I often rely on ChatGPT to help me write. If there are any impolite words or phrases, please forgive me...

lanvada avatar Jul 02 '24 11:07 lanvada

Ah, the case I mentioned above is actually based on the examples/webgl_mesh_batch.html. All I did was change the MAX_GEOMETRY_COUNT to 200,000 directly in the HTML, and then set it to this number in the browser. Give me a moment to fork this project and make the change, then I'll deploy it on Vercel.

If the sort behavior is separate from the original performance question then I'd prefer to focus on one thing at a time. You can ask at the forum if you'd like help understanding the performance implications of sorting objects.

Please provide a simple example in something like jsfiddle that shows the performance differences you're observing in https://github.com/mrdoob/three.js/issues/28776#issue-2383173340 without using any custom 3d model or complex feature processing logic.

gkjohnson avatar Jul 02 '24 14:07 gkjohnson

I've set up a page where you can switch between "BatchedMesh" and "MergedMesh". Here's the link: https://batched-mesh-performance-example.vercel.app/. Switching to "MergedMesh" might take about ten seconds or so.

What I've noticed is that when using "BatchedMesh", the CPU usage significantly increases—from 15% to 40% on my computer.

I did a quick debug with Spector.js and found that enabling the sortObjects option causes the texSubImage2D function to take up too much time, leading to a drop in frame rate. However, when I turn off sortObjects, only the multiDrawElementsWEBGL function remains. I'm wondering if the increase in CPU usage is a necessary cost of using multiDrawElementsWEBGL.

Also, another issue is when there are many materials in the scene (multiple BatchedMeshes or MergedMeshes), using the "MergedMesh" method allows the GPU to perform at its best, nearing 100% utilization. But with the "BatchedMesh" method, the GPU utilization seems to be about the same as when there's only a single material—around 30%.

I'm not sure if the above situations can be optimized, or is this just the nature of the WebGL API?

lanvada avatar Jul 03 '24 10:07 lanvada

Replicating this issue with a single BatchedMesh is actually quite "simple." You just need to increase the MAX_GEOMETRY_COUNT in the webgl_mesh_batch.html example to ten times its previous value. On my computer, when the geometryCount is 20,000, the CPU usage is around 20%. When the geometryCount is increased to 200,000, CPU usage rises to between 50% and 60%, yet GPU usage remains unchanged.

If you set geometryCount to 200000, then the freezes are due to the update of _indirectTexture. This is because there are a lot of instances, and looping through all of them takes a lot of time. You can see this in the screenshot.

@gkjohnson, @lanvada

Shakhriddin avatar Jul 04 '24 07:07 Shakhriddin

I've made a simpler example that just uses javascript and cubes to understand things a bit better. This demo allows for changes between a merged geometry, batched mesh, and instanced mesh by changing the "MODE" flag at the top. It also removes any extra texture sampling logic used in BatchedMesh to remove that as a possible performance bottleneck:

jsfiddle link

I'm seeing that between the three options, BatchedMesh is the only one that suffers from this performance degradation. Instances and merged geometry both work fine otherwise. InstancedMesh and the merged geometry run at 120 fps while the BatchedMesh runs at ~30 fps on my 2021 M1 Pro Macbook.

In terms of why this is happening - my only guess is that it's due to the buffers of draw "starts" and draw "counts" that must be uploaded to the GPU for drawing every frame, which will amount to ~1.6 MB of data for 200,000 items. It's hard to say for sure, though, because this isn't showing up on the profiler. It's possible that this GPU data upload is happening asynchronously and not reflected in the profiler unlike some of the texture upload function calls.
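As a rough check of the ~1.6 MB estimate, the arithmetic can be sketched as below. This assumes one 32-bit integer per item in each of the two arrays ("starts" and "counts"). The function name `multiDrawUploadBytes` is hypothetical and used only for illustration.

```typescript
// Back-of-the-envelope size of the per-frame multi-draw arrays.
// Assumption: each item contributes one Int32 to "starts" and one Int32 to "counts".
const BYTES_PER_INT32 = Int32Array.BYTES_PER_ELEMENT; // 4 bytes

// Hypothetical helper: total bytes for both arrays at a given item count.
function multiDrawUploadBytes(itemCount: number): number {
  const startsBytes = itemCount * BYTES_PER_INT32;
  const countsBytes = itemCount * BYTES_PER_INT32;
  return startsBytes + countsBytes;
}

const uploadBytes = multiDrawUploadBytes(200000);
console.log(`${(uploadBytes / 1e6).toFixed(1)} MB per frame`); // 1.6 MB per frame
```

If that amount really is re-sent every frame, it adds up to roughly 96 MB per second at 60 FPS, which is at least consistent with the CPU-side cost suspected here.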

In the original example all of the problematic BatchedMesh sub geometry draws seem to be unique so unfortunately without something like indirect draw support (supported in WebGPU) I think this is just pushing the limits of what we can do with BatchedMesh too far.

gkjohnson avatar Jul 08 '24 09:07 gkjohnson

I've made a simpler example that just uses javascript and cubes to understand things a bit better. This demo allows for changes between a merged geometry, batched mesh, and instanced mesh by changing the "MODE" flag at the top. It also removes any extra texture sampling logic used in BatchedMesh to remove that as a possible performance bottleneck:

Thank you very much for your response and for creating a new example. Does this mean that the operation causing the increase in CPU usage on my computer could be the data upload to the GPU? Another phenomenon is that on my desktop with a dedicated GPU, the GPU utilization can reach over 80% in examples not using BatchedMesh, but with BatchedMesh, it only peaks at 30%. Could this be due to the GPU waiting for data uploads?

It's frustrating that whether it's the issue of rising CPU usage or the GPU not performing at full capacity, it seems to be a problem inherent to WebGL itself, and it appears to be unsolvable. However, you mentioned indirect draw support in WebGPU. If I switch to using WebGPURenderer, would it resolve these WebGL bottlenecks? If it's theoretically feasible, I might try switching the renderer in my current project to WebGPU.

lanvada avatar Jul 09 '24 03:07 lanvada

increase in CPU usage on my computer could be the data upload to the GPU ... Could this be due to the GPU waiting for data uploads?

If what I've suggested is the cause - then yes it would explain the higher CPU usage and less GPU usage.

If I switch to using WebGPURenderer, would it resolve these WebGL bottlenecks?

I'm not aware of the current capabilities of three.js' WebGPURenderer, so I can't say. But I expect it to eventually be supported if it's not now.

gkjohnson avatar Jul 09 '24 03:07 gkjohnson

I'm not aware of the current capabilities of three.js' WebGPURenderer, so I can't say. But I expect it to eventually be supported if it's not now.

Thank you for your insights. I'll look into the current state of three.js' WebGPURenderer and see if it supports the features needed to overcome these limitations. If it's not currently supported, I'll keep an eye on updates. Your explanation has been very helpful in clarifying the potential causes of the performance issues I'm facing.

lanvada avatar Jul 09 '24 04:07 lanvada

If I switch to using WebGPURenderer, would it resolve these WebGL bottlenecks?

I'm not aware of the current capabilities of three.js' WebGPURenderer, so I can't say. But I expect it to eventually be supported if it's not now.

I switched to the WebGPURenderer in this example batched-mesh-performance-example, but unfortunately, I found that the frame rate with BatchedMesh is even lower now...

lanvada avatar Jul 16 '24 07:07 lanvada

It could be something else. For me, the frame rate with BatchedMesh in WebGL is 8 FPS, but 22 FPS in WebGPU.

John-Simth avatar Jul 17 '24 02:07 John-Simth

It could be something else. For me, the frame rate with BatchedMesh in WebGL is 8 FPS, but 22 FPS in WebGPU.

What graphics card and operating system are you using? Also, which browser are you using? My graphics card is an RTX 2080 Super, and I'm on Windows using Chrome.

lanvada avatar Jul 18 '24 08:07 lanvada

It seems you are using batchedMesh in an incorrect way. You should create a single batchedMesh and then add meshes with the same material into it, rather than creating a separate batchedMesh for each individual mesh.

I definitely didn't make a mistake there; of course, I created only one BatchedMesh. You can also see through Spector.js that there is only one draw call. How could it be that multiple BatchedMeshes were created?

lanvada avatar Jul 18 '24 08:07 lanvada

It seems you are using batchedMesh in an incorrect way. You should create a single batchedMesh and then add meshes with the same material into it, rather than creating a separate batchedMesh for each individual mesh.

Are you referring to the "batched-mesh-performance-test" project? That example was too complex and is no longer in use. You can check this one instead: batched-mesh-performance-example. However, even in the batched-mesh-performance-test example, if you carefully read the code related to the creation of BatchedMesh, you would see that I created only one BatchedMesh for each identical material, not multiple BatchedMeshes.

lanvada avatar Jul 18 '24 08:07 lanvada

It could be something else. For me, the frame rate with BatchedMesh in WebGL is 8 FPS, but 22 FPS in WebGPU.

What graphics card and operating system are you using? Also, which browser are you using? My graphics card is an RTX 2080 Super, and I'm on Windows using Chrome.

My graphics card is an RTX 2050 4GB. I tested batched-mesh-performance-example on Edge and Chrome, and both ran at roughly 8 FPS in WebGL and 17-22 FPS in WebGPU! I'm not sure why my results differ from yours.

John-Simth avatar Jul 18 '24 09:07 John-Simth

I deleted my previous post because I misunderstood something.

I believe in the current version of BatchedMesh, multiDrawArraysInstancedWEBGL is not used. It is not used in the examples provided by @gkjohnson and @lanvada

So what is being compared in examples above:

  1. one call to multiDrawElementsWEBGL with very large starts/counts arrays (100k elements)
  2. one call to drawElementsInstanced with one geometry and a large number for primcount (=100k)
  3. one call to drawElements with one (giant) geometry

IIUC, the results are NOT actually surprising or that bad. multiDrawElementsWEBGL with large starts and counts arrays is an optimization on calling drawElements thousands of times. In practice, it means you can maintain 60fps with 40k virtual draw calls instead of 5k real draw calls (or VAO bindings).
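To make the first case concrete, here is a minimal sketch of how per-geometry ranges might be packed into the counts and offsets arrays that a single multi-draw call consumes. The `DrawRange` type and `packMultiDraw` helper are hypothetical, 32-bit indices are assumed, and this is not the BatchedMesh implementation.

```typescript
// Hypothetical description of one sub-geometry inside a shared index buffer.
interface DrawRange {
  firstIndex: number; // first index of the range, in elements
  indexCount: number; // number of indices to draw
}

// Pack ranges into the two typed arrays a multi-draw call consumes.
// Offsets are in bytes (as in drawElements); 4-byte Uint32 indices are assumed.
function packMultiDraw(ranges: DrawRange[], bytesPerIndex = 4) {
  const counts = new Int32Array(ranges.length);
  const offsets = new Int32Array(ranges.length);
  ranges.forEach((range, i) => {
    counts[i] = range.indexCount;
    offsets[i] = range.firstIndex * bytesPerIndex;
  });
  return { counts, offsets, drawCount: ranges.length };
}

const packed = packMultiDraw([
  { firstIndex: 0, indexCount: 36 },
  { firstIndex: 36, indexCount: 36 },
  { firstIndex: 72, indexCount: 6 },
]);

// With the WEBGL_multi_draw extension, these arrays would feed one call such as:
// ext.multiDrawElementsWEBGL(gl.TRIANGLES, packed.counts, 0,
//   gl.UNSIGNED_INT, packed.offsets, 0, packed.drawCount);
console.log(packed.drawCount, Array.from(packed.offsets));
```

Both arrays grow linearly with the number of sub-geometries, so at 100k entries the single call still carries 100k entries of per-draw data.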

The specific workflow of @lanvada, which is Revit CAD data, should probably not use multiDrawElementsWEBGL in this way. One single mesh is a great approach if it is static. But alternatively it should use multiDrawArraysInstancedWEBGL, since he has something like 800 unique geometries but many instances of each. We don't really have a benchmark of that, but according to these NVIDIA presentations it should work well:

https://on-demand.gputechconf.com/gtc/2013/presentations/S3032-Advanced-Scenegraph-Rendering-Pipeline.pdf
https://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-OpenGL-Scene-Rendering-Techniques.pdf

nkallen avatar Jul 24 '24 11:07 nkallen

But alternatively it should use multiDrawArraysInstancedWEBGL , since he has something like 800 unique geometries but many of instances of each.

Unless there's something odd in the way the model data is being stored this isn't the case - the original demo in the OP creates InstancedMeshes for anything with instances, and then everything in BatchedMesh is a unique geometry. That's how it appears from the current parsing logic, at least.

IIUC, the results are NOT actually surprising or that bad.

Agreed, but the surprising thing is that this upload time doesn't seem to show up at all in the measured performance metrics. It makes it difficult to understand where exactly this is coming from. But as I've mentioned, I assume it's from the starts and counts buffer uploads.

If there are practical use cases shown that multiDrawArraysInstancedWEBGL significantly improves in this respect I'm open to switching the BatchedMesh implementation. I just don't think it will address this specific case. cc @RenaudRohlinger

gkjohnson avatar Jul 24 '24 12:07 gkjohnson

I think as we get into these extreme performance cases where multiDrawArraysInstancedWEBGL might be beneficial, it's probably best for users to explicitly invoke the gl calls. It's not elegant, but it can be done using standard materials and onAfterRender to issue the draw call

nkallen avatar Jul 24 '24 13:07 nkallen

Since support for multiDrawArraysInstancedWEBGL was introduced in the WebGLRenderer by https://github.com/mrdoob/three.js/pull/28103, it is still possible to have a custom BatchedMesh class that supports batched instanced draw calls by using the object._multiDrawInstances property (which I'm doing for a project).

Furthermore, using the same object._multiDrawInstances approach, I recently submitted two pull requests to reinstate multiDrawArraysInstancedWEBGL support in the WebGL Backend and to implement compatibility for the WebGPURenderer: https://github.com/mrdoob/three.js/pull/28753 https://github.com/mrdoob/three.js/pull/28759

So it's a bit tricky but as long as we keep the _multiDrawInstances property we can still use multiDrawArraysInstancedWEBGL with both renderers without gl calls.

RenaudRohlinger avatar Jul 24 '24 13:07 RenaudRohlinger

is still possible to have a custom BatchedMesh class that supports batch instanced draw calls by using the object._multiDrawInstances property (which I'm doing for a project).

Of course, but the goal here is to enable this without end-users having to write custom shaders to take advantage of the functionality. It's been suggested multiple times that this would happen, but it would be nice if someone shared a public demonstration of how multiDrawArraysInstancedWEBGL and _multiDrawInstances are being used in practice so we can discuss the pros / cons and how / if it should be used in a three.js class.

The big question for me is how you calculate the item index using the gl_DrawID and gl_InstanceID to sample from a tightly packed data texture / buffer (ie for the matrix transform texture) when you're drawing sets of instances with different counts.

gkjohnson avatar Jul 24 '24 13:07 gkjohnson

The big question for me is how you calculate the item index using the gl_DrawID and gl_InstanceID to sample from a tightly packed data texture / buffer (ie for the matrix transform texture) when you're drawing sets of instances with different counts.

The way that I am thinking about doing it is looking up an offset and a count in one texture based on gl_DrawID, and then addressing a second texture with offset + count * gl_InstanceID * sizeof_transform.
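One possible reading of this indexing idea, written as CPU-side TypeScript instead of shader code, is sketched below. It assumes the per-draw offset is an exclusive prefix sum of the instance counts and that the item index is simply that offset plus the instance ID. This is an assumption for illustration and not necessarily the exact scheme described above. The helper names `buildDrawOffsets` and `itemIndex` are hypothetical.

```typescript
// Exclusive prefix sum: offsets[d] is the index of the first item drawn by draw d.
// Assumption: items are tightly packed, draw after draw, in one data texture/buffer.
function buildDrawOffsets(instanceCounts: number[]): Uint32Array {
  const offsets = new Uint32Array(instanceCounts.length);
  let running = 0;
  instanceCounts.forEach((count, d) => {
    offsets[d] = running;
    running += count;
  });
  return offsets;
}

// CPU-side equivalent of what a vertex shader could compute from
// gl_DrawID and gl_InstanceID after fetching the per-draw offset.
function itemIndex(offsets: Uint32Array, drawID: number, instanceID: number): number {
  return offsets[drawID] + instanceID;
}

// Three draws with 3, 1 and 4 instances respectively.
const drawOffsets = buildDrawOffsets([3, 1, 4]);
console.log(Array.from(drawOffsets)); // [0, 3, 4]
console.log(itemIndex(drawOffsets, 2, 1)); // 5
```

With this layout, draws with different instance counts still address one tightly packed array, at the cost of one extra lookup per vertex to fetch the draw's offset.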

I'm using this (experimental) translation of OffsetAllocator, based on the work of Sebastian Aaltonen, to manage buffers explicitly; it has extremely high occupancy. I'm not proposing to add this to three.js, though.

https://gist.github.com/nkallen/f4ed889dc98e9a9da7283a01e3308450

nkallen avatar Jul 24 '24 14:07 nkallen

But I should also note: it is possible to put everything in the same buffer and just issue thousands of calls to drawElementsInstanced. It's extremely fast because you do not need to switch the VAO. For example, in the code below, note that I am just incrementing the offset of the vertexAttribPointer. I have benchmarked this on Apple and AMD GPUs and it can do 5k calls to drawElementsInstanced in < 1ms.

    onAfterRender(...) {
        // `renderer` is the first argument three.js passes to onAfterRender
        // (the parameter list is elided here).
        const { _multiDrawStarts_, _multiDrawCounts_, _multiDrawCount_ } = this;
        const gl = renderer.getContext() as WebGL2RenderingContext;

        // Bind the shared instance buffer once; no VAO switch is needed per draw.
        gl.bindBuffer(gl.ARRAY_BUFFER, this.geometry.attributes.instanceStart.data.buffer);

        for (let i = 0; i < _multiDrawCount_; i++) {
            const start = _multiDrawStarts_[i];
            const primcount = _multiDrawCounts_[i];

            // Re-point attributes 3 and 4 (two vec3 floats, 24-byte stride) at this draw's data.
            const offset = start + i * 24;
            gl.vertexAttribPointer(3, 3, gl.FLOAT, false, 24, offset);
            gl.vertexAttribPointer(4, 3, gl.FLOAT, false, 24, offset + 12);

            gl.drawElementsInstanced(gl.TRIANGLES, 18, gl.UNSIGNED_SHORT, 0, primcount / 6);
        }
    }

nkallen avatar Jul 24 '24 14:07 nkallen