
Improve mesh loading

Open Piezoid opened this issue 2 years ago • 4 comments

This is a work-in-progress re-implementation of vertex merging during mesh loading, intended to be more precise and more efficient. A 3DBenchy no longer triggers the warning Mesh has overlapping faces!.

Background: Mesh::addFace looks up vertices in a spatial hash map. This recovers the connectivity while accommodating inaccuracies in STLs. Vertices closer than 0.03 mm are merged together.

Previously, a vertex was inserted into a 0.03 mm wide cubic bucket. During lookup, only that single bucket is searched for a neighboring vertex. Vertices close to the side of a cube therefore miss merge candidates from neighboring cubes. This could lead to unexpected behavior and exacerbate the order sensitivity of the greedy merge algorithm.

This implementation uses half-open 0.064 mm wide cubic buckets for merging vertices closer than 0.032 mm. It is optimized for queries: on insertion, a vertex is added to all 8 buckets overlapping the 0.032 mm radius sphere centered on it. A query then retrieves all candidate vertices from a single bucket.
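
Schematically, the insert-wide/query-narrow scheme looks like this (simplified sketch with hypothetical names; std::unordered_map stands in for the specialized hash map described below, and coordinates are assumed to be non-negative microns):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using coord_t = int64_t;             // CuraEngine-style integer micron coordinates
struct Point3 { coord_t x, y, z; };

constexpr coord_t cell_size = 64;    // 0.064 mm bucket width
constexpr coord_t merge_radius = 32; // 0.032 mm merge distance

// Pack the three cell indices of a point into one key. Assumes non-negative
// coordinates and ~21 usable bits per axis; the PR hashes Morton codes instead.
static uint64_t cellKey(coord_t x, coord_t y, coord_t z)
{
    auto c = [](coord_t v) { return static_cast<uint64_t>(v / cell_size) & 0x1FFFFF; };
    return (c(x) << 42) | (c(y) << 21) | c(z);
}

// One vertex-index list per cell; the PR uses a flat open-addressed map instead.
std::unordered_map<uint64_t, std::vector<std::size_t>> grid;

void insertVertex(const Point3& p, std::size_t vertex_idx)
{
    // p +/- 32 µm spans exactly two 64 µm cells per axis, i.e. 8 cells in total.
    for (coord_t dx : { -merge_radius, merge_radius })
        for (coord_t dy : { -merge_radius, merge_radius })
            for (coord_t dz : { -merge_radius, merge_radius })
                grid[cellKey(p.x + dx, p.y + dy, p.z + dz)].push_back(vertex_idx);
}

const std::vector<std::size_t>* queryCandidates(const Point3& p)
{
    // All candidates within 32 µm of p live in the single cell containing p.
    auto it = grid.find(cellKey(p.x, p.y, p.z));
    return it == grid.end() ? nullptr : &it->second;
}
```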

For this purpose I wrote a specialized hash map. Iterating over a bucket is accomplished by iteratively probing the hash map. This produces some collisions during lookup, but they are filtered out by the distance checks anyway. Using std::unordered_map, as in the current implementation, requires an indirection to dynamically sized buckets and can't combine collision checks with distance checks.

The hash map implementation is minimalist but supports rehashing when the number of vertices can't be estimated beforehand. Some cache locality is recovered from the inherent spatial locality of meshes (see the sketch after this list):

  • Point coordinates are hashed to their Morton codes,
  • The probing strategy is copied from Python's dict,
  • Load factor is 2/3, growth factor is 4. Map sizes are powers of two, starting from 2^10.
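
For the first two points, a rough sketch using the standard 21-bit-per-axis bit spreading and the CPython dict recurrence (the exact code in the PR may differ):

```cpp
#include <cstddef>
#include <cstdint>

// Spread the low 21 bits of v so that two zero bits separate consecutive bits
// (the usual 64-bit "split by 3" magic constants).
static uint64_t spreadBits(uint64_t v)
{
    v &= 0x1FFFFF;
    v = (v | (v << 32)) & 0x1F00000000FFFF;
    v = (v | (v << 16)) & 0x1F0000FF0000FF;
    v = (v | (v << 8))  & 0x100F00F00F00F00F;
    v = (v | (v << 4))  & 0x10C30C30C30C30C3;
    v = (v | (v << 2))  & 0x1249249249249249;
    return v;
}

// Interleave the three cell indices: spatially close cells get close hash values.
static uint64_t mortonCode(uint64_t cx, uint64_t cy, uint64_t cz)
{
    return spreadBits(cx) | (spreadBits(cy) << 1) | (spreadBits(cz) << 2);
}

// CPython-dict-style probe sequence over a power-of-two table of size mask + 1:
// i = (5 * i + perturb + 1) mod size, then perturb >>= 5; every slot is eventually visited.
static std::size_t nextSlot(std::size_t i, uint64_t& perturb, std::size_t mask)
{
    i = (5 * i + static_cast<std::size_t>(perturb) + 1) & mask;
    perturb >>= 5;
    return i;
}
```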

This was a fun exercise, but I'd like to know if there is an interest in this before polishing it. I'm in the process of benchmarking and experimenting with different strategies and constants (I will update with the results). But overall, the one above seems to perform the best for its simplicity.

Piezoid avatar Dec 04 '21 21:12 Piezoid

This was a fun exercise, but I'd like to know if there is an interest in this before polishing it.

I always like optimising that sort of stuff too :)

Currently we don't really have an issue with the performance or correctness of this piece of code. Do you have a case where the vertex look-up goes wrong, or maybe one where it takes a significant amount of time compared to the rest? That last bit is obviously subjective, but I'd like to know what drove you to look in here.

When deciding to merge something, the main factors we tend to consider are:

  • Does it solve an issue that users face? (maybe with the discretisation of those buckets?)
  • What is the impact on performance? (positive impact supposedly, but as far as I know it would be indistinguishable since mesh loading is like 0.1s most of the time)
  • Is it useable for the user, UX-wise? (irrelevant in this case)
  • Is the code maintainable? (this does add some complexity, so that's a trade-off)

So if this PR doesn't really solve any correctness issue and doesn't solve any performance issue, I'd say that the extra complexity and risk of new bugs is not worth it. But you started this for a reason I think. It would be good to show us what it fixes :) Or else reduce the complexity of the code we'd have to maintain.

Previously, a vertex was inserted into a 0.03 mm wide cubic bucket. During lookup, only that single bucket is searched for a neighboring vertex. Vertices close to the side of a cube therefore miss merge candidates from neighboring cubes. This could lead to unexpected behavior and exacerbate the order sensitivity of the greedy merge algorithm.

That would indeed be wrong. In other places (SparsePointGrid) we look explicitly into neighbouring cells as well and then do a range check on the contained points, to get a proper pre-filter by the grid.
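
Schematically, that pattern is something like this (illustrative only, with hypothetical names, not SparsePointGrid's actual interface):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using coord_t = int64_t;
struct Point3 { coord_t x, y, z; };

constexpr coord_t cell_size = 30;  // cells as wide as the merge distance (0.03 mm)

static uint64_t cellKey(coord_t cx, coord_t cy, coord_t cz)
{
    // Pack the cell indices; assumes non-negative coordinates, ~21 bits per axis.
    return (static_cast<uint64_t>(cx) & 0x1FFFFF) << 42
         | (static_cast<uint64_t>(cy) & 0x1FFFFF) << 21
         | (static_cast<uint64_t>(cz) & 0x1FFFFF);
}

std::unordered_map<uint64_t, std::vector<Point3>> cells;  // each point stored once, in its own cell

std::vector<Point3> getNearby(const Point3& p, coord_t radius)
{
    std::vector<Point3> result;
    const coord_t cx = p.x / cell_size, cy = p.y / cell_size, cz = p.z / cell_size;
    for (coord_t dx = -1; dx <= 1; ++dx)        // visit the 3x3x3 block of neighbouring cells
        for (coord_t dy = -1; dy <= 1; ++dy)
            for (coord_t dz = -1; dz <= 1; ++dz)
            {
                auto it = cells.find(cellKey(cx + dx, cy + dy, cz + dz));
                if (it == cells.end())
                    continue;
                for (const Point3& q : it->second)  // exact range check after the grid pre-filter
                {
                    const coord_t ex = q.x - p.x, ey = q.y - p.y, ez = q.z - p.z;
                    if (ex * ex + ey * ey + ez * ez <= radius * radius)
                        result.push_back(q);
                }
            }
    return result;
}
```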

Also, please look into our code style at the Meta repository. In particular:

  • Each class gets its own file, so HashMap3D would have to be moved to a separate file.
  • Opening brackets go on a new line. Probably this is just your style that you were planning to clean up at the end, though for the commit history it would be better to get it right from the start.

And I think you can't use that Boost small vector. We currently only require the Polygon and Voronoi packages. The CI is complaining about it anyway.

Ghostkeeper avatar Dec 16 '21 10:12 Ghostkeeper

Currently we don't really have an issue with the performance or correctness of this piece of code. Do you have a case where the vertex look-up goes wrong, or maybe one where it takes a significant amount of time compared to the rest? That last bit is obviously subjective, but I'd like to know what drove you to look in here.

For now, all I can say with certainty is that this silences the overlapping face warning on a 3DBenchy model. However, I think this model is broken, and I have to investigate in which ways and try other broken meshes.

So if this PR doesn't really solve any correctness issue and doesn't solve any performance issue, I'd say that the extra complexity and risk of new bugs is not worth it. But you started this for a reason I think. It would be good to show us what it fixes :) Or else reduce the complexity of the code we'd have to maintain.

I was hesitant myself when evaluating the costs/benefits of this. That is why I wanted to show code and benchmarks before going further down the path.

Some very large meshes take more than a second to load. Memory usage could also be a concern. I agree that going from mostly instantaneous to more instantaneous is not a major improvement. To illustrate this, the benchmarks will include improvement figures relative to both the load time and the overall slice time.

As to why I looked into this: low-hanging optimization opportunities are often found in the unparallelized code paths.

That would indeed be wrong. In other places (SparsePointGrid) we look explicitly into neighbouring cells as well and then do a range check on the contained points, to get a proper pre-filter by the grid.

Actually, I'm starting to think that the original code was (mostly) right by not trying too hard to merge vertices. I've read the code of a few mesh loaders/indexers (like erizo) and most of them only merge vertices having exactly the same coordinates. This works under the assumption that the mesh exporter serialized an identical (bit)string in all occurrences of the same vertex.

Thus, maybe the implementation should only perform strict merging, and then rely on the stitching of 2D polygons to close malformed paths. This would give the HashMap3D implementation another ~8x improvement in memory usage and CPU time.
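
A strict-merging variant would boil down to something like this (sketch with hypothetical names):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using coord_t = int64_t;
struct Point3
{
    coord_t x, y, z;
    bool operator==(const Point3& o) const { return x == o.x && y == o.y && z == o.z; }
};

struct Point3Hash
{
    std::size_t operator()(const Point3& p) const
    {
        // Any reasonable mixing of the three coordinates will do here.
        std::size_t h = static_cast<std::size_t>(p.x) * 73856093u;
        h ^= static_cast<std::size_t>(p.y) * 19349663u;
        h ^= static_cast<std::size_t>(p.z) * 83492791u;
        return h;
    }
};

std::vector<Point3> vertices;
std::unordered_map<Point3, std::size_t, Point3Hash> vertex_index;  // exact-match lookup only

std::size_t findOrAddVertex(const Point3& p)
{
    // Deduplicate only bit-identical coordinates; malformed seams are left for
    // the 2D polygon stitching to close later.
    auto [it, inserted] = vertex_index.try_emplace(p, vertices.size());
    if (inserted)
        vertices.push_back(p);
    return it->second;
}
```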

However, merging vertices in a way that is locally dependent on the index grid, as done in Cura's current implementation, is wrong in subtle ways. I'd expect some sort of aliasing artifacts when a model undergoes decimation as vertices are merged. But as you said, it was never obvious this could be a problem by looking at sliced outputs. I could try to lower the grid resolution for demonstration purposes, but that would be a bit too artificial.

I need some time to test these hypotheses. That's why I've delayed the benchmarks.

Also, please look into our code style at the Meta repository. In particular:

Right. I'm aware of the guidelines. I didn't spend the time configuring my editor for them. I generally clean that up after the fact while reviewing patches.

And I think you can't use that Boost small vector. We currently only require the Polygon and Voronoi packages. The CI is complaining about it anyway.

Yes, I wasn't expecting this, since Boost came as a monolith on my platform. It is only a single commit that needs to be dropped, though.

Piezoid avatar Dec 16 '21 16:12 Piezoid

Thus, maybe the implementation should only perform strict merging, and then rely on the stitching of 2D polygons to close malformed paths. This would give the HashMap3D implementation another ~8x improvement in memory usage and CPU time.

I agree with this. Since we're using it for debugging, it is nice if the mesh loading in CuraEngine works the same way as the mesh loading in Cura's front-end, which uses numpy-stl and does no deduplication. Improved performance and simplicity are a bonus then.

Ghostkeeper avatar Dec 27 '21 12:12 Ghostkeeper

I've benchmarked the mesh loading code with a large STL (a 383 MB binary STL with 5.4M vertices and 8M faces). The mesh is loaded through libArcus. To skip the rest of the slicing process, an exit(3) is added at the end of ArcusCommunication::Private::readMeshGroupMessage. For each test, /usr/bin/time -v ./CuraEngine connect 127.0.0.1:49675 is run three times, taking the run with the best wall-clock time.

master ran in 8.35s wall-clock (7s user, 1.1s system) with a maximum RSS of 1.21 GB and 755K page faults.

The present code in this PR, with vertex merging, took 5.84s (4.53s user, 0.91s system) with a maximum RSS of 1.09 GB and 606K page faults. This is a 30% reduction in run time compared to master.

Removing the merging of close vertices, the run time drops to 4.24s (3.29s user, 0.77s system) with a maximum RSS of 1.1 GB and 512K page faults. This adds a further 27% reduction (49% compared to master).

I noticed that the maximum RSS was not affected by skipping deduplication. This is because the peak is reached during the shrink_to_fit() calls on the vertices and faces vectors, which double the memory occupation during the copy. I added guards that only perform the shrinking when more than 1/3 of the space is wasted by over-allocation. No shrinking happens when using libArcus, which provides the number of faces beforehand. This lowers the runtime to 3.96s (3.20s user, 0.47s system) with a peak RSS of 824 MB and 388K page faults. The size of the allocations inside Mesh is then comparable to those happening inside libArcus/Protobuf.
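
The guard is essentially the following (simplified sketch, not the exact code in the change set):

```cpp
#include <vector>

// Shrink only when more than a third of the capacity is wasted (capacity > 1.5 * size),
// since shrink_to_fit on a huge vector temporarily holds both the old and the new buffer.
template <typename T>
void shrinkIfWasteful(std::vector<T>& v)
{
    if (v.capacity() > v.size() + v.size() / 2)
        v.shrink_to_fit();
}
```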

This was the best improvement I could get by tuning this change set: a 53% reduction in mesh loading time compared to master.

The tests above were performed using boost::container::small_vector to store the connected face ids in instances of MeshVertex. Using std::vector instead increases the runtime to 5.02s (+27%, 4.20s user, 0.55s system) with a peak RSS of 886 MB and 542K page faults. Memory usage and page faults increase a bit from fragmentation.
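
The difference being measured is essentially this (illustrative sketch; the inline capacity of 8 is an assumed value, not necessarily the one used in the PR):

```cpp
#include <cstdint>
#include <vector>
#include <boost/container/small_vector.hpp>

struct MeshVertexWithSmallVector
{
    // The first few connected-face indices live inline in the vertex itself;
    // only vertices with more than 8 connected faces spill to the heap.
    boost::container::small_vector<uint32_t, 8> connected_faces;
};

struct MeshVertexWithVector
{
    std::vector<uint32_t> connected_faces;  // one heap allocation per non-isolated vertex
};
```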

Running the full slicing process on the test STL with the fdmprinter's Normal - 0.15 mm profile took 14.9s (137.6s user, 2.17s system) with a peak RSS of 1.4 GB and 844K page faults, resulting in 292 layers. This is with the optimized mesh loading and small_vector. Doing the same with master took 20.8s (143.2s user, 2.35s system) with a peak RSS of 1.93 GB and 1.53M page faults. Again, I took the best of 3 runs, but this time the results were much more volatile because of the multithreaded scheduling (on an old NUMA machine, 2x 6C12T). Still, I wasn't expecting a 27% improvement; something could be off...

Here's a screenshot of what it went through :sweat_smile:
[image]

Piezoid avatar Dec 28 '21 17:12 Piezoid