cgal Use Abseil hash maps instead of std unordered containers for a huge speedup

Issue Details

Inspired by one of the cppcon talks I listen to during my free time, I have quickly tried the hash maps from Abseil instead of std::unordered_map.

My try was a modification of copy_face_graph:

diff --git a/BGL/include/CGAL/boost/graph/copy_face_graph.h b/BGL/include/CGAL/boost/graph/copy_face_graph.h
index 9a2cb628825..fd0b602f284 100644
--- a/BGL/include/CGAL/boost/graph/copy_face_graph.h
+++ b/BGL/include/CGAL/boost/graph/copy_face_graph.h
@@ -35,6 +35,7 @@
 #include <boost/unordered_map.hpp>
 #include <boost/utility/enable_if.hpp>
 #include <boost/function_output_iterator.hpp>
+#include <absl/container/flat_hash_map.h>
 
 namespace CGAL {
 
@@ -189,7 +190,7 @@ void copy_face_graph(const SourceMesh& sm, TargetMesh& tm,
   typedef typename boost::graph_traits<SourceMesh>::halfedge_descriptor sm_halfedge_descriptor;
   typedef typename boost::graph_traits<TargetMesh>::halfedge_descriptor tm_halfedge_descriptor;
 
-  boost::unordered_map<sm_halfedge_descriptor,
+  absl::flat_hash_map<sm_halfedge_descriptor,
                        tm_halfedge_descriptor> hash_map(num_halfedges(sm));
   copy_face_graph_impl(sm, tm,
                        boost::make_assoc_property_map(hash_map),

The results are stuning: now 3.2 seconds instead of 7.1 seconds

Source Code

The .cpp code: https://gist.github.com/2e9ea602a7b1e46cd34773611acac3c6

Here is the local patch to CMakeLists.txt (a non-clean hack):

diff --git a/BGL/examples/BGL_polyhedron_3/CMakeLists.txt b/BGL/examples/BGL_polyhedron_3/CMakeLists.txt
index ac3bd5256f2..a821195ff9c 100644
--- a/BGL/examples/BGL_polyhedron_3/CMakeLists.txt
+++ b/BGL/examples/BGL_polyhedron_3/CMakeLists.txt
@@ -67,6 +67,11 @@ create_single_source_cgal_program( "transform_iterator.cpp" )
 
 create_single_source_cgal_program( "copy_polyhedron.cpp" )
 
+add_subdirectory(googletest)
+add_subdirectory(abseil-cpp)
+
+target_link_libraries(copy_polyhedron PRIVATE absl::flat_hash_map)
+
 if(OpenMesh_FOUND)
   target_link_libraries( copy_polyhedron PRIVATE ${OPENMESH_LIBRARIES} )
 endif()

I had to checkout abseil and gtest:

$ git clone [email protected]:abseil/abseil-cpp.git
$ git clone [email protected]:google/googletest.git

I ran that modified copy_polyhedron.cpp on demo/Polyhedron/data/man.off, and as I said the speedup is huge: 3.2s after the patch to <CGAL/boost/graph/copy_face_graph.h>, instead of 7.1s.

Environment

Operating system (Windows/Mac/Linux, 32/64 bits):
Compiler: g++ 9.1.1 with -O3
Release or debug mode: Release
CGAL version: CGAL-5.0-dev, master version 44e4d1526c506a9de2566be1eb0dfdd44b04d8e5
Boost version: 1.69.0
Other libraries versions if used (Eigen, TBB, etc.):

Jul 02 '19 13:07 lrineau

https://github.com/abseil/abseil-cpp license is Apache License 2.0. It is very permissive.

Jul 02 '19 13:07 lrineau

https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/ had some rather detailed comparisons recently, which seem to yield similar observations.

Jul 02 '19 13:07 MaelRL

Shall @maxGimeno propose a global plan with the introduction of CGAL::unordered_map + the cmake machinery to get it everywhere?

Jul 04 '19 07:07 sloriot

abseil is rather inconvenient to use (without bazel or cmake): you need to link about 30 libraries to get a trivial example working :disappointed: But yes, anything that spends a lot of time in hashtable operations should be benchmarked with other implementations. Note that in one extreme case (unordered_set was taking > 64G in this application) I had better results with the standard container than with abseil.

Jul 04 '19 07:07 mglisse

@mglisse commented on Jul 4, 2019, 9:14 AM GMT+2:

abseil is rather inconvenient to use (without bazel or cmake): you need to link about 30 libraries to get a trivial example working :disappointed:

Hopefully CGAL uses CMake.

add_subdirectory(googletest)
add_subdirectory(abseil-cpp)

target_link_libraries(copy_polyhedron PRIVATE absl::flat_hash_map)

But yes, anything that spends a lot of time in hashtable operations should be benchmarked with other implementations. Note that in one extreme case (unordered_set was taking > 64G in this application) I had better results with the standard container than with abseil.

Our opting for Abseil will be conditional to a macro. With it we will easily be able to build benchmarks. Even users should be made aware of that, so that they can tune.

Jul 04 '19 07:07 lrineau

Hopefully CGAL uses CMake.

As a header-only library, it doesn't matter what CGAL uses (if it uses anything at all). Providing CGALConfig.cmake does help. But users of CGAL don't all use cmake.

Jul 04 '19 07:07 mglisse

@mglisse commented on Jul 4, 2019, 9:38 AM GMT+2:

Hopefully CGAL uses CMake.

As a header-only library, it doesn't matter what CGAL uses (if it uses anything at all). Providing CGALConfig.cmake does help. But users of CGAL don't all use cmake.

We do not care. Abseil will be an optional dependency of CGAL, and our CMake users will get it almost for free. As for the others... they can do experiments and benchmarks using CMake, and them make the effort in their build system if the speedup is compelling.

Actually, I do not get your point. Do you have a suggestion about an alternative?

Jul 04 '19 07:07 lrineau

Just my two cents, it would be good to see a higher level CGAL algorithm than copy_face_graph() as a motivation.

Jul 04 '19 08:07 afabri

@mglisse commented on Jul 4, 2019, 9:14 AM GMT+2:

abseil is rather inconvenient to use (without bazel or cmake): you need to link about 30 libraries to get a trivial example working :disappointed:

Actually, can you elaborate? Because on my machine abseil seems to require only the pthread support, and no other dependency.

Jul 04 '19 09:07 lrineau

@afabri it concerns everything related to unordered maps, not only copy_face_graph().

Jul 04 '19 09:07 maxGimeno

The questions is if it has measurable impact on Mesh_3, Constrained_triangulation_2, Arrangement_2......

Jul 04 '19 10:07 afabri

We will probably have to bench to be sure. But I have seen traces of unordered maps at least in Mesh_3 and Arrangements_on_surface_2, so there will probbaly be an impact.

Jul 04 '19 10:07 maxGimeno

For at least one user of us, CGAL::PMP::polygon_soup_to_polygon_mesh was a big consumer of time of their process. And it is all about maps.

Jul 04 '19 10:07 lrineau

Actually, can you elaborate? Because on my machine abseil seems to require only the pthread support, and no other dependency.

I didn't mean external dependencies. Using a trivial program:

#include <absl/container/flat_hash_set.h>
typedef absl::flat_hash_set<int> S;
int main(){ S s; }

Complete the command line with a minimal number of -l flags: g++ test.cc -I .../include -L .../lib Here I need -labsl_hashtablez_sampler -labsl_synchronization -labsl_time -labsl_base -labsl_spinlock_wait -labsl_time_zone -labsl_int128 -labsl_symbolize -labsl_stacktrace -labsl_malloc_internal -labsl_demangle_internal -labsl_debugging_internal -labsl_dynamic_annotations -pthread ... Why they didn't at least create a thin archive or linker script so that -labsl would work is an interesting question.

Jul 04 '19 12:07 mglisse

Actually, I do not get your point.

Just that abseil is a bit inconvenient to use, not as easy as using more Boost stuff for instance (even the parts that require linking with a library).

I don't know yet if you plan to test for the presence of abseil or to include a copy of it.

Do you have a suggestion about an alternative?

Not particularly. Mael posted a link to a comparison between many implementations, some of which are header-only. They don't all have Google maintaining them though, and you may wish to use more stuff from abseil later.

Jul 04 '19 12:07 mglisse

I would love to use the single-header implementation robin_hood.h, that is said to be faster by those benchmarks, but it is not compatible, for the moment, with our use of -frounding-math:

robin_hood.h:1901:59: error: ‘(8.0e+1 / 1.0e+2)’ is not a constant expression
 1901 |         static constexpr double factor = MaxLoadFactor100 / 100.0;
      |                                          ~~~~~~~~~~~~~~~~~^~~~~~~

Jul 04 '19 13:07 lrineau

@lrineau commented on Jul 4, 2019, 3:05 PM GMT+2:

I would love to use the single-header implementation robin_hood.h, that is said to be faster by those benchmarks, but it is not compatible, for the moment, with our use of -frounding-math:
robin_hood.h:1901:59: error: ‘(8.0e+1 / 1.0e+2)’ is not a constant expression
 1901 |         static constexpr double factor = MaxLoadFactor100 / 100.0;
      |                                          ~~~~~~~~~~~~~~~~~^~~~~~~

I have artificially removed -frounding-math, to compile the test, and it is slower than using abseil: 3.8 seconds instead of 3.2 using abseil.

Jul 04 '19 13:07 lrineau

our use of -frounding-math

I don't remember if you tried recently to remove it, to see what breaks. But that's not the right place to discuss it.

I guess constexpr is a context where gcc should ignore rounding modes...

it is slower than using abseil

The list also includes phmap, which is described as essentially a headerified version of abseil.

Jul 04 '19 15:07 mglisse

Hi @lrineau, I've been experimenting with replacing the map in Unique_hash_map with std::unordered_map but initial results showed this to be slower than the chained_map implementation. Probably I did something wrong or I am not understanding the use case of Unique_hash_map correctly:

https://gist.github.com/GilesBathgate/3fb78d3949ca52f4d1a0d14dcc66ae23

Would this benefit from the abseil version of map?

Feb 08 '21 12:02 GilesBathgate

I've been experimenting with replacing the map in Unique_hash_map with std::unordered_map but initial results showed this to be slower than the chained_map implementation.

The whole point of this issue is that std::unordered_map is slow, what did you expect?

Feb 08 '21 13:02 mglisse

@mglisse :joy: Yes, But my reason for investigating it (prior to coming across this PR) was that in some cases, I've found anything other than chained_map to be faster, especially when the number of elements is small because chained_map doesn't lazily initialise minimum of 512 buckets.

For example here: https://github.com/CGAL/cgal/blob/74af3e7d33e7354e4090a1b17941a3f0c372ca8d/Nef_S2/include/CGAL/Nef_S2/SM_const_decorator.h#L392

where visited only contains approximately 6 to 10 elements. The chained_map construction dominates the CPU usage, making O(1) lookup irrelevant probably a std::vector with O(n) would be faster in this scenario. Another example where Unique_hash_map construction is slow is #5143

Feb 08 '21 13:02 GilesBathgate

Hi @lrineau, I've been experimenting with replacing the map in Unique_hash_map with std::unordered_map but initial results showed this to be slower than the chained_map implementation. Probably I did something wrong or I am not understanding the use case of Unique_hash_map correctly:

gist.github.com/GilesBathgate/3fb78d3949ca52f4d1a0d14dcc66ae23_ (too large to embed)_

Would this benefit from the abseil version of map?

Hi Giles, I do not have the time to investigate on your issue. I hope others will, though.

The best way to know if the Abseil hash maps can benefit your code is to try, anyway.

Feb 08 '21 14:02 lrineau

@lrineau Thanks I've given it a try

https://gist.github.com/GilesBathgate/3fb78d3949ca52f4d1a0d14dcc66ae23

@mglisse Abseil was 20% slower than std:unordered_map, for me though so I'm confused now.

Feb 08 '21 14:02 GilesBathgate

@lrineau @mglisse I am getting good results with robin_hood.h I use robin_hood::unordered_node_map (The issue with -frounding math was resolved https://github.com/martinus/robin-hood-hashing/issues/51)

Updated gist: https://gist.github.com/GilesBathgate/3fb78d3949ca52f4d1a0d14dcc66ae23

Feb 11 '21 12:02 GilesBathgate

where visited only contains approximately 6 to 10 elements

Did you try a flat_map for this one? (I know nothing about that code, and in particular not if the size of this container can ever be large) Such small maps are not a typical use case for hash maps, so many are probably not optimized for that case. The code you point to seems suboptimal, it calls visited[v] twice in a row, changing the code a bit, a (hash_)set would make more sense than a map (it probably used a map because there was no set in cgal), checking visited.insert(v).second.

Nice if robin hood works well, although if you want to change Unique_hash_map, which is used a lot in CGAL, and not just use a different map in one specific place, reviewers are likely to ask for various benchmarks, to make sure it doesn't regress some other use cases (bigger table, more expensive key, etc). But it seems quite possible that it is indeed faster all around.

Feb 11 '21 13:02 mglisse

@mglisse Yes, I tried a boost::flat_set and it was much faster I'll add a comment about that to #5379

Regarding Unique_hash_map if I understand correctly it's optimised for when the keys are known to be unique and therefore no collisions. That said robin_hood does seem to out-perform chained_map in my tests, especially in the object construction. Either way, I've found some that some cases require robin_hood::unordered_node_map instead of robin_hood::unordered_flat_map So it doesn't get the cache friendliness benefits. I was thinking of asking @martinus if there is a way to use robin_hood::* in a way that is optimal for unique hash maps.

Feb 11 '21 14:02 GilesBathgate

If you know the maximum size of the hashmap, it is beneficial to call reserve() with that amount. That way you only get a single allocation. Not much else you can do as far as I see on a quick glance.

Also, as @mglisse said, the linked code looks suboptimal. E.g. unordered_flat_set might help, and I think the std::list could use a custom allocator so it doesn't malloc & free each time you put an element into it. A while ago I've created an allocator for that: https://gist.github.com/martinus/ee4cc9d82d4610647180301c47b030a2 see e.g. this benchmark: https://quick-bench.com/q/BLW3OTlodJ2c3TmQGJUPT_TAdGA

Feb 11 '21 14:02 martinus

the std::list could

... be wrapped in a std::queue, which could then use its default backend deque :wink: (I didn't profile that code, no idea what part is relevant to optimize)

Feb 11 '21 15:02 mglisse

If you know the maximum size of the hashmap, it is beneficial to call reserve() with that amount.

So this could be done:

-  CGAL::Unique_hash_map<SVertex_const_iterator,bool> visited(false);
+  robin_hood::unordered_flat_set<SVertex_const_iterator,Handle_hash_function> visited;
+  visited.reserve(number_of_svertices());

but unfortunately:

https://github.com/CGAL/cgal/blob/74af3e7d33e7354e4090a1b17941a3f0c372ca8d/Nef_3/include/CGAL/Nef_3/Vertex.h#L214-L219

:disappointed:

Feb 11 '21 15:02 GilesBathgate

@mglisse Presumably one can use return CGAL::iterator_distance(svertices_begin(), svertices_end()); (or just std::distance) in number_of_svertices?

Feb 11 '21 15:02 GilesBathgate