parallel-hashmap icon indicating copy to clipboard operation
parallel-hashmap copied to clipboard

Custom pointer type support ?

Open kleunen opened this issue 4 years ago • 15 comments

It seems custom pointer types are not supported with the parallel hashmap: https://github.com/greg7mdp/parallel-hashmap/blob/master/parallel_hashmap/phmap.h#L843-L846

Would this be difficult to add ? Because I would like to apply the parallel hashmap to a project to process openstreetmap data. This involves loading several billions of geometrical data. We do this by loading the the data into a boost unordered_map backed by a boost interprocess mmap file. This way, the data is hashed but backed by a file on filesystem. This is required, because converting the planet actually involved processing 60G of compressed data.

The loading process is now completely single threaded, but ideally, we would like to load the node/ways/relations store with openstreetmap from the planet osm file in parallel. Parallel hashmap provides an interesting approach for this, but this approach would require using the boost interprocess offset pointer.

https://github.com/systemed/tilemaker

Is there a particular reason why custom pointer types are not allowed ?

kleunen avatar Mar 27 '21 09:03 kleunen

This is a restriction I inherited from the original Abseil source code. Can you tell me what kind of pointer type you would like to use?

greg7mdp avatar Mar 27 '21 16:03 greg7mdp

Boost Interprocess Offset pointer: https://www.boost.org/doc/libs/1_75_0/doc/html/interprocess/offset_ptr.html

Please observe the following: One of the big problems of offset_ptr is the representation of the null pointer. The null pointer can't be safely represented like an offset, since the absolute address 0 is always outside of the mapped region. Due to the fact that the segment can be mapped in a different base address in each process the distance between the address 0 and offset_ptr is different for every process.

Some implementations choose the offset 0 (that is, an offset_ptr pointing to itself) as the null pointer pointer representation but this is not valid for many use cases since many times structures like linked lists or nodes from STL containers point to themselves (the end node in an empty container, for example) and 0 offset value is needed. An alternative is to store, in addition to the offset, a boolean to indicate if the pointer is null. However, this increments the size of the pointer and hurts performance.

In consequence, offset_ptr defines offset 1 as the null pointer, meaning that this class can't point to the byte after its own this pointer:

kleunen avatar Mar 27 '21 16:03 kleunen

Thanks! Maybe it is doable to support this. I'm wondering how I could test it. Would you have a small example with a boost unordered_map/set using a custom pointer type allocator?

greg7mdp avatar Mar 27 '21 16:03 greg7mdp

Yes, i can make small example

kleunen avatar Mar 27 '21 16:03 kleunen

This is the easy case: https://wandbox.org/permlink/uFzkywyr2StzUq4i

But we actually need scoped allocation with piecewise construction as well, i can add example of this later.

kleunen avatar Mar 27 '21 16:03 kleunen

Great, thanks @kleunen, let me look at that first!

greg7mdp avatar Mar 27 '21 16:03 greg7mdp

This is with a scoped allocation example: https://wandbox.org/permlink/08K4Pd3wVUcscwJG

Vector in map using a scoped allocator

kleunen avatar Mar 27 '21 17:03 kleunen

Thanks. Please give me a couple days to look into it.

greg7mdp avatar Mar 27 '21 18:03 greg7mdp

I cleaned up the example a little bit more: https://wandbox.org/permlink/21FjLvAenx9j76P9

kleunen avatar Mar 27 '21 18:03 kleunen

Hello @greg7mdp, do you think this is possible ? Or are there any show-stopping issues ?

kleunen avatar May 07 '21 16:05 kleunen

Hum, I'm not sure. I did try your example with a change in phmap but I got a very strange error. I need to look more into it.

image

greg7mdp avatar May 07 '21 16:05 greg7mdp

What was the error you got ?

kleunen avatar May 07 '21 17:05 kleunen

I actually worked around the issue now, by implementing a custom allocator, which keeps the mmap region open, even after resizing. This way, the pointers don't need to be offset pointers, because the old region also stayed mapped.

I do still think it would be useful if parallel-hashmap supports custom pointers. I think you should consistently use Alloctor::pointer and not convert to raw pointers.

kleunen avatar May 29 '21 05:05 kleunen

I think you should consistently use Alloctor::pointer and not convert to raw pointers.

How can I do that since I allocate the hashmap array itself using the allocator and need to access the entries into this array?

greg7mdp avatar May 29 '21 11:05 greg7mdp

You don't actually have to manage the array yourself, you can use std::vector to allocate the array of pairs for you. You can make the actual storage container a template argument:

using entry_t = std::pair<int, int>;
using entry_allocator_t = std::allocator< std::pair<int, int> >;

// Storing entries
template<class T, class A>
using bucket_storage_t = std::vector<T, A>;

int main()
{   
    bucket_storage_t<entry_t, entry_allocator_t> entries(10);
    for(int i = 0; i < 10; ++i)
        entries[i] = std::make_pair(i, i); 
}

In the cases of the parallel hash map, you also need a storage type and allocator for the vector of vectors:

#include <utility>
#include <vector>

// Allocate entries 
using entry_t = std::pair<int, int>;
using entry_allocator_t = std::allocator< std::pair<int, int> >;
using bucked_list_allocator_t = std::allocator< std::vector< std::pair<int, int> > >;

// Storing entries
template<class T, class A>
using bucket_storage_t = std::vector<T, A>;

// Storing list of buckets
template<class T, class A>
using bucket_list_storage_t = std::vector< bucket_storage_t<T, entry_allocator_t>, A>;

int main()
{   
    bucket_list_storage_t<entry_t, bucked_list_allocator_t> bucket_list(10);
    for(int i = 0; i < 10; ++i) 
        bucket_list.resize(100);
} 

But possibly you can also use a deque for this? No wait. The outer list is a static array in your case.

kleunen avatar May 29 '21 12:05 kleunen