robin-map it = map.erase(it) potentially passes entries twice

I've stumbled upon a bug in my robin_hood map which also uses backward shift deleting, and wondered how you solve this issue in tsl::robin_map, and it seems you have exactly the same issue. (see https://github.com/martinus/robin-hood-hashing/issues/42)

The problem is that due to backward shift deleting, when iterating a map and erasing elements, it is possible that the same elements will be iterated over twice.

Here is a reproducer, where I insert two elements into the map. When erasing the second element, the first element will wrap around and be iterated over a second time. I currently think it is best to just claim that's the way it is, because I am afraid there is no easy workaround (except not using backward shift deleting)

template <typename T>
struct BadHash {
    size_t operator()(T const& val) const {
        return static_cast<size_t>(val);
    }
};

int main(int, char**) {
    tsl::robin_map<int, int, BadHash<int>> map;
    map.reserve(16); // resizes with mask 31

    map.emplace(31, 1);        // gets into last bucket
    map.emplace(1024 + 31, 2); // would also get into last bucket, but wraps around

    // first entry in the map
    auto it = map.begin();
    std::cout << it->first << "->" << it->second << std::endl;

    // second entry in the map, erase it
    ++it;
    std::cout << it->first << "->" << it->second << std::endl;
    it = map.erase(it);

    // now all two elements are iterated, we should be at the map's end, but due to backward shift
    // deletion we are not
    if (it != map.end()) {
        std::cout << it->first << "->" << it->second << std::endl;
    }
    it = map.erase(it);
    std::cout << "are we now at the end? " << (it == map.end()) << std::endl;
}

I wonder what you think about that issue?

Jul 31 '19 18:07 martinus

Hello,

Thank you very much for the report. I effectively completely missed that when implementing the backward shift deletion.

I started the implementation of a fix in the fix_issue_20 branch. I still have to check a bit if everything works fine and improve the documentation before merging the changes.

Basically I check if I wrap around during the backward shift deletion and if it's the case i just return end().

Thibaut

Jul 31 '19 21:07 Tessil

I think simply checking for a wrap around is not enough. E.g when you insert 3 elements at location 30, 2 elements are at the end and one will wrap around. When erasing the second to last, backward shift will wrap around but there is still one element left to be iterated

Aug 01 '19 05:08 martinus

Effectively I went a bit too fast with the proposed solution, it will not work in multiple cases. I'll check this evening to see how we can do that.

Aug 01 '19 06:08 Tessil

Hi,

I checked a bit and currently I see the following possible solutions:

Keeping the same behaviour as std::unordered_map.
- Don't wrap around, we just add a buffer at the end of the buckets (that is not counted in the bucket_count) to store the elements that would have wrapped around. We rehash the hash table if the buffer gets full regardless of the current load factor. The problem is the risk of memory DoS attacks as an attacker may force the hash table to rehash by inserting elements at the end.
- When creating an iterator, add a reference to the last current element of the hash table except on erase where we propagate the reference to the returned iterator. If on erase we see that the element of the iterator is the same as the reference to the last element in the iterator we return end() after erasing it instead of the next element.
- When calling begin(), return the first element that was not wrapped around. When incrementing the iterator, wrap around and continue until the last wrapped element
Different behaviour compare to std::unordered_map.
- Leave it as it is and document the difference. When using erase in a loop it's usually to erase an element that meets a condition (e.g. if(it->first % 2 == 0) it = erase(it);) and not erase everything that come after an element.
- Just don't return anything, we just change the return type to void. Not really viable as erasing while iterating can be quite useful.

The problem is more complex than expected and I have to think a bit if I can find another solution or which solution to pick-up.

Thibaut

Aug 10 '19 11:08 Tessil

I also thought about adding an overflow buffer at the end. This might have a slight performance advantage because less need to & (tablesize-1). The DoS attack could also be mitigated with a hash that e.g. uses the this pointer, then the positions become less predictable.

Your iterator idea brings me to a new idea: How about instead of a reference, add a counter of the number of remaining items. begin() sets the counter to size(), and end() sets the counter to 0. Each ++it or it = erase(it) decrease the counter, comparing iterators just need to compare the counter. That needs no specil wrap-around handling, and begin() and end() stay fast operations

Aug 11 '19 05:08 martinus

Adding a counter would also be possible but the problem is that you also need to initialize it in other methods than begin(). For example on insertion you need to return an iterator to the inserted element, how can you get the remaining size until end() efficiently?

I also realised that a reference to the last element is not really possible either as it may be invalidated. It also has the same problem as with a counter, how do you efficiently get a reference to the last element in some methods? You could eventually maintain a pointer to the last element in the class and update it on each insert/erase. But it's getting quite complex.

Aug 11 '19 08:08 Tessil

robin-map robin-map copied to clipboard

it = map.erase(it) potentially passes entries twice

robin-map
robin-map copied to clipboard