Possible improvements for new ram middle
The new ram middle (#1461) is better than the previous implementations, but there are several places where it could be made even better. Here are some ideas that we can work on as time permits:

Note that many of these are interrelated and also related to other upcoming changes in the code. So in many cases it makes sense to defer decisions on what best to do until later, when we have a better idea of the environment this code runs in. (In fact, this is one reason I am writing this issue instead of improving the code directly; the other is limited time...)
- [ ] The storage for node locations and way node lists is, in the end, a `std::string` (`node_locations_t::m_data` and `middle_ram_t::m_way_nodes_data`, respectively). As more data arrives this will be resized again and again. Copying the memory around when that happens isn't that expensive, so it isn't too bad. But while the resizing takes place we temporarily need roughly twice the memory. So when we get near the memory limit this can break, even though enough memory would be available if we broke the data into smaller chunks.
- [x] The `node_locations_t` store writes new locations into a temporary cache (`m_block`) which gets written out to the real data storage (`m_data`) when full. This could be done directly instead; we then wouldn't need the `freeze()` function any more.
- [x] The `node_locations_t` store writes out all ids of one block and then all locations of that block. It might be better to write out all (id, location) pairs in order. (See the encoding sketch after this list.)
- [x] The `node_locations_t::block_size = 32` should be evaluated and tuned.
- [ ] Reading node locations back from `node_locations_t` is expensive, because we have to read through a lot of varints to find the data we need. We should probably use some kind of cache so that we can reuse decoded blocks. I have not implemented this, because we might want to run the reading code in multiple threads later. If that's the case, we might want to have the cache outside the `node_locations_t` class so threads don't trample on each other. Or maybe we need a `mutex`.
- [ ] When two-stage processing is requested by the flex output, the middle needs to store tags (and possibly attributes) of objects (currently only way objects, but in the future also nodes and possibly relations) in addition to the node locations and way node lists. The current implementation simply stores the complete objects as they are in the `osmium::memory::Buffer`, which takes a lot of memory. This was the simplest implementation I could think of, and because two-stage processing isn't used widely, it is a reasonable compromise for the time being. But we can do much better here, details TBD.
- [x] The ram middle could also use the existing disk flat node store or any future disk-based node stores we work on. Or some implementation where the data is stored on disk but we keep the index in memory. This can be improved once we have the new version of the pgsql middle and can see better where we can share code.
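To make the interleaved (id, location) encoding concrete, here is a minimal self-contained sketch. This is not the actual osm2pgsql code: `encode_block` and `find_in_block` are made-up names, the varint helpers stand in for protozero's, and a location is simplified to a single packed `int64_t`. It shows both the delta+zigzag encoding of the pairs and why a lookup has to decode varint by varint from the start of a block:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Minimal LEB128 varint helpers (protozero provides equivalents).
// Bounds checking is omitted for brevity.
void write_varint(std::string& out, uint64_t value) {
    while (value >= 0x80U) {
        out += static_cast<char>((value & 0x7fU) | 0x80U);
        value >>= 7U;
    }
    out += static_cast<char>(value);
}

uint64_t read_varint(const char*& data) {
    uint64_t value = 0;
    unsigned int shift = 0;
    while (static_cast<unsigned char>(*data) & 0x80U) {
        value |= static_cast<uint64_t>(static_cast<unsigned char>(*data++) & 0x7fU) << shift;
        shift += 7;
    }
    value |= static_cast<uint64_t>(static_cast<unsigned char>(*data++)) << shift;
    return value;
}

// Zigzag encoding so signed deltas become small unsigned varints.
uint64_t zigzag(int64_t v) noexcept {
    return (static_cast<uint64_t>(v) << 1U) ^ static_cast<uint64_t>(v >> 63);
}

int64_t unzigzag(uint64_t v) noexcept {
    return static_cast<int64_t>(v >> 1U) ^ -static_cast<int64_t>(v & 1U);
}

// Encode one block of (id, location) pairs, ids ascending, as
// interleaved delta-encoded varints.
void encode_block(std::string& out,
                  const std::vector<std::pair<int64_t, int64_t>>& pairs) {
    int64_t prev_id = 0;
    int64_t prev_loc = 0;
    for (const auto& [id, loc] : pairs) {
        write_varint(out, static_cast<uint64_t>(id - prev_id)); // ids ascending: non-negative delta
        write_varint(out, zigzag(loc - prev_loc));              // locations can go up or down
        prev_id = id;
        prev_loc = loc;
    }
}

// Look up an id: there is no random access, we must decode varint by
// varint from the start of the block, accumulating the deltas.
std::optional<int64_t> find_in_block(const char* data, const char* end,
                                     int64_t wanted_id) {
    int64_t id = 0;
    int64_t loc = 0;
    while (data < end) {
        id += static_cast<int64_t>(read_varint(data));
        loc += unzigzag(read_varint(data));
        if (id == wanted_id) {
            return loc;
        }
        if (id > wanted_id) {
            break; // ids are sorted, we have gone past the wanted one
        }
    }
    return std::nullopt;
}
```

Compared to storing all ids of a block first and then all locations, the interleaved layout lets a lookup stop as soon as it has passed the id it is looking for.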
Regarding "Reading node location back from node_locations_t": Some SIMD-based LEB128 decoders claim to provide faster speeds than the current byte-by-byte approach used in protozero. It might be fast enough to avoid the additional caching of decoded blocks. Lots of research on this topic by https://lemire.me/blog. Potentially interesting Rust implementation: https://github.com/as-com/varint-simd
Besides, the current approach already turned out to be about 10% faster than the old one in my tests. My impression was that fetching node locations isn't one of the major bottleneck in the overall process anymore. This might be different for other test cases, though.
When I worked on protozero I looked at all those schemes to improve varint decoding speed but decided against them. I don't want to maintain something like that. This certainly applies even more here.
And yes, it is unclear how much any of these possible improvements would actually help; we'd need to look at the resulting code and benchmarks to see whether they are worth it.
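Just to illustrate what the caching idea from the list above might look like once there are multiple reader threads, here is a rough sketch. The names (`decoded_block`, `block_cache`) are hypothetical, not osm2pgsql code: a shared map of decoded blocks behind a `mutex`, kept outside the store class as suggested above. Whether the locking overhead and the copies are worth it is exactly what the benchmarks would have to show.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical: one block decoded back into plain arrays for fast lookup.
struct decoded_block {
    std::vector<int64_t> ids;
    std::vector<int64_t> locations;
};

// A shared cache of decoded blocks, keyed by block number. It lives
// outside the store class and is guarded by a mutex so several reader
// threads don't trample on each other.
class block_cache {
public:
    explicit block_cache(std::size_t max_blocks) : m_max_blocks(max_blocks) {}

    // DECODE is any callable that decodes block `block_no` from the
    // compressed storage; it is only invoked on a cache miss.
    template <typename DECODE>
    decoded_block get(std::size_t block_no, DECODE&& decode) {
        std::lock_guard<std::mutex> const guard{m_mutex};
        auto it = m_blocks.find(block_no);
        if (it == m_blocks.end()) {
            if (m_blocks.size() >= m_max_blocks) {
                m_blocks.clear(); // crude eviction; a real cache would use LRU or similar
            }
            it = m_blocks.emplace(block_no, decode(block_no)).first;
        }
        return it->second; // returned by value so later eviction can't invalidate it
    }

private:
    std::mutex m_mutex;
    std::unordered_map<std::size_t, decoded_block> m_blocks;
    std::size_t m_max_blocks;
};
```

A per-thread cache would avoid the mutex and the copies entirely, at the cost of decoding some blocks more than once.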
The `node_locations_t::block_size = 32` turns out to be okay. With `= 16` the memory use is about twice as big; with `= 64` we save only about 1-5% memory. The runtime difference between 32 and 64 is minimal and vanishes in the noise. Let's stick with 32.
The first item above (that `node_locations_t::m_data` is a `std::string` which will be resized) has an additional problem: if there isn't enough capacity in a `std::string`, it is typically grown by doubling the capacity. This can mean that, in the worst case, we use twice as much memory for the cache as was configured. That's not good.

To fix this we need to either implement the resizing of `m_data` ourselves, i.e. get a new `std::string` of exactly the size we want, copy the data over and swap it with `m_data`; or we need to reserve the data in chunks.
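A rough sketch of the first variant (made-up names and an arbitrary example chunk size, not the actual code): round the capacity up to the next fixed-size chunk instead of letting `std::string` double it.

```cpp
#include <cstddef>
#include <string>

// Append to `data`, growing its capacity in fixed-size chunks instead
// of relying on std::string's (typically doubling) growth policy.
// The copy still needs old + new memory while it runs, but afterwards
// the capacity exceeds what is actually used by at most one chunk,
// instead of by up to 100% with the doubling policy.
void append_chunked(std::string& data, const char* src, std::size_t len,
                    std::size_t chunk_size = 10UL * 1024UL * 1024UL) {
    if (data.size() + len > data.capacity()) {
        // Round the needed capacity up to the next chunk boundary.
        std::size_t const needed = data.size() + len;
        std::size_t const new_capacity =
            ((needed + chunk_size - 1) / chunk_size) * chunk_size;

        // Get a new string with exactly the capacity we want, copy the
        // data over and swap it into place.
        std::string bigger;
        bigger.reserve(new_capacity);
        bigger.assign(data);
        data.swap(bigger);
    }
    data.append(src, len);
}
```

The second variant, keeping the data in a list of fixed-size chunks, would additionally avoid the temporary doubling while the copy runs, at the cost of slightly more complicated offset handling.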