
Improving efficiency of the node location store

Open lonvia opened this issue 4 months ago • 1 comment

The node location store currently has a very simple array structure which makes it easy to use and very fast to access. The downside is that it needs nearly 100 GB of storage, no matter whether you are importing a full planet or just a small extract. We need a data structure that compresses the data better and works well for full planets and extracts alike.

lonvia avatar Aug 22 '25 18:08 lonvia

(copied from https://github.com/osm2pgsql-dev/osm2pgsql/issues/1466#issuecomment-2882929246)

Here are some additional considerations regarding memory efficiency for the node location store:

The current RAM middle implements the node location store roughly like this: locations are stored in blocks of 32 nodes. Inside each block, the node ids and the x and y coordinates are stored delta-encoded as varints. Overall we need about 8 to 9 bytes per stored node; for larger files (planet) it is a bit less, for smaller files a bit more.
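To make the scheme concrete, here is a minimal sketch of such a block encoder, assuming zigzag deltas and LEB128-style varints. The names (`Node`, `encode_block`, `BLOCK_SIZE`) and the exact layout are made up for illustration and are not osm2pgsql's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t BLOCK_SIZE = 32; // nodes per block, as described above

struct Node {
    std::int64_t id;
    std::int32_t x; // coordinates in fixed-point units
    std::int32_t y;
};

// LEB128-style unsigned varint: 7 payload bits per byte,
// a set high bit means "more bytes follow".
void encode_varint(std::vector<unsigned char> &out, std::uint64_t value)
{
    while (value >= 0x80) {
        out.push_back(static_cast<unsigned char>((value & 0x7f) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<unsigned char>(value));
}

// Zigzag mapping so that small negative deltas also become small varints.
std::uint64_t zigzag(std::int64_t value)
{
    return (static_cast<std::uint64_t>(value) << 1) ^
           static_cast<std::uint64_t>(value >> 63);
}

// Encode one block of up to BLOCK_SIZE nodes: each id/x/y is stored as the
// delta against the previous node in the block.
std::vector<unsigned char> encode_block(const std::vector<Node> &nodes)
{
    std::vector<unsigned char> out;
    std::int64_t prev_id = 0;
    std::int32_t prev_x = 0;
    std::int32_t prev_y = 0;
    for (const auto &n : nodes) {
        encode_varint(out, zigzag(n.id - prev_id));
        encode_varint(out, zigzag(n.x - prev_x));
        encode_varint(out, zigzag(n.y - prev_y));
        prev_id = n.id;
        prev_x = n.x;
        prev_y = n.y;
    }
    return out;
}
```

For a run of nodes created together for a way, the id delta is 1 and the coordinate deltas are small, so each varint is only one or two bytes, which is consistent with the 8 to 9 bytes per node observed above.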

This doesn't look that great. The naive implementation (which we use for the on-disk flat node files) uses 8 bytes times the highest node id; for the planet that is the nearly 100 GB mentioned above. Since almost all node ids in the planet file are actually in use, the RAM middle's 8 to 9 bytes per stored node ends up really close to that. Of course, for smaller files the RAM middle is much more efficient, because it only needs the 8 bytes for actually stored nodes and does not depend on the highest node id.

Can we find a more efficient encoding than delta/varint? What makes delta/varint work reasonably well is that we have runs of nodes that are really close to each other, because they were created together for OSM ways. But those nodes are often not that close, so varint might not be the best encoding, and we don't always have such runs of close nodes at all. Varint encoding of essentially random numbers does not work well.

We should explore other encoding options. One idea is to encode the x and y coordinates together by interleaving their bits, as the index in the main OSM database does. Then we can store the higher-order bits, which are the same for all (or many) of the nodes in a block, separately from the lower-order bits. This could be done with either a fixed or a dynamic split between those higher and lower order bits. A back-of-the-envelope calculation suggests that this might save some space, but we'd need to implement it to see whether that holds with actual data.
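A sketch of what the interleaving could look like, assuming 32-bit fixed-point coordinates and a fixed split point; the constant `LOW_BITS`, the function names, and the block layout are invented for this example:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Spread the 32 bits of v into the even bit positions of a 64-bit value.
std::uint64_t spread_bits(std::uint32_t v)
{
    std::uint64_t x = v;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFull;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFull;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x << 2))  & 0x3333333333333333ull;
    x = (x | (x << 1))  & 0x5555555555555555ull;
    return x;
}

// Interleave x and y bit by bit (a Morton / Z-order code). Nodes that are
// close together on the ground share a long run of high-order bits.
std::uint64_t interleave(std::uint32_t x, std::uint32_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}

// Fixed split between shared high-order bits and per-node low-order bits.
// A dynamic variant would pick the split point per block instead.
constexpr unsigned LOW_BITS = 24;

struct SplitBlock {
    std::uint64_t prefix;                // high bits, stored once per block
    std::vector<std::uint32_t> suffixes; // low bits, one entry per node
};

// Try to encode a (non-empty) block of interleaved codes with one shared
// prefix. If the nodes are too far apart, the split fails and the block
// would fall back to another encoding.
std::optional<SplitBlock> try_split(const std::vector<std::uint64_t> &codes)
{
    SplitBlock block;
    block.prefix = codes.front() >> LOW_BITS;
    for (const auto code : codes) {
        if ((code >> LOW_BITS) != block.prefix) {
            return std::nullopt;
        }
        block.suffixes.push_back(
            static_cast<std::uint32_t>(code & ((1ull << LOW_BITS) - 1)));
    }
    return block;
}
```

With a 24-bit suffix this would come to 3 bytes per node plus the shared prefix, but whether real data clusters tightly enough for that is exactly what would need measuring.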

We could also have different implementations for the blocks depending on the node locations. For each block the most efficient encoding could be chosen, similar to how compression engines sometimes work. Of course this adds more fixed overhead per block to store that information...
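As a sketch of how that per-block choice might look (the tag values and the set of encodings are invented for illustration):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// One-byte tag telling the decoder which scheme a block uses. This tag is
// the fixed per-block overhead mentioned above.
enum class BlockEncoding : std::uint8_t {
    delta_varint = 0,  // the current scheme
    shared_prefix = 1, // interleaved coordinates with a shared prefix
    raw = 2,           // fallback: fixed-width storage
};

// Given the same (non-empty) block encoded by several candidate encoders,
// keep the smallest result and prepend the tag so the decoder knows which
// scheme was used.
std::vector<unsigned char> pick_smallest(
    const std::vector<std::pair<BlockEncoding,
                                std::vector<unsigned char>>> &candidates)
{
    const auto *best = &candidates.front();
    for (const auto &candidate : candidates) {
        if (candidate.second.size() < best->second.size()) {
            best = &candidate;
        }
    }
    std::vector<unsigned char> out;
    out.push_back(static_cast<unsigned char>(best->first));
    out.insert(out.end(), best->second.begin(), best->second.end());
    return out;
}
```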

lonvia avatar Aug 22 '25 18:08 lonvia