osmflat-rs

metrics about file size

Open snuup opened this issue 3 years ago • 5 comments

What size does such file have if it holds the data from a planet.pbf file? Are there any other metrics you can give to estimate performance or size?

Thank you

snuup avatar Mar 28 '22 20:03 snuup

Data size depends on the compression used. You get the best compression with something like Shuffly (tar -cf - my_folder | shuffly -e | pzstd -o my_folder.tar.shuffly.pzstd).

Here are some numbers for the raw archive and for different compressions, with and without the optional OSM identifier subarchive:

62G planet-220103.osm.pbf
46G planet-220103.flatdata.tar.shuffly.zst (with Ids)
83G planet-220103.flatdata.tar.zst (without Ids)
96G planet-220103.flatdata.tar.zst (with Ids)
169G planet-220103.flatdata (without Ids)
208G planet-220103.flatdata (with Ids)

Interestingly, osmflat is much smaller than pbf when using shuffly + zstd, even though size was not the main goal (performance and random access were). Any replacement for zstd would work, as shuffly makes data more compressible for any dictionary-based algorithm.

Regarding performance: osmflat gives you O(1) random access to the data. If your processing pipeline needs that, you might get an order of magnitude faster processing and a smaller memory footprint. Examples include resolving node references when processing ways, or building a routing graph. The pbf format requires you to build lookup tables in memory, process the data multiple times, or resort to other tricks. Do you have a specific example in mind that we could benchmark? The examples folder has many that mirror the Osmium ones, and many of those are 10x+ faster (some much more than that, but that is because PBF does not store much metadata, e.g. the number of ways).
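To illustrate the difference, here is a minimal Rust sketch of index-based resolution. The Archive, Node, and Way types and the way_nodes and sample_archive functions are hypothetical and only mimic the idea of a flat layout; they are not osmflat's actual API or schema.

```rust
/// Hypothetical flat layout (NOT osmflat's real schema): every entity lives
/// in a plain array, and references are array indices rather than OSM ids.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Node {
    lat: i32,
    lon: i32,
}

struct Way {
    /// Start of this way's slice in `Archive::node_refs`.
    refs_start: usize,
    /// One past the end of this way's slice in `Archive::node_refs`.
    refs_end: usize,
}

struct Archive {
    nodes: Vec<Node>,
    ways: Vec<Way>,
    /// Concatenated node indices of all ways.
    node_refs: Vec<usize>,
}

impl Archive {
    /// Resolve the nodes of one way with a single array lookup per
    /// reference, without building an id -> node table first.
    fn way_nodes(&self, way_idx: usize) -> impl Iterator<Item = Node> + '_ {
        let way = &self.ways[way_idx];
        self.node_refs[way.refs_start..way.refs_end]
            .iter()
            .map(move |&node_idx| self.nodes[node_idx])
    }
}

/// Tiny made-up data set: two ways that share the node at index 1.
fn sample_archive() -> Archive {
    Archive {
        nodes: vec![
            Node { lat: 10, lon: 20 },
            Node { lat: 11, lon: 21 },
            Node { lat: 12, lon: 22 },
        ],
        ways: vec![
            Way { refs_start: 0, refs_end: 2 },
            Way { refs_start: 2, refs_end: 4 },
        ],
        node_refs: vec![0, 1, 1, 2],
    }
}
```

With a pbf input, the same lookup would typically need a map from node id to location that is filled in a first pass over all nodes.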

Another benefit of random access is that parallelizing the processing becomes much simpler.
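As a rough sketch of why: when every element is reachable by index, the work can be split into contiguous index ranges and handed to threads without any coordination. The parallel_count function below is a hypothetical helper written against a plain slice using only the Rust standard library; it does not use osmflat itself.

```rust
use std::thread;

/// Count the elements matching `pred`, splitting the slice into up to
/// `num_threads` contiguous chunks that are processed in parallel.
fn parallel_count<T: Sync>(items: &[T], num_threads: usize, pred: fn(&T) -> bool) -> usize {
    let threads = num_threads.max(1);
    // Round up so that all elements are covered; never use a chunk length of 0.
    let chunk_len = ((items.len() + threads - 1) / threads).max(1);
    thread::scope(|scope| {
        // Spawn all workers first, then join them, so they run concurrently.
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().filter(|item| pred(*item)).count()))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .sum::<usize>()
    })
}
```

A sequential format would first need some way to find chunk boundaries in the stream before the work could be split like this.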

The biggest downside would be that it requires a larger disk footprint after downloading.

Being built upon the cross-language IDL flatdata also has its benefits: no manual bit-shifting code is needed, multiple languages are supported from the get-go, and each archive is self-describing.

VeaaC avatar Mar 29 '22 08:03 VeaaC

Thank you for your detailed reply. I defined my own binary format "FlatMap" many years ago, refined it over the years, and published it at the FOSSGIS 2022 conference. Its size is:

                  FlatMap           Pbf (no meta, locations on ways)
uncompressed      72.125.314.843    84.920.749.729
compressed bz2    55.907.932.608    54.290.110.221

I use it uncompressed via memory mapping, which gives microsecond or faster access to nodes/ways/relations. I would compress it only for transport. It holds exactly the OSM data of the planet.pbf, minus the metadata, and puts locations into ways while keeping node ids for development and debugging purposes. It is an OSM format rather than a geo format, which also shows in the 4 byte = 100 nanodegree resolution for lon/lat.
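For readers unfamiliar with this encoding, a small Rust sketch of such a fixed-point coordinate follows. It assumes a signed 32-bit integer where one unit equals 100 nanodegrees (1e-7 degrees); the names encode_degrees and decode_degrees are made up for illustration and are not taken from FlatMap. At this resolution one unit corresponds to roughly 1.1 cm at the equator.

```rust
/// Fixed-point units per degree: one unit is 100 nanodegrees (1e-7 degrees).
/// Assumption for illustration; not taken from FlatMap's source.
const UNITS_PER_DEGREE: f64 = 1e7;

/// Encode a coordinate in degrees into a 4-byte signed integer,
/// rounding to the nearest unit.
fn encode_degrees(degrees: f64) -> i32 {
    (degrees * UNITS_PER_DEGREE).round() as i32
}

/// Decode a 4-byte fixed-point coordinate back into degrees.
fn decode_degrees(units: i32) -> f64 {
    f64::from(units) / UNITS_PER_DEGREE
}
```

The full longitude range of -180 to +180 degrees maps to -1,800,000,000 to +1,800,000,000 units, which fits into an i32 (maximum 2,147,483,647).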

snuup avatar Apr 01 '22 19:04 snuup

Nice! Having only 70GB "at rest" can make FlatMap very useful for some applications.

It looks like the biggest difference between osmflat and FlatMap is that FlatMap always employs "some" compression (var-length, etc.), whereas osmflat is fully decompressed and has no need for OSM ids (they are optional). osmflat's random access speed mostly depends on I/O and cache, and in the best case can be as fast as a normal array access (nanoseconds). The actual impact on data processing would depend a lot on the actual usage, I guess. I imagine that FlatMap's inlining of nodes already gets rid of a lot of random access. Finding shared nodes (e.g. to build a routing graph) or resolving relations will still be required.

If you want, we could set up a simple benchmark (e.g. building a routing graph) and test it on all 3 formats?
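As a sketch of what "finding shared nodes" could look like when nodes are addressed by dense indices: count how often each node is referenced, and treat nodes referenced more than once as candidate graph vertices. The shared_nodes function and its input shape (one list of node indices per way) are assumptions for illustration only, not code from osmflat or FlatMap.

```rust
/// Return the indices of all nodes that are referenced more than once
/// across the given ways, in ascending order.
///
/// Simplification: a way that visits the same node twice (e.g. a closed
/// loop) also makes that node count as shared.
fn shared_nodes(num_nodes: usize, ways: &[Vec<usize>]) -> Vec<usize> {
    // Dense indices allow a plain counter array instead of a hash map.
    let mut use_count = vec![0u32; num_nodes];
    for way in ways {
        for &node_idx in way {
            use_count[node_idx] = use_count[node_idx].saturating_add(1);
        }
    }
    use_count
        .iter()
        .enumerate()
        .filter_map(|(node_idx, &count)| if count > 1 { Some(node_idx) } else { None })
        .collect()
}
```

A real routing graph would also need the end points of each way and tag-based filtering, which this sketch leaves out.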

VeaaC avatar Apr 02 '22 07:04 VeaaC

Thank you, looks like we are technically on the same level and made some similar and some different decisions. It would be fruitful to exchange ideas and compare. I am busy with other things and will come back here later.

cheers

snuup avatar Apr 03 '22 21:04 snuup

FYI: https://github.com/boxdot/osmflat-rs/pull/70 makes the schema a bit more compact (especially if compressed with shuffly).

VeaaC avatar Oct 26 '22 06:10 VeaaC