Compression (spec v3)
Options:
- Gzip compression - requires a library like pako, may be expensive
- Cap'n Proto packing: https://capnproto.org/encoding.html
- Protobuf varints: https://developers.google.com/protocol-buffers/docs/encoding
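For reference, the varint option above can be sketched in a few lines. This is a minimal illustration of Protobuf-style base-128 varints (7 payload bits per byte, high bit set as a continuation flag), not a proposal for the final wire format. The function names `encode_varint` and `decode_varint` are made up for this sketch.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (least significant group first)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, position after the varint)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

Small values cost one byte and anything below 16384 costs two, which is why varints pair well with delta-encoded fields whose differences are usually small.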
Use cases:
- Dense tile pyramids
- Sparse pyramids (tippecanoe output)
Add tile type as a required metadata field (png, jpg, mvt, etc.)
Flaws in current design:
- 512000-byte fixed-size header is wasteful
- Index performs poorly for certain cases (panning at leaf level and leaf+level 1, especially)
- ZXY addressing wastes ID space
New design:
- All internal ID storage is based on a Hilbert Tile ID
- Leaf directories are a configurable-size batching of the ID space (by default again 21845)
- The first 21845 entries are top-level entries, recognizing that overview tiles are more frequently accessed
- Leaf directories can be batched recursively: see FlatGeobuf https://worace.works/2022/03/12/flatgeobuf-implementers-guide/
- Offsets in indexes should be relative to the start of the data section, allowing relocation
- TileId, Offset, Length should be delta-encoded before gzip-compression.
- A metadata flag clustered:true indicates that the tile order on disk matches TileId order
- Mandate GZIP for vector tile content (ensure edge can re-encode to Brotli efficiently)
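One possible way to derive a Hilbert Tile ID from Z/X/Y is sketched below. It assumes IDs are assigned zoom by zoom (all of zoom 0, then zoom 1, and so on) and that tiles within a zoom are ordered by their position on a Hilbert curve. The notes above do not pin down the exact numbering, so treat this as an illustration. The names `hilbert_xy_to_d` and `zxy_to_tile_id` are hypothetical.

```python
def hilbert_xy_to_d(order: int, x: int, y: int) -> int:
    """Distance of cell (x, y) along a Hilbert curve covering a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def zxy_to_tile_id(z: int, x: int, y: int) -> int:
    """Assumed numbering: tiles of all lower zooms come first, then Hilbert order within zoom z."""
    if not (0 <= x < (1 << z) and 0 <= y < (1 << z)):
        raise ValueError("x/y out of range for zoom")
    tiles_below = (4 ** z - 1) // 3  # 4^0 + 4^1 + ... + 4^(z-1)
    return tiles_below + hilbert_xy_to_d(z, x, y)
```

Under this numbering, zooms 0 through 7 occupy exactly the first 21845 IDs, since (4^8 - 1) / 3 = 21845. That matches the size of the top-level block mentioned above.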
Unsolved problems:
- Relocation problem with offset of directory IDs (directories store "leaf level" offset?)
- Specific algorithm for clustering while also working around deduplication
- Should indexes go at the end or the beginning?
- How to store "directory" bit
Target metric: (size of index in bytes / total # of tiles in archive) = average number of bytes per tile entry.
Currently this is 17; my experimental results are 3-5 bytes per entry. Can we do better?
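A rough way to measure this metric for a candidate encoding is sketched below: delta-encode the entries, write them as varints, gzip the result, and divide by the number of entries. Everything here is an assumption made for illustration: the column-wise layout, storing lengths without deltas, and the synthetic clustered archive produced by the hypothetical `synthetic_entries` helper. Numbers from synthetic data will not match real archives.

```python
import gzip
import random
from typing import List, Tuple

Entry = Tuple[int, int, int]  # (tile_id, offset, length)


def encode_varint(value: int) -> bytes:
    """Base-128 varint for non-negative integers."""
    if value < 0:
        raise ValueError("negative delta: entries must be sorted and clustered")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def delta_encode(entries: List[Entry]) -> bytes:
    """Column-wise layout (an assumption): tile_id deltas, offset deltas, then raw lengths."""
    out = bytearray()
    prev = 0
    for tile_id, _, _ in entries:
        out += encode_varint(tile_id - prev)
        prev = tile_id
    prev = 0
    for _, offset, _ in entries:
        out += encode_varint(offset - prev)
        prev = offset
    for _, _, length in entries:
        out += encode_varint(length)
    return bytes(out)


def bytes_per_entry(entries: List[Entry]) -> float:
    """Size of the gzip-compressed index divided by the number of tile entries."""
    return len(gzip.compress(delta_encode(entries))) / len(entries)


def synthetic_entries(count: int, seed: int = 0) -> List[Entry]:
    """Hypothetical clustered archive: consecutive tile IDs, tiles stored back to back."""
    rng = random.Random(seed)
    entries = []
    offset = 0
    for tile_id in range(count):
        length = rng.randint(200, 20000)
        entries.append((tile_id, offset, length))
        offset += length
    return entries


print(f"{bytes_per_entry(synthetic_entries(10000)):.2f} bytes per entry")
```

Swapping in entries read from a real archive, or a different layout in `delta_encode`, gives a like-for-like comparison against the current 17 bytes per entry.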
Parquet encodings: https://parquet.apache.org/docs/file-format/data-pages/encodings/
- Extend spec to compress entire subtrees (ocean tiles)?
- Move certain fields into header instead of metadata, to avoid blocking on large metadata: bbox, minzoom, maxzoom, tile_type, compression, clustered
- Benchmark against Parquet size
Consider whether we should add leaders/trailers, as COG does: https://gdal.org/drivers/raster/cog.html
The COG 16KB assumptions seem good
Ghost sections / extensions, example: storing offset->hash in a ghost section to enable efficient diffing of two PMTiles archives