
Logging falsely suggests CLUSTER is run

mboeringa opened this issue Sep 01 '21 · 2 comments

What version of osm2pgsql are you using?

osm2pgsql v1.5.1

What operating system and PostgreSQL/PostGIS version are you using?

Ubuntu 20.04.1 LTS, PostgreSQL 13.4 / PostGIS 3.1.3

Tell us something about your system

What did you do exactly?

Monitored SQL statements generated by osm2pgsql in pgAdmin.

What did you expect to happen?

Currently, the logging suggests that CLUSTER is run in PostgreSQL; in reality, an "ORDER BY way" statement is used. While the net result of the two options is virtually identical, the suggestion that CLUSTER is run has a side consequence: any table that is truly CLUSTERed can easily be re-clustered with the PostgreSQL clusterdb utility. This is not the case for tables spatially sorted with "ORDER BY way".
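The distinction can be sketched in SQL. The table and index names below are illustrative (loosely modeled on osm2pgsql's defaults), not taken from its actual output:

```sql
-- True clustering: PostgreSQL records the clustered index in the
-- catalog, so a bare CLUSTER or the clusterdb utility can repeat
-- the operation later.
CLUSTER planet_osm_polygon USING planet_osm_polygon_way_idx;

-- What the log message actually corresponds to: rewriting the table
-- in spatial order. The rows end up sorted, but no clustered index
-- is recorded, so clusterdb cannot re-sort the table afterwards.
CREATE TABLE planet_osm_polygon_sorted AS
  SELECT * FROM planet_osm_polygon ORDER BY way;
```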

Maybe re-phrasing to a slightly more neutral statement like:

"Spatially sorting / optimizing table <X>"

would be better.

What did happen instead?

What did you do to try analyzing the problem?

— mboeringa, Sep 01 '21 17:09

I agree that we should change the docs here if we are not actually using CLUSTER. But maybe we should use it. Waaay back, osm2pgsql used CLUSTER and then switched away from it. Before changing the docs, we should re-check with modern database versions whether it still is better to not cluster.

— joto, Dec 20 '21 14:12

CLUSTER requires building an index twice, so it is generally slower.
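The extra work can be sketched as the sequence of steps CLUSTER implies (table and index names are illustrative):

```sql
-- The sort index must exist before CLUSTER can use it (first build):
CREATE INDEX planet_osm_polygon_way_idx
  ON planet_osm_polygon USING gist (way);

-- CLUSTER then rewrites the heap in index order and afterwards
-- rebuilds every index on the table -- including the one just
-- created (second build):
CLUSTER planet_osm_polygon USING planet_osm_polygon_way_idx;

-- Sorting on load avoids the duplicated work: write the rows
-- pre-sorted, then build each index exactly once.
```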

— pnorman, Dec 20 '21 18:12

Just for the record:

With current PG16 and the Facebook Daylights v1.33 1.1B-record building dataset, CLUSTER using the geometry spatial index as input is incredibly slow (the majority of the time is spent in the "index scanning heap" phase). As far as I can tell from all the tests I have run so far, the performance of CLUSTER in PostgreSQL deteriorates at a more-than-quadratic rate as the table grows.

E.g. with the ~500M buildings of the ordinary Planet file, it took about 7 hours to spatially defragment the table and rebuild all indexes using CLUSTER, whereas with the 1.1B-record dataset it took more than 48 hours.
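For anyone wanting to reproduce this, the phase mentioned above can be watched while CLUSTER runs via PostgreSQL's pg_stat_progress_cluster view (available since PG12); a query along these lines shows it:

```sql
SELECT p.pid,
       c.relname,
       p.phase,               -- e.g. 'index scanning heap'
       p.heap_tuples_scanned,
       p.heap_tuples_written
FROM pg_stat_progress_cluster AS p
JOIN pg_class AS c ON c.oid = p.relid;
```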

— mboeringa, Nov 08 '23 14:11