Feature: ST_ClusterDBSCAN implementation
PostGIS's ST_ClusterDBSCAN is a really powerful function that is currently only available in PostGIS. It returns a cluster number for each geometry in a set, based on the DBSCAN algorithm.
https://postgis.net/docs/ST_ClusterDBSCAN.html
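For reference, a typical call in PostGIS looks something like this (the table/column names and the eps/minpoints values are just placeholders):

```sql
-- Assign a cluster id to every point: a point joins a cluster when at least
-- 5 points (including itself) lie within 50 units of each other; points that
-- don't belong to any cluster (noise) come back as NULL.
SELECT id,
       ST_ClusterDBSCAN(geom, 50, 5) OVER () AS cluster_id
FROM my_points;
```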
Is there any plan to work on it? As well as on ST_ClusterIntersecting and ST_ClusterKMeans?
Yes, I'm waiting for some changes in DuckDB's sorting and partitioning code to land, after which I'll try to integrate these into spatial.
We're also hoping to use ST_ClusterDBSCAN. @Maxxen, any idea if this is something that is weeks away, or are we looking at months/years out?
Hard to say for sure, but I'm currently reworking distance computations in general, after which a naive in-memory DBSCAN shouldn't be too hard to implement. The larger sorting and partitioning changes required to make it really performant probably won't be in place until the next major release.
ST_ClusterDBSCAN is sorely missed.
Do you happen to have any updates here? We're looking at migrating to DuckDB, but we use this function pretty heavily with PostGIS today and there isn't really any equivalent of the algorithm elsewhere.
I've got ST_ClusterDBSCAN working from spatial's side, but it requires changes in DuckDB core in order to force the window function executor to always call the right callbacks. I'm going to talk to the rest of the team to figure out how to proceed, and after that should be able to get it in for DuckDB v1.4, scheduled for September.
One thing to consider here is the ability to partition as well, since a lot of the time you're unable to run DBSCAN over larger areas.
That would be super helpful! There's no strong replacement algorithm available to us, so having it in place in the next major release is exciting.
@fhk would there be a reason you can't run DBSCAN on larger areas? It does have density/distance constraints, of course, but we haven't really seen any issues with other implementations of DBSCAN, even on large datasets. It should be able to run in O(n log n) time if implemented properly.
Ah! Depends on your use case and what you mean by big.
A lot of the time you want to cluster more than a million points.
My current method of choice for this is to use BigQuery or cuML.
Specifically, running a whole US state or even a country.
Typically there will be too much GEOMETRY to fit it all in memory, so people work around that by either pre-partitioning the data or doing it in the query, like so: https://stackoverflow.com/questions/68811457/clustering-using-dbscan-in-bigquery
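In PostGIS the same workaround can be expressed through the window's PARTITION BY clause. A rough sketch, using a hypothetical table with a state_code column as the partition key:

```sql
-- Cluster each state's points separately so no single partition has to hold
-- the whole country's geometry at once. The trade-off: points near a state
-- border can be clustered differently than a single global run would do.
SELECT id,
       state_code,
       ST_ClusterDBSCAN(geom, 50, 5)
         OVER (PARTITION BY state_code) AS cluster_id
FROM us_points;
```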
@fhk interesting, I wonder if DuckDB's higher memory limits might help in this case; it seems BigQuery hits issues once a partition exceeds 1 GB.
I couldn't find any DuckDB documentation about per-partition memory limits, so this could still be a factor if it can't use all addressable space for some reason.
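For what it's worth, DuckDB's overall memory budget is configurable (this says nothing about per-partition behaviour, just the global cap), e.g.:

```sql
-- Raise DuckDB's global memory cap; by default it is a share of system RAM.
SET memory_limit = '32GB';
SELECT current_setting('memory_limit');
```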
Hi @Maxxen, any chance this actually landed in 1.4? I see that version was released today, but this issue is still marked as open, so I'm guessing it didn't make the cut.