spatialdata icon indicating copy to clipboard operation
spatialdata copied to clipboard

Easily enable sorting points along a Z-order curve

Open keller-mark opened this issue 3 months ago • 2 comments

The problem

The current point representation does not enforce any spatial ordering.

As people are already storing large datasets with tens of millions of points into spatialdata objects, this problem urgently needs a solution.

Here, I aim to share one such solution that requires minimal changes to SpatialData. It would only require that users sort their Point elements in a specific way, and store metadata about the data extent ([x_min, x_max, y_min, y_max]).

One of the main issues for spatial data is the volume of the data, which affects how it is stored and accessed, and what operations will be applied to it. There are two types of operations. The first type of operations process the data in its entirety ... the order in which the data is processed is of importance and this is usually achieved by sorting the data ... The second type of operations process only a small subset of the data as is the case when we access the data at random or the portion of the data that is processed satisfies some predicate which can involve both its spatial and/or attribute components. ... The latter type of processing is important as its efficient execution (other than using brute force to examine every data item) brings into play the need to incorporate notions of sorting as its use is a prerequisite for its efficient retrieval. -- https://www.cs.umd.edu/~hjs/pubs/geoencycl.pdf

Consider the (interactive) visualization use case. The user has zoomed into a small rectangular region.

Image

Without any ordering of the points, we currently need to load all the data and iterate over it if we want to filter the points to only those in the current viewport. We have no way of knowing whether the two points within the viewport are in the first two rows of the dataframe, the last two, or the (first,last) rows, or any pair of rows in between. This approach is O(N), where N is the total number of points.

Describe the solution you'd like

SpatialData should make it easy to sort Point elements according to their Morton Codes (aka, sort along a Z-order curve). ~~I would even go so far as to say that this could be a default for the PointsModel, allowing users to opt-out (rather than opt-in).~~ Users could opt-in via the PointsModel, or they could explicitly run a sorting function provided by the package.

This would entail adding a column morton_code to the Points dataframe and sorting the dataframe accordingly.

I think this should also generalize to 3D points data and cube-based queries (although I have not yet tested it).

Rectangle (or cube) queries using morton codes

Then, with minimal extra information (only the [x_min, x_max, y_min, y_max] of the original point data, and the number of bits used to store the morton codes), we can easily compute a list of morton code intervals corresponding to a given rectangle query region.

Finally, given each (morton_start, morton_end) tuple in the list, we can easily use binary searches to identify the actual row ranges within the dataframe (row_index_start, row_index_end).

The user/tool would leverage the Parquet/Arrow row_groups to load only those row groups needed for the binary searches and the final list of row ranges.

Other small details

  • This may require adjusting the Parquet/Arrow row_groups size used in the output Parquet files in order to get reasonable benefits. For example, the partial reads should not need to read in huge row groups.
  • If the Point elements are annotated by one or more Table elements, these will need to be sorted as well, so that their orderings match.
  • The morton codes could be stored as an extra column of the dataframe, but this aspect is optional. Given the data extent and the number of bits, they can alternatively be dropped and re-computed on-the-fly from the original point X/Y values.
  • This approach could also extend to Shapes elements, for instance using the polygon centroids or circle coordinates.
  • Metadata about how the dataframe is sorted could also be stored on-disk, to help applications/clients determine whether this query strategy can be used. I could imagine alternative sorting strategies such as .sort_values(by=["gene_id", "morton_code"]) could also be useful, but the client would need to know this in order to optimize query performance. The downside of the extra metadata is keeping it in sync in case the underlying data extent/sorting strategy changes, but it may not change a lot for very large datasets.

Additional context Add any other context or screenshots about the feature request here.

Image Image

Describe alternatives you've considered

In #789 there is a discussion of alternatives which require more substantial changes to how the point data is currently stored. Other alternatives could include storing a spatial index tree data structure alongside the point data. The purpose of the current issue is to present a compromise that does not entail changes to the underlying storage format/representation, yet would still provides benefits when performing spatial queries.

keller-mark avatar Aug 22 '25 20:08 keller-mark