cuspatial
cuspatial copied to clipboard
[BUG] quadtree_point_in_polygon.cu, bug and fix
1. Bug description
Version: 22.12 and 22.10: will use 22.12 to walk through' We will show the cause of this bug and how we fixed it. Note: to trigger the bug, one needs to set the scale appropriately to create deep quad-tree
We are using the quad-tree based point to polygon intersection, cuspatial.quadtree_on_points cuspatial.join_quadtree_and_bounding_boxes cuspatial.quadtree_point_in_polygon
and the last step, quadtree_point_in_polygon, returns wrong indices for points.
https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/include/cuspatial/spatial_join.hpp#L100-L109
quadtree_point_in_polygon supposes to return point_offset, which should be the indices into (point_x, point_y). (point_x, point_y) are column data from the original table, so point_offset is just the row_ID of the original table. And we use the indices to join tables, etc.
2. The cause of the bug
The bug is that the point_offset is wrong, it actually returns the indices to point_indices instead of (point_x, point_y) point_indices is parted of the quad-tree leaf node storage (sorted by morton code), which points to(point_x, point_y)
The bug is in cpp/src/join/quadtree_point_in_polygon.cu here's the relevant parts of the code
The copy_if copies from global_to_poly_and_point_indices to poly_and_point_indices (which is the final output), if points intersect with polygon https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/src/join/quadtree_point_in_polygon.cu#L181-L190
global_to_poly_and_point_indices iterator is created by apply compute_poly_and_point_indices https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/src/join/quadtree_point_in_polygon.cu#L170-L176
and compute_poly_and_point_indices https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/src/join/quadtree_point_in_polygon.cu#L53-L69
quad_point_offsets contains the quad tree's leaf nodes' offset to the point_indices, so this functor just returns indices to point_indices (https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/src/join/quadtree_point_in_polygon.cu#L65)
as one can see , compute_poly_and_point_indices returns indices to point_indices so global_to_poly_and_point_indices returns indices to point_indices and finally it just copies global_to_poly_and_point_indices to poly_and_point_indices, which is the final output, that's how we get indices to point_indices instead of point_x y.
3. The fix
pass point_indices to compute_poly_and_point_indices
then return the content of point_indices (indices to point x y )
template <typename T, typename QuadOffsetsIter> struct compute_poly_and_point_indices { QuadOffsetsIter quad_point_offsets; uint32_t const* point_offsets; uint32_t const* point_offsets_end; cudf::column_device_view const poly_indices; cudf::column_device_view const point_indices;
inline thrust::tuple<uint32_t, uint32_t> device operator()(cudf::size_type const global_index) const { // uint32_t quad_poly_index, local_point_index; auto const [quad_poly_index, local_point_index] = get_quad_poly_and_local_point_indices(global_index, point_offsets, point_offsets_end); uint32_t const point_idx = point_indices.element<uint32_t>(quad_point_offsets[quad_poly_index] + local_point_index); uint32_t const poly_idx = poly_indices.element<uint32_t>(quad_poly_index); return thrust::make_tuple(poly_idx, point_idx); } };
change point_xys_iter into iterator on point x y
auto point_xys_iter = thrust::make_zip_iterator(point_x.begin<T>(), point_y.begin<T>());
pass new arg to compute_poly_and_point_indices
auto global_to_poly_and_point_indices = detail::make_counting_transform_iterator(
0,
compute_poly_and_point_indices<T, decltype(quad_offsets_iter)>{
quad_offsets_iter,
local_point_offsets.begin(),
local_point_offsets.end(),
*cudf::column_device_view::create(poly_indices, stream),
*cudf::column_device_view::create(point_indices, stream)
});
this works because global_to_poly_and_point_indices now returns indices to point x y directly, which is passed to test_poly_point_intersection, then we changed PointIter points; to simple iterator on point x,y, so this gives us the actual data for point testing
https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/src/join/quadtree_point_in_polygon.cu#L71-L92
and global_to_poly_and_point_indices is if_copied back to user, which suppose to have indices to point x y. so this is the correct output.
We verified it with our spatial join cases, the above fix works. It produces the correct results compared to geopandas sjoin
In the documentation on " @note The returned polygon and point indices are offsets into the poly_quad_pairs inputs and (point_x, point_y), respectively", (point_x, point_y) pairs are sorted based on Morton code. To retrieve the original points, a thrust::gather op can be applied. Please see https://github.com/rapidsai/cuspatial/blob/37d3db570c52308468a6740ef3227717afc5b256/cpp/tests/join/point_in_polygon_test_large.cpp#L167
A python example using NYC taxi trip data and cudf.take python API is at https://github.com/zhangjianting/cuspatial_benchmark_nyctaxi/blob/fd88bae921d6eac74e176ab885ed9f5ac73220b9/python/spatial_join_verify.py#L116-120
Hi,
Regarding (point_x, point_y) pairs are sorted based on Morton code.
No, one shouldn't expect (point_x, point_y) to be sorted. The API didn't require sorted point inputs, and it's unnecessary.
Here's the explanation
As one can see, we pass those 3 args to quadtree_point_in_polygon https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/include/cuspatial/spatial_join.hpp#L85-L87
(point x, y ) are just data from user like (dataframe["longitude"], dataframe["latitude"]), the indices of point x,y are row _IDs of the original dataframe columns, so we can merge dataframes using the returned results. point_indices is an array of indices pointed to (point x, y), and point_indices is sorted by morton code So if (point x, y ) is expected to be sorted, why do we need point_indices?
(point x, y ) and point_indices should have different orders, otherwise, no need for point_indices and the problem with current code is that it returns indices to point_indices instead of point x,y as stated in API
point_indices is from phase 1 quad tree construction
see how compute_point_keys_and_sorted_indices return point_indices https://github.com/rapidsai/cuspatial/blob/37d3db570c52308468a6740ef3227717afc5b256/cpp/include/cuspatial/experimental/detail/indexing/construction/phase_1.cuh#L252-L267
within compute_point_keys_and_sorted_indices, the sorted point_indices by key (morton code) https://github.com/rapidsai/cuspatial/blob/37d3db570c52308468a6740ef3227717afc5b256/cpp/include/cuspatial/experimental/detail/indexing/construction/phase_1.cuh#L87-L93
clearly, if point x y are expected to be sorted by morton code, then we don't need this step in quad tree construction
@KaiWang1337 My first comment on the input of sorted points was based on an old implementation, sorry for the confusion. In the public code, the x/y columns were desgined to be immutable.
The second commnet seems still valid. The docstring on "@note The returned polygon and point indices are offsets into the poly_quad_pairs inputs and (point_x, point_y), respectively", the point indices offsets seems to be the offsets into the SORTED x/y columns. As the python example shows, the offset into the ORIGINAL x/y columns can be recovered using a gather operator, which should be used to access the orginal x/y columns. I guess this is essentially what you have proposed.
for the C++ test example, cudf::gather was used to change the order of the input x/y columns [https://github.com/rapidsai/cuspatial/blob/37d3db570c52308468a6740ef3227717afc5b256/cpp/tests/join/point_in_polygon_test_large.cpp#L167] which serves the same purpose with respect to testing correctness.
There might be a historical reason on the decision to return offsets into the sorted x/y columns (vs. the offset into original input x/y columns). The correctness of the desgin/implmentation has been verified intensively. The docstring may need to be updated though.
hi @zhangjianting yes, I agreed the design is correct besides the point indices return part.
The API clearly says returned point indices are indices to point x y, which is expected to be unsorted in current implementation, which I believe is the intention.
It didn't say anywhere that the return is to the point_indices array and needing of gathering.
https://docs.rapids.ai/api/libcuspatial/stable/group__spatial__join.html#ga503dd7c07f7f557fd51cd68a7c0c2165
quote
The returned point and polyline indices are offsets into the (point_x, point_y) and poly_quad_pairs inputs, respectively.
ed
point x y is expected to be unsorted in current implementation as one can see, in compute_quadtree_point_in_polygon https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/src/join/quadtree_point_in_polygon.cu#L154-L157 it use point_indices (indices to point x y, sorted by morton as key) as permutation iterator get the original indices of point x y (if point x y is already sorted, then we don't need to do that)
The fixed I propose is just to return indices to point x y (original, unsorted) as this is the behavior stated by the API.
Sorting point x y (or in this case, just the indices) is a part of quad tree implementation, the user would be expect indices to the original column for things like spatial join
@KaiWang1337 I fully agree that the docstring should match exactly what the API does. I am sure the development team would be able to look into the issue and make a decision. Not sure whether you are developing on top of cuspatial or comparing against it. In etiher case, if possible, can you share a littlle bit more information about your work?
@KaiWang1337 it would be a fantastic contribution if you could submit your fix as a PR for us to review! Thanks!