cuspatial icon indicating copy to clipboard operation
cuspatial copied to clipboard

[BUG] quadtree_point_in_polygon.cu, bug and fix

Open KaiWang1337 opened this issue 3 years ago • 5 comments

1. Bug description

Version: 22.12 and 22.10: will use 22.12 to walk through' We will show the cause of this bug and how we fixed it. Note: to trigger the bug, one needs to set the scale appropriately to create deep quad-tree

We are using the quad-tree based point to polygon intersection, cuspatial.quadtree_on_points cuspatial.join_quadtree_and_bounding_boxes cuspatial.quadtree_point_in_polygon

and the last step, quadtree_point_in_polygon, returns wrong indices for points.

https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/include/cuspatial/spatial_join.hpp#L100-L109

quadtree_point_in_polygon supposes to return point_offset, which should be the indices into (point_x, point_y). (point_x, point_y) are column data from the original table, so point_offset is just the row_ID of the original table. And we use the indices to join tables, etc.

2. The cause of the bug

The bug is that the point_offset is wrong, it actually returns the indices to point_indices instead of (point_x, point_y) point_indices is parted of the quad-tree leaf node storage (sorted by morton code), which points to(point_x, point_y)

The bug is in cpp/src/join/quadtree_point_in_polygon.cu here's the relevant parts of the code

The copy_if copies from global_to_poly_and_point_indices to poly_and_point_indices (which is the final output), if points intersect with polygon https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/src/join/quadtree_point_in_polygon.cu#L181-L190

global_to_poly_and_point_indices iterator is created by apply compute_poly_and_point_indices https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/src/join/quadtree_point_in_polygon.cu#L170-L176

and compute_poly_and_point_indices https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/src/join/quadtree_point_in_polygon.cu#L53-L69

quad_point_offsets contains the quad tree's leaf nodes' offset to the point_indices, so this functor just returns indices to point_indices (https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/src/join/quadtree_point_in_polygon.cu#L65)

as one can see , compute_poly_and_point_indices returns indices to point_indices so global_to_poly_and_point_indices returns indices to point_indices and finally it just copies global_to_poly_and_point_indices to poly_and_point_indices, which is the final output, that's how we get indices to point_indices instead of point_x y.

3. The fix

pass point_indices to compute_poly_and_point_indices then return the content of point_indices (indices to point x y )

template <typename T, typename QuadOffsetsIter> struct compute_poly_and_point_indices { QuadOffsetsIter quad_point_offsets; uint32_t const* point_offsets; uint32_t const* point_offsets_end; cudf::column_device_view const poly_indices; cudf::column_device_view const point_indices;

inline thrust::tuple<uint32_t, uint32_t> device operator()(cudf::size_type const global_index) const { // uint32_t quad_poly_index, local_point_index; auto const [quad_poly_index, local_point_index] = get_quad_poly_and_local_point_indices(global_index, point_offsets, point_offsets_end); uint32_t const point_idx = point_indices.element<uint32_t>(quad_point_offsets[quad_poly_index] + local_point_index); uint32_t const poly_idx = poly_indices.element<uint32_t>(quad_poly_index); return thrust::make_tuple(poly_idx, point_idx); } };

change point_xys_iter into iterator on point x y auto point_xys_iter = thrust::make_zip_iterator(point_x.begin<T>(), point_y.begin<T>());

pass new arg to compute_poly_and_point_indices

auto global_to_poly_and_point_indices = detail::make_counting_transform_iterator(
    0,
    compute_poly_and_point_indices<T, decltype(quad_offsets_iter)>{
      quad_offsets_iter,
      local_point_offsets.begin(),
      local_point_offsets.end(),
      *cudf::column_device_view::create(poly_indices, stream),
      *cudf::column_device_view::create(point_indices, stream)
      });

this works because global_to_poly_and_point_indices now returns indices to point x y directly, which is passed to test_poly_point_intersection, then we changed PointIter points; to simple iterator on point x,y, so this gives us the actual data for point testing

https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/src/join/quadtree_point_in_polygon.cu#L71-L92

and global_to_poly_and_point_indices is if_copied back to user, which suppose to have indices to point x y. so this is the correct output.

We verified it with our spatial join cases, the above fix works. It produces the correct results compared to geopandas sjoin

KaiWang1337 avatar Oct 06 '22 00:10 KaiWang1337

In the documentation on " @note The returned polygon and point indices are offsets into the poly_quad_pairs inputs and (point_x, point_y), respectively", (point_x, point_y) pairs are sorted based on Morton code. To retrieve the original points, a thrust::gather op can be applied. Please see https://github.com/rapidsai/cuspatial/blob/37d3db570c52308468a6740ef3227717afc5b256/cpp/tests/join/point_in_polygon_test_large.cpp#L167

A python example using NYC taxi trip data and cudf.take python API is at https://github.com/zhangjianting/cuspatial_benchmark_nyctaxi/blob/fd88bae921d6eac74e176ab885ed9f5ac73220b9/python/spatial_join_verify.py#L116-120

zhangjianting avatar Oct 06 '22 07:10 zhangjianting

Hi, Regarding (point_x, point_y) pairs are sorted based on Morton code. No, one shouldn't expect (point_x, point_y) to be sorted. The API didn't require sorted point inputs, and it's unnecessary. Here's the explanation

As one can see, we pass those 3 args to quadtree_point_in_polygon https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/include/cuspatial/spatial_join.hpp#L85-L87

(point x, y ) are just data from user like (dataframe["longitude"], dataframe["latitude"]), the indices of point x,y are row _IDs of the original dataframe columns, so we can merge dataframes using the returned results. point_indices is an array of indices pointed to (point x, y), and point_indices is sorted by morton code So if (point x, y ) is expected to be sorted, why do we need point_indices?

(point x, y ) and point_indices should have different orders, otherwise, no need for point_indices and the problem with current code is that it returns indices to point_indices instead of point x,y as stated in API

point_indices is from phase 1 quad tree construction

see how compute_point_keys_and_sorted_indices return point_indices https://github.com/rapidsai/cuspatial/blob/37d3db570c52308468a6740ef3227717afc5b256/cpp/include/cuspatial/experimental/detail/indexing/construction/phase_1.cuh#L252-L267

within compute_point_keys_and_sorted_indices, the sorted point_indices by key (morton code) https://github.com/rapidsai/cuspatial/blob/37d3db570c52308468a6740ef3227717afc5b256/cpp/include/cuspatial/experimental/detail/indexing/construction/phase_1.cuh#L87-L93

clearly, if point x y are expected to be sorted by morton code, then we don't need this step in quad tree construction

KaiWang1337 avatar Oct 06 '22 13:10 KaiWang1337

@KaiWang1337 My first comment on the input of sorted points was based on an old implementation, sorry for the confusion. In the public code, the x/y columns were desgined to be immutable.
The second commnet seems still valid. The docstring on "@note The returned polygon and point indices are offsets into the poly_quad_pairs inputs and (point_x, point_y), respectively", the point indices offsets seems to be the offsets into the SORTED x/y columns. As the python example shows, the offset into the ORIGINAL x/y columns can be recovered using a gather operator, which should be used to access the orginal x/y columns. I guess this is essentially what you have proposed. for the C++ test example, cudf::gather was used to change the order of the input x/y columns [https://github.com/rapidsai/cuspatial/blob/37d3db570c52308468a6740ef3227717afc5b256/cpp/tests/join/point_in_polygon_test_large.cpp#L167] which serves the same purpose with respect to testing correctness. There might be a historical reason on the decision to return offsets into the sorted x/y columns (vs. the offset into original input x/y columns). The correctness of the desgin/implmentation has been verified intensively. The docstring may need to be updated though.

zhangjianting avatar Oct 06 '22 14:10 zhangjianting

hi @zhangjianting yes, I agreed the design is correct besides the point indices return part.

The API clearly says returned point indices are indices to point x y, which is expected to be unsorted in current implementation, which I believe is the intention. It didn't say anywhere that the return is to the point_indices array and needing of gathering. https://docs.rapids.ai/api/libcuspatial/stable/group__spatial__join.html#ga503dd7c07f7f557fd51cd68a7c0c2165 quote The returned point and polyline indices are offsets into the (point_x, point_y) and poly_quad_pairs inputs, respectively. ed

point x y is expected to be unsorted in current implementation as one can see, in compute_quadtree_point_in_polygon https://github.com/rapidsai/cuspatial/blob/c4b1440fbac1b7a186ef9a4765ced1d1f1d14312/cpp/src/join/quadtree_point_in_polygon.cu#L154-L157 it use point_indices (indices to point x y, sorted by morton as key) as permutation iterator get the original indices of point x y (if point x y is already sorted, then we don't need to do that)

The fixed I propose is just to return indices to point x y (original, unsorted) as this is the behavior stated by the API.

Sorting point x y (or in this case, just the indices) is a part of quad tree implementation, the user would be expect indices to the original column for things like spatial join

KaiWang1337 avatar Oct 06 '22 15:10 KaiWang1337

@KaiWang1337 I fully agree that the docstring should match exactly what the API does. I am sure the development team would be able to look into the issue and make a decision. Not sure whether you are developing on top of cuspatial or comparing against it. In etiher case, if possible, can you share a littlle bit more information about your work?

zhangjianting avatar Oct 06 '22 17:10 zhangjianting

@KaiWang1337 it would be a fantastic contribution if you could submit your fix as a PR for us to review! Thanks!

thomcom avatar Oct 18 '22 22:10 thomcom