Problem with indices with `left_exclusive` join

Open LucaMarconato opened this issue 1 year ago • 1 comments

To reproduce

While working on some tests for https://github.com/scverse/spatialdata/pull/822, I discovered a bug with left_exclusive (unrelated to the bug addressed in the mentioned PR).

To reproduce, please add the left_exclusive string to the pytest.mark.parametrize in test_inner_join_match_rows_duplicate_obs_indices() (test_relational_query.py). The bug is unrelated to the one fixed by the mentioned PR because the issue also appears if we comment out this line from the test:

sdata["table"].obs.index = ["a"] * sdata["table"].n_obs

The error I get is shown in the traceback below. @melonora, could you please have a look at it?

Traceback

tests/core/query/test_relational_query.py:380 (test_inner_join_match_rows_duplicate_obs_indices[left_exclusive])
sdata_query_aggregation = SpatialData object
├── Points
│     └── 'points': DataFrame with shape: (<Delayed>, 5) (2D points)
├── Shapes
│     ├─...:
        points (Points), by_circles (Shapes), by_polygons (Shapes), values_circles (Shapes), values_polygons (Shapes)
join_type = 'left_exclusive'

    @pytest.mark.parametrize('join_type', ['left', 'right', 'inner', 'right_exclusive', 'left_exclusive'])
    def test_inner_join_match_rows_duplicate_obs_indices(sdata_query_aggregation: SpatialData, join_type: str) -> None:
        sdata = sdata_query_aggregation
        # sdata["table"].obs.index = ["a"] * sdata["table"].n_obs
        sdata["values_circles"] = sdata_query_aggregation["values_circles"][:4]
        sdata["values_polygons"] = sdata_query_aggregation["values_polygons"][:5]
    
>       element_dict, table = join_spatialelement_table(
            sdata=sdata,
            spatial_element_names=["values_circles", "values_polygons"],
            table_name="table",
            how=join_type,
        )

test_relational_query.py:388: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../src/spatialdata/_core/query/relational_query.py:680: in join_spatialelement_table
    elements_dict_joined, table = _call_join(elements_dict, table, how, match_rows)
../../../src/spatialdata/_core/query/relational_query.py:697: in _call_join
    elements_dict, table = JoinTypes[how](elements_dict, table, match_rows)
../../../src/spatialdata/_core/query/relational_query.py:528: in __call__
    return self.value(*args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

element_dict = defaultdict(<function _create_sdata_elements_dict_for_join.<locals>.<lambda> at 0x3087c3c40>, {'shapes': defaultdict(<...))  ...                3
4  POLYGON ((66 -6, 66 6, 78 6, 78 -6, 66 -6))  ...                4

[5 rows x 3 columns]})})
table = AnnData object with n_obs × n_vars = 21 × 1
    obs: 'region', 'instance_id', 'categorical_in_obs', 'numerical_in_obs'
    uns: 'spatialdata_attrs'
match_rows = 'no'

    def _left_exclusive_join_spatialelement_table(
        element_dict: dict[str, dict[str, Any]], table: AnnData, match_rows: Literal["left", "no", "right"]
    ) -> tuple[dict[str, Any], AnnData | None]:
        regions, region_column_name, instance_key = get_table_keys(table)
        groups_df = table.obs.groupby(by=region_column_name, observed=False)
        for element_type, name_element in element_dict.items():
            for name, element in name_element.items():
                if name in regions:
                    group_df = groups_df.get_group(name)
                    table_instance_key_column = group_df[instance_key]
                    if element_type in ["points", "shapes"]:
                        mask = np.full(len(element), True, dtype=bool)
>                       mask[table_instance_key_column.values] = False
E                       IndexError: index 4 is out of bounds for axis 0 with size 4

../../../src/spatialdata/_core/query/relational_query.py:435: IndexError

Process finished with exit code 1
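
The failing line uses the table's instance ids as positional indices into a mask whose length is the (truncated) element, which seems to be why an id of 4 overflows a 4-row element. Below is a minimal sketch of that failure mode with plain pandas and NumPy, plus one possible label-based alternative. The names `element` and `table_instance_ids` and their values are hypothetical stand-ins, not the actual test fixtures, and the label-based mask is only a suggestion, not the fix adopted in spatialdata.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: an element truncated to 4 rows (labels 0..3)
# and a table whose instance ids still reference a label (4) that was cut off.
element = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[0, 1, 2, 3])
table_instance_ids = np.array([0, 2, 4])

# Positional approach, mirroring the failing line in the traceback:
# the ids are treated as positions, so id 4 is out of bounds for a 4-row mask.
mask = np.full(len(element), True, dtype=bool)
try:
    mask[table_instance_ids] = False
    positional_error = None
except IndexError as exc:
    positional_error = str(exc)

# Label-based alternative (an assumption, not the library's implementation):
# keep the element rows whose index label is NOT annotated by the table.
label_mask = ~element.index.isin(table_instance_ids)
left_exclusive = element[label_mask]

print(positional_error)
print(left_exclusive)
```

Matching on index labels also avoids assuming that the element index is a contiguous range starting at 0, which positional indexing silently relies on.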

LucaMarconato, Jan 13 '25 13:01

I would actually use this occasion to simplify the code for the join operations. There is quite a lot of redundant code, which currently makes the join operations difficult to maintain.

LucaMarconato, Feb 06 '25 21:02