
Scene.__getitem__ not "greedy" enough

BENR0 opened this issue 3 years ago • 17 comments

Describe the bug

Currently it is possible to retrieve a dataset that is identified only by name using a more specific DataID that has, for example, both name and resolution. The same over-matching is relevant for `DataID in Scene` membership tests and for dataset deletion (see code examples below). ~~A __setitem__ operation with a more specific DataID does not overwrite the dataset with a less specific DataID which is good~~ Turns out that the dataset with the less specific DataID is overwritten (at least with satpy main as of 04.12.2024), but the DataID added to the dependency_tree is wrong.

To Reproduce

import numpy as np
import xarray as xr

from satpy import Scene
from satpy.tests.utils import make_dataid

scene = Scene()
scene['1'] = ds1 = xr.DataArray(np.arange(5))
dataid = make_dataid(name='1', resolution=2000)
print(dataid in scene)  # => True, but should be False

scene[dataid]  # => returns ds1, but should raise KeyError

del scene[dataid]
print(scene)  # => empty, but it should not be


scene['1'] = ds1 = xr.DataArray(np.arange(5))
print(scene._dependency_tree)

scene[dataid] = ds2 = xr.DataArray(np.arange(5, 10))
print(scene._dependency_tree)

Expected behavior

See the comments in the code example above.

Environment Info:

  • OS: Linux
  • Satpy Version: main branch
  • PyResample Version: 1.26.0

Additional context

As far as I can see, this behaviour is due to the get_key method in DatasetDict (https://github.com/pytroll/satpy/blob/cfe4fa964be23950275d68de39f8ffcbb6dd5aba/satpy/dataset/data_dict.py#L142) not being "greedy" enough.
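
The same over-eager match can be reproduced at the DatasetDict level directly, without a Scene. A minimal sketch (behavior as of the affected versions):

import xarray as xr

from satpy.dataset.data_dict import DatasetDict
from satpy.tests.utils import make_dataid

dd = DatasetDict()
dd["1"] = xr.DataArray(())  # stored under a DataID that only has a name
inexact = make_dataid(name="1", resolution=2000)
print(dd.get_key(inexact))  # => DataID(name='1'), but should raise KeyError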

BENR0 avatar Dec 20 '22 13:12 BENR0

Related: https://github.com/pytroll/satpy/pull/1807 and the bug report it references https://github.com/pytroll/satpy/pull/1806

djhoese avatar Dec 03 '24 18:12 djhoese

Here's the current test I've made to work on this:

def test_dataset_dict_contains_inexact_match():
    """Test that DatasetDict does not match inexact keys to existing keys.

    Specifically, check that additional DataID properties aren't ignored
    when querying the DatasetDict.

    See https://github.com/pytroll/satpy/issues/2331.
    """
    import xarray as xr

    from satpy.dataset.data_dict import DatasetDict
    from satpy.tests.utils import make_dataid

    dd = DatasetDict()
    name = "1"
    item = xr.DataArray(())
    dd[name] = item
    exact_id = make_dataid(name=name)
    assert exact_id in dd

    inexact_id = make_dataid(name=name, resolution=2000)
    assert inexact_id not in dd

Let me know if this doesn't match the initial failing case in your example code.

djhoese avatar Dec 03 '24 20:12 djhoese

@BENR0 If we dive deeper into get_key, it reaches a point where it takes all the possible DataIDs in the DatasetDict and filters out those that don't match the DataQuery (your DataID with resolution, in this case). The filtering gets to this point:

https://github.com/pytroll/satpy/blob/86079ed0820f86c061637a87200ff3f6a91a3210/satpy/dataset/dataid.py#L584-L592

where keys_to_check is {"name"} and does NOT include resolution. This is because the contained DataID is just DataID(name="1"), so it thinks there is no point in also searching the other properties. Why it does it this way... I'm not sure. I hope to look later tonight or tomorrow.

djhoese avatar Dec 03 '24 21:12 djhoese

Another way I see of looking at the above linked method is that keys_to_check and the set operations that create it are trying to say "only check keys that the DataQuery has, don't check those that only the DataID has". This makes sense because we wouldn't want to check if the DataID's calibration matched the DataQuery's calibration if the DataQuery didn't have calibration as a parameter. But what these set operations are also doing by accident (I think) is saying "don't check keys that the DataID doesn't have". Maybe that can help me nail down a solution.
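
A plain-Python illustration of that set logic (variable names simplified from the linked _match_dataid code, not copied from it):

dataid_keys = {"name"}                      # keys the stored DataID actually has
query_fields = {"name", "resolution"}       # keys this DataQuery specifies
keys_to_check = dataid_keys & query_fields  # => {"name"}
# "resolution" is silently dropped from the comparison, so the query matches
print(keys_to_check)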

Edit: Another interesting find: the ID keys used when assigning to the Scene with a single string ("1" in your example) only include the "name" and "resolution" keys:

In [5]: scene = Scene()

In [6]: scene['1'] = ds1 = xr.DataArray(np.arange(5))

In [7]: did = list(scene._datasets.keys())[0]

In [8]: did._id_keys
Out[8]: {'name': {'required': True}, 'resolution': {'transitive': True}}

Now why would resolution be among those keys?

Edit 2: Oh those are the minimal default ID keys set in dataid.py.

Edit 3:

Here's another small example of the difficulties of making this work:

  1. When you do scene["1"] = xr.DataArray(..., attrs={}), the underlying DataID that gets created will have the minimal set of ID keys, which are "name" and "resolution".
  2. If you do did = make_dataid(name="1") (which is a testing utility function), the resulting DataID actually includes modifiers=() (see the sketch after this list).
  3. So when we get inside the DatasetDict for did in dataset_dict (same as did in scene), we need to ask "does the DataID, which has no modifiers, have modifiers exactly equal to ()?". The answer is no, and that's where some of the fuzziness comes from about what matches, what doesn't, and which keys to look at.
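
A quick illustration of point 2 (the printed repr is approximate):

from satpy.tests.utils import make_dataid

did = make_dataid(name="1")
print(did)
# => roughly DataID(name='1', modifiers=()) -- "modifiers" shows up
#    even though it was never requested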

djhoese avatar Dec 03 '24 21:12 djhoese

@djhoese Thanks for looking into this. I already suspected this would not be an easy fix. I will have to take a closer look before commenting on your points, since I am not that familiar with that part of the code.

BENR0 avatar Dec 04 '24 08:12 BENR0

No worries. My comments are mostly for my own record keeping.

djhoese avatar Dec 04 '24 11:12 djhoese

@BENR0 @mraspaud Ok I found a test that fails when I make the case in this PR pass. The test is exactly the case we have in this issue and are trying to fix:

        from satpy.dataset.dataid import DataID, DataQuery, minimal_default_keys_config

        dataid_container = [DataID(minimal_default_keys_config,
                                   name="natural_color")]
        dq = DataQuery(name="natural_color", resolution=250)
        assert len(dq.filter_dataids(dataid_container)) == 1

So in this case, why does this data query, which specifies resolution=250, match a DataID that doesn't have resolution specified? There must be some use case here that I'm forgetting.

djhoese avatar Dec 04 '24 15:12 djhoese

Ok this issue is going to just be my dumping ground for things as I learn them. So the current (existing) test that I'm playing with is:

        import xarray as xr

        from satpy import Scene
        from satpy.tests.utils import make_cid

        scene = Scene(filenames=["fake1_1.txt"], reader="fake1")
        comp25 = make_cid(name="comp25", resolution=1000)
        scene[comp25] = xr.DataArray([], attrs={"name": "comp25", "resolution": 1000})
        scene.load(["comp25"], resolution=500)

        loaded_ids = list(scene._datasets.keys())
        assert len(loaded_ids) == 2
        assert loaded_ids[0]["name"] == "comp25"
        assert loaded_ids[0]["resolution"] == 500
        assert loaded_ids[1]["name"] == "comp25"
        assert loaded_ids[1]["resolution"] == 1000

Which, with the below changes:

Subject: [PATCH] Search all DataID keys for query matching
===================================================================
diff --git a/satpy/dataset/dataid.py b/satpy/dataset/dataid.py
--- a/satpy/dataset/dataid.py	(revision 452c1f6fccdaa513458b19890343c2a334b30557)
+++ b/satpy/dataset/dataid.py	(date 1733326086455)
@@ -583,10 +583,11 @@
 
     def _match_dataid(self, dataid):
         """Match the dataid with the current query."""
-        if self._shares_required_keys(dataid):
-            keys_to_check = set(dataid.keys()) & set(self._fields)
-        else:
-            keys_to_check = set(dataid._id_keys.keys()) & set(self._fields)
+        keys_to_check = set(dataid._id_keys.keys()) & set(self._fields)
+        # if self._shares_required_keys(dataid):
+        #     keys_to_check = set(dataid.keys()) & set(self._fields)
+        # else:
+        #     keys_to_check = set(dataid._id_keys.keys()) & set(self._fields)
         if not keys_to_check:
             return False
         return all(self._match_query_value(key, dataid.get(key)) for key in keys_to_check)

Fails on the scene.load because, from what I can tell, the loading tries to say "of the composites we loaded from the YAML, find DataID(name='comp25', resolution=500)". This request makes sense because we gave the global request for resolution=500, but in this patched version of Satpy it fails because that DataID does not exist in the DatasetDict of composites we know about. We of course know what "comp25" is, but we never try to search for that because we were given resolution=500.

So... is it correct for Satpy to assume that composites can and will be made at the correct resolution by the proper dependencies being loaded, and that it is fine to drop all non-required ID keys/parameters when searching for a composite? I think from a generic DatasetDict sense, the current Satpy implementation is the wrong approach. That is, you ask "is X in dataid_container" and get a positive result even though X does not exist in that container. I'll see if I can come up with a reasonable solution.

djhoese avatar Dec 04 '24 18:12 djhoese

Man, this is really turning into a Satpy 1.0 type of thing. These are the biggest competing functionalities currently:

  1. Compositors (the class that creates a composite) are typically only referred to by "name", but they have the ability to modify that identifying information in their __init__ and via whatever the user specifies in the YAML. It ends up getting used in the CompositeBase.id property, which creates a DataID used by the dependency tree (see the sketch after this list). Note I don't know how often a compositor modifies its DataID in __init__ in the current builtin composite definitions.
  2. Passing query parameters to Scene.load. For example scene.load(["some_composite"], resolution=500). Should the compositor instance "have" resolution=500? Or should only the DataArray it produces have it, based on the inputs?
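
A simplified sketch of point 1 (this FakeCompositor is made up for illustration; the real logic lives in satpy.composites.CompositeBase):

from satpy.dataset.dataid import DataID, minimal_default_keys_config

class FakeCompositor:
    def __init__(self, name, **kwargs):
        # __init__ (and the YAML recipe) may add or change identifying info
        self.attrs = {"name": name, **kwargs}

    @property
    def id(self):
        # the dependency tree identifies the compositor by a DataID built
        # from these attrs, keeping only recognized ID keys
        id_attrs = {key: val for key, val in self.attrs.items()
                    if key in minimal_default_keys_config}
        return DataID(minimal_default_keys_config, **id_attrs)

print(FakeCompositor("natural_color").id)  # identified by "name" alone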

Possible (partial) solutions rolling around in my brain:

  1. Don't instantiate compositor classes until they are used. For example, wrapping them in some wrapper class so the .id stuff is easier to manipulate before the compositor instance needs it.
  2. Modify how compositors are identified in the dependency tree so the dep tree ID for a compositor might differ from compositor.id. This is already true for prereqs and optional prereqs which have separate instances in the dependency tree nodes.
  3. Allow copying of a Compositor class...no...I don't like this.

Edit: Turns out this idea of not creating the compositor instance until it is needed/used is actually how modifiers work. The dictionary of modifiers loaded from YAML stores tuples of (modifier_class, modifier_options).
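
A hypothetical illustration of that deferred pattern (all names made up):

class DummyModifier:
    def __init__(self, atmosphere="us-standard"):
        self.atmosphere = atmosphere

# as loaded from YAML: (class, options) pairs, no instances yet
loaded_modifiers = {"rayleigh_corrected": (DummyModifier, {"atmosphere": "us-standard"})}

mod_cls, mod_opts = loaded_modifiers["rayleigh_corrected"]
modifier = mod_cls(**mod_opts)  # instantiated only when actually used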

djhoese avatar Dec 04 '24 19:12 djhoese

Ugh, this stuff is all terrible. So I started working on the solution where compositors/composites would carry all of the query information in their identifiers. So if you asked for resolution=500, then the compositor would be identified in the dependency tree by DataID(name="comp", resolution=500). However, this solution only worked because resolution is in the minimal ID keys (which are name and resolution only). When it comes to other properties like calibration=, my solution ignores them because they aren't part of the compositor's satpy ID keys (the minimal ID keys: name and resolution). This then fails later because it still tries to load the version of the ID with the calibration=. Wait... can I change that? Let's see...

Edit: Actually, hold up, I was wrong about why my solution wasn't working I think. The main issue is that DataID(name="comp3") == DataQuery(name="comp3", calibration="reflectance").

Edit 2: @mraspaud I'm not sure how I feel about DataQuery equality only checking shared keys. At the least, it should be stricter for DataIDs, shouldn't it?

https://github.com/pytroll/satpy/blob/86079ed0820f86c061637a87200ff3f6a91a3210/satpy/dataset/dataid.py#L515-L532
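
A quick check of that loose equality (reflecting the behavior at the time of this discussion):

from satpy.dataset.dataid import DataID, DataQuery, minimal_default_keys_config

did = DataID(minimal_default_keys_config, name="comp3")
dq = DataQuery(name="comp3", calibration="reflectance")
print(dq == did)  # => True: only the shared key ("name") is compared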

djhoese avatar Dec 05 '24 02:12 djhoese

1. Compositors (the class that creates a composite) are typically only referred to by "name", but have the ability to modify that information in their `__init__` and by whatever the user specifies in the YAML. It ends up getting used in the `CompositeBase.id` property which creates a `DataID` used by the dependency tree. Note I don't know how often a compositor modifies its DataID in `__init__` in the current builtin composite definitions.

I think #2333 is also somehow connected to this point. While it is a different use case, it also relates to how/when `DataID`s are generated.

I probably don't have enough of an overview of this whole thing, especially of how the loading mechanisms and the DataID and DependencyTree mechanisms work together, but I think compositors or modifiers shouldn't be in charge of generating/changing DataIDs themselves. Instead, I think they should put appropriate metadata into the DataArrays they return, and Scene.__setitem__ should handle the generation of the DataID from the attributes. That way the generation would happen in a central place (roughly as sketched below).
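
Roughly like this (a sketch of the idea; _id_from_attrs is a hypothetical helper, not existing satpy code):

from satpy.tests.utils import make_dataid

def _id_from_attrs(data_arr):
    """Build the DataID in one central place from the DataArray's metadata."""
    id_kwargs = {key: data_arr.attrs[key]
                 for key in ("name", "resolution", "calibration", "modifiers")
                 if key in data_arr.attrs}
    return make_dataid(**id_kwargs)

# Scene.__setitem__ would then do:
#     self._datasets[_id_from_attrs(value)] = value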

BENR0 avatar Dec 05 '24 09:12 BENR0

Ok this issue is going to just be my dumping ground for things as I learn them. So the current (existing) test that I'm playing with is:

        scene = Scene(filenames=["fake1_1.txt"], reader="fake1")
        comp25 = make_cid(name="comp25", resolution=1000)
        scene[comp25] = xr.DataArray([], attrs={"name": "comp25", "resolution": 1000})
        scene.load(["comp25"], resolution=500)

        loaded_ids = list(scene._datasets.keys())
        assert len(loaded_ids) == 2
        assert loaded_ids[0]["name"] == "comp25"
        assert loaded_ids[0]["resolution"] == 500
        assert loaded_ids[1]["name"] == "comp25"
        assert loaded_ids[1]["resolution"] == 1000

For the DataID matching when looking for datasets in the Scene, I think it is necessary to think about what the different scenarios should do (in #2332 I added a couple of tests for different cases showing how I think it should behave, but I think some cases are still missing).

Assuming:

Scene:
|-DataID(name=1)
+-DataID(name=1, resolution=1000)

I think Scene[DataID(name=1, resolution=1000)] should clearly get the second dataset.

In the case

Scene:
+-DataID(name=1)

Scene[DataID(name=1, resolution=1000)] should give a KeyError

In the case

Scene:
+-DataID(name=1, resolution=1000)

What should Scene[DataID(name=1)] return? Also a KeyError or should the dataset be returned because it is the only one with that name?

And in the case

Scene:
|-DataID(name=1)
+-DataID(name=1, resolution=1000)

What should Scene[DataID(name=1)] return? Both datasets or only the first one with the exact match?
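
One way to pin these cases down would be tests along these lines (a hedged sketch: they encode the behavior I am proposing above, which is partly not what satpy currently does):

import numpy as np
import pytest
import xarray as xr

from satpy import Scene
from satpy.tests.utils import make_dataid

def test_specific_id_gets_specific_dataset():
    # first case above: a bare-name and a resolution-specific dataset coexist
    scene = Scene()
    scene["1"] = xr.DataArray(np.arange(5))
    specific = make_dataid(name="1", resolution=1000)
    scene[specific] = expected = xr.DataArray(np.arange(5, 10))
    assert scene[specific] is expected  # clearly the second dataset

def test_more_specific_id_raises():
    # second case above: only the bare-name dataset is stored
    scene = Scene()
    scene["1"] = xr.DataArray(np.arange(5))
    with pytest.raises(KeyError):
        scene[make_dataid(name="1", resolution=1000)]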

BENR0 avatar Dec 05 '24 09:12 BENR0

Also, I have been thinking about being able to get multiple datasets with a list, like it is possible in xarray. That is not directly related to the problem at hand, but it might be worth thinking about while working on a solution for this.

BENR0 avatar Dec 05 '24 09:12 BENR0

Instead I think they should put appropriate metadata into the DataArrays they return and Scene.__setitem__ should handle the generation of the DataID from the attributes. That way the generation would be in a central place.

@BENR0 you mentioned this in #2333, and while I agree this would be nice, I don't think it can happen given the tight coupling of .attrs["_satpy_id"] and the rest of the metadata in .attrs. Additionally, we can't depend on the Scene or any __setitem__ stuff to update/use DataIDs, as compositor classes can be used outside of the Scene ("manual" composite generation).

@mraspaud What do you think about an xarray accessor for data_arr.satpy.id? Or data_arr.satpy.data_id? Then _satpy_id would only be expected in .attrs during CF NetCDF writing or reading, but only as a temporary measure.
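
A rough sketch of what such an accessor could look like (illustrative only; as it turns out further down, this approach does not work because the DataID id_keys have to persist somewhere):

import xarray as xr

@xr.register_dataarray_accessor("satpy")
class SatpyAccessor:
    def __init__(self, data_arr):
        self._data_arr = data_arr

    @property
    def id(self):
        # state on the accessor object itself does not survive between
        # accesses, so this would still have to read from .attrs
        return self._data_arr.attrs.get("_satpy_id")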

In the case

Scene:
+-DataID(name=1, resolution=1000)

What should Scene[DataID(name=1)] return? Also a KeyError or should the dataset be returned because it is the only one with that name?

@BENR0 Be careful of your use of DataID versus DataQuery. In the above case I'd say this should be a KeyError, only because it is a DataID being passed to __getitem__. If it were a DataQuery then it would return the single DataArray that's in there (see the sketch below). In the latter case you defined, with a non-resolution "1" and a resolution=1000 "1", a DataQuery(name="1") would return whichever one follows some defined rules... but I think currently this is not defined. There are rules defined for DataIDs of different resolutions/calibrations/etc and vague queries (ex. always the finer/better resolution), but I'm not sure those explicitly say what would happen in the undefined versus defined case.
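
A hypothetical sketch of that DataID-strict / DataQuery-fuzzy split (not current satpy behavior):

from satpy.dataset.dataid import DataID, DataQuery

def strict_getitem(datasets, key):
    if isinstance(key, DataID):
        return datasets[key]  # exact equality only; KeyError otherwise
    if isinstance(key, DataQuery):
        matches = key.filter_dataids(list(datasets.keys()))
        if not matches:
            raise KeyError(key)
        return datasets[matches[0]]  # defined "finer/better" rules would pick here
    raise TypeError(f"Expected DataID or DataQuery, got {type(key)}")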

The more I think about this the more I think I'm going to have to change the things that "smell", see what breaks, and then fix things as I go. The alternative is to define all the use cases that we know we want to support (possibly a lot) and re-write tests for them and make them work and then go back and fix or remove tests that fail. I think I can do the former one piece at a time for the most part. The latter seems cleaner, but also much harder to get right given all the historic code here.

@mraspaud I also think I need to start adding type annotations since there are a lot of places in the dependency tree where it is just "yeah, it's a thing that identifies what we want" but isn't clearly a DataID or DataQuery or dict.

djhoese avatar Dec 05 '24 16:12 djhoese

@mraspaud What do you think about an xarray accessor for data_arr.satpy.id? Or data_arr.satpy.data_id? Then _satpy_id would only be expected in .attrs during CF NetCDF writing or reading, but only as a temporary measure.

That works for me, I’ve been wanting to have an accessor for a while now anyway…

The more I think about this the more I think I'm going to have to change the things that "smell", see what breaks, and then fix things as I go. The alternative is to define all the use cases that we know we want to support (possibly a lot) and re-write tests for them and make them work and then go back and fix or remove tests that fail. I think I can do the former one piece at a time for the most part. The latter seems cleaner, but also much harder to get right given all the historic code here.

How about a middle ground: while removing smells, see what tests they apply to and refactor them into clear use cases?

@mraspaud I also think I need to start adding type annotations since there are a lot of places in the dependency tree where it is just "yeah, it's a thing that identifies what we want" but isn't clearly a DataID or DataQuery or dict.

That makes sense if it helps clarify the code.

mraspaud avatar Dec 06 '24 08:12 mraspaud

So the accessor turns out to be impossible because the DataID id_keys need to stick around. That information would have to be kept somewhere, and it can't live in the ephemeral accessor object, so it would have to be in .attrs.

And yeah, I like that middle ground. I think I was initially concerned about it because the first smell I tried to remove revealed a lot of other code smells. I'll try to remove some of the minor ones first and see what kind of test failures I get.

djhoese avatar Dec 07 '24 16:12 djhoese

I think the main thing I'm finding that I disagree with in the current implementation is the general idea of a DataQuery and a DataID being equal if their shared keys are equal. But that's not what a user means when they use a DataQuery, and it is not very useful otherwise. If I specify something in my DataQuery, then any matching DataIDs had better have that key, and the value should be equal (the exception being the special "*").
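
For the record, the rule I have in mind is roughly this (a hypothetical predicate, not satpy code):

def query_matches(query_fields, dataid):
    """Every key the query specifies must exist on the DataID and match (or be '*')."""
    return all(
        key in dataid and (value == "*" or dataid[key] == value)
        for key, value in query_fields.items()
    )

print(query_matches({"name": "1", "resolution": 2000}, {"name": "1"}))  # => False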

djhoese avatar Dec 09 '24 19:12 djhoese