earthaccess The `cloud_hosted` flag for granule queries doesn't work

As discovered in discussion of #563, using the cloud_hosted parameter for a granule query does not work.

This reproduces the problem:

import earthaccess

results = earthaccess.search_data(
    short_name="VIIRSJ1_L2_OC",
    version="R2022.0",
    cloud_hosted=True,
    temporal=("2024-02-27 00:00:00", "2024-02-27 23:59:00"),
    count=10,
    bounding_box=(-180, 0, 0, 90),
)

The specified collection is not cloud hosted, so the query should return an empty list of results, but instead returns a non-empty list of results.

Alternatively, instead of returning an empty list of results, we could raise an exception. If we take this route, we would need to decide whether to use a built-in type, such as ValueError or TypeError, or define a custom exception.
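One way to avoid choosing between a built-in and a custom type is a custom exception that subclasses a built-in one, so callers can catch either. The sketch below is illustrative only: `CloudHostedMismatchError`, `filter_or_raise`, and the `strict` flag are hypothetical names, not part of earthaccess.

```python
class CloudHostedMismatchError(ValueError):
    """Hypothetical error: no collection matched the requested cloud_hosted value."""


def filter_or_raise(matching_collections, granules, strict=False):
    """Return granules only when a matching collection exists.

    matching_collections: collections that already satisfy cloud_hosted.
    granules: granule results found for the same query.
    strict: if True, raise instead of returning an empty list.
    """
    if matching_collections:
        return list(granules)
    if strict:
        raise CloudHostedMismatchError(
            "no collection matches the requested cloud_hosted value"
        )
    return []
```

With `strict=False` this gives the empty-list behavior described above; with `strict=True`, existing `except ValueError` handlers would still catch the error.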

May 10 '24 23:05 chuckwondo

Another option would be to eliminate the cloud_hosted parameter from granule queries, particularly given that it is not actually directly supported by the underlying CMR Search API. Only collection queries support it. Thus, this parameter requires us to make an implicit collection query under the covers, prior to submitting the granule search (if there is a collection with a cloud_hosted value matching the parameter value).

By eliminating the parameter, it is up to the user to either know whether or not the collection is cloud hosted, or to issue a separate collection query first to determine whether or not it is cloud hosted. Given that we would need to make such a collection query under the covers anyway, if we keep the cloud_hosted parameter for granule queries, there would be no difference in performance. In fact, by not implicitly performing the collection query, the user is able to avoid the extra query, if they already know whether or not the collection is cloud hosted. Further, being explicit over implicit is the 2nd principle of The Zen of Python, so it is worth considering.
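A rough sketch of the explicit two-step approach, for comparison. The helper name `search_granules_if_cloud_hosted` is hypothetical, the two search functions are passed in rather than imported, and it assumes collection results expose their concept ID under `["meta"]["concept-id"]`.

```python
def search_granules_if_cloud_hosted(
    search_datasets, search_data, collection_query, granule_query
):
    """Run a collection query first, then a granule query against its concept ID.

    search_datasets / search_data: callables taking keyword arguments
    (assumed to behave like the earthaccess functions of the same name).
    collection_query: collection-level filters, e.g. short_name, version, doi.
    granule_query: granule-level filters, e.g. temporal, bounding_box, count.
    """
    # Step 1: ask only for cloud-hosted collections.
    collections = search_datasets(cloud_hosted=True, **collection_query)
    if not collections:
        # No cloud-hosted collection matched, so skip the granule query.
        return []

    # Step 2: pin the granule query to the first matching collection.
    concept_id = collections[0]["meta"]["concept-id"]
    return search_data(concept_id=concept_id, **granule_query)
```

In real use, `earthaccess.search_datasets` and `earthaccess.search_data` would be passed as the first two arguments. A user who already knows the collection is cloud hosted can skip step 1 entirely, which is the saving described above.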

May 10 '24 23:05 chuckwondo

Thanks for framing this problem @chuckwondo, I'm inclined to retain the cloud_hosted parameter at the granule level in order to save our users the extra query. Likewise, there is no DOI parameter at the granule level and (anecdotally) this is one of the most useful features in the search_data method according to users.

May 21 '24 03:05 betolink

@chuckwondo I believe this example aligns with this particular problem:

results = earthaccess.search_data(
    doi='10.5067/ATLAS/ATL15.004',
    bounding_box=(180, 60, -180, 90),  # (lower_left_lon, lower_left_lat, upper_right_lon, upper_right_lat)
    cloud_hosted=True,
)

This is an example in a CryoCloud book tutorial by @mrsiegfried and @wsauthoff which returns files back from the on-prem copy of the data, with the same DOI. If you do the same search in Earthdata Search and set their "Available in Earthdata Cloud" filter, the correct cloud-hosted collection is returned.

This is especially problematic for DAACs including NSIDC who are still migrating to Earthdata Cloud and have both on-prem and cloud-hosted collections available.

This may be another good use of a decision committee per #761. I'm also inclined to retain these parameters in search_data() for simplicity, as long as the behavior is properly documented and we ensure that it does the right thing by applying the CMR cloud_hosted collection filter before the granule filter.

Nov 01 '24 14:11 asteiker

Interestingly, this behavior is only problematic if search_data() is given a DOI instead of a short_name. For example:

results = earthaccess.search_data(
    short_name="ATL06",
    # doi="10.5067/ATLAS/ATL06.006",
    cloud_hosted=True,
    temporal=("2023-02-01 00:00:00", "2024-02-27 23:59:00"),
    bounding_box=(10, 0, 20, 90),
    count=1,
)

This will return the correct cloud hosted granule results. If you swap to DOI instead, it will return the ECS (on-prem)-hosted file.

Nov 01 '24 15:11 asteiker

I think the problem is that DOI is only searchable at the collection level (and not all granules have a DOI), so internally search_data uses DataCollections to get the concept_id.

However, this call to DataCollections does not know if cloud_hosted has been set or not. It just blindly grabs the concept_id for the first collection returned.

I think this should be a separate issue.

One solution would be to add .cloud_hosted(self.cloud_hosted) to L923

result = earthaccess.search.DataCollections().doi('10.5067/ATLAS/ATL06.006').cloud_hosted(False).get()
result[0]["meta"]["s3-links"]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[50], line 2
      1 result = earthaccess.search.DataCollections().doi('10.5067/ATLAS/ATL06.006').get()
----> 2 result[0]["meta"]["s3-links"]

KeyError: 's3-links'

result = earthaccess.search.DataCollections().doi('10.5067/ATLAS/ATL06.006').cloud_hosted(True).get()
result[0]["meta"]["s3-links"]

['nsidc-cumulus-prod-protected/ATLAS/ATL06/006',
 'nsidc-cumulus-prod-public/ATLAS/ATL06/006']
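
The proposed fix could look roughly like the sketch below. `resolve_concept_id` is a hypothetical helper, not the actual earthaccess code; it takes any query object with chainable `doi()` and `cloud_hosted()` methods and a `get()` method, such as `earthaccess.search.DataCollections()`, and assumes the concept ID lives under `["meta"]["concept-id"]`.

```python
def resolve_concept_id(query, doi, cloud_hosted=None):
    """Resolve a DOI to a collection concept ID, honoring cloud_hosted.

    query: a DataCollections-like object (assumed chainable interface).
    doi: the collection DOI to look up.
    cloud_hosted: True/False to filter collections, or None to leave unfiltered.
    """
    query = query.doi(doi)
    if cloud_hosted is not None:
        # Forward the caller's flag instead of taking whichever copy comes first.
        query = query.cloud_hosted(cloud_hosted)

    results = query.get()
    if not results:
        return None
    return results[0]["meta"]["concept-id"]
```

Leaving the filter off when `cloud_hosted` is `None` keeps the current behavior for callers who never set the flag.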

Nov 05 '24 04:11 andypbarrett