Refactor integration tests to remove random collection sampling
Resolves #215
cc @betolink just getting started on this. I have some sample code that can generate a list of 100 most popular collections, in order, given a provider.
Before I continue, I would like input!
Next steps may be:
- Refactor things to be less quick-and-dirty. My first consideration is the mismatch between "DAACs" in the tests and "providers" in the script that generates the popular lists.
- Modify the tests to read the new source instead of randomly selecting. We can use the first N rows from each file, depending on the test; currently not every test uses the same number of collections. We could create a fixture providing the data in these files as a mapping of DAACs -> lists of collections.
- A "blocklist" of known-bad collections
- ?
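As a starting point, the fixture idea above might look something like this. This is only a sketch: the file names, directory layout, and the `DAAC_TO_PROVIDER` mapping are illustrative assumptions, not the actual repo structure.

```python
# Hypothetical sketch of a session-scoped fixture exposing popular-collection
# lists as a mapping of DAAC -> [concept_id, ...].
from pathlib import Path

import pytest

# Assumed reconciliation of test "DAACs" with the script's "providers".
DAAC_TO_PROVIDER = {"NSIDC": "NSIDC_ECS"}


def parse_collection_list(text: str) -> list[str]:
    """Parse one popular-collections file: one concept ID per line, blanks ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


@pytest.fixture(scope="session")
def popular_collections() -> dict[str, list[str]]:
    """Map each DAAC to its list of popular collection concept IDs."""
    data = {}
    for daac, provider in DAAC_TO_PROVIDER.items():
        path = Path("tests/integration") / f"popular_{provider}.txt"
        data[daac] = parse_collection_list(path.read_text())
    return data
```

A test could then take only as many collections as it needs, e.g. `popular_collections["NSIDC"][:3]`.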
Worked on this with @itcarroll during hack day. Notes: https://github.com/nsidc/earthaccess/discussions/755
We considered the usefulness of random sampling tests. We don't think we should be doing this for integration tests, especially when they execute on every PR. We could, for example, run them on a cron job and create reports, but that seems like overkill when we have a community to help us identify datasets and connect with the right support channel if there's an issue with the provider.
We may still consider a cron job for, for example, recalculating the most popular datasets on a monthly basis.
We decided we can hardcode a small number and expand the list as we go. Other things like random tests on a cron or updating the list of popular datasets on a cron can be addressed separately.
@betolink will take on work to update generate.py to generate top N collections for all providers.
@mfisher87 will continue working on test_onprem_download.py for just NSIDC_ECS for now to make it use the new source of collections.
We will convert the .txt files to .csv files and add a boolean field for "does the collection have a EULA?", and then we'll use that field to mark those tests as xfail.
Two major milestones:
- @danielfromearth updated the script which generates the top collection lists to use all providers supported by earthaccess :tada: Still TODO: Make them CSVs with a boolean representing whether the collection has a EULA
- We just got the `test_onprem_download.py` module working without randomization! :tada: Still TODO: Refactor the other 3 integration test modules to share this behavior. Let's try to remove duplicate code while we're at it!
Thanks to @deanhenze and @Sherwin-14 for collaborating on this on today's hackathon!
Also still TODO: Run generate.py in GHA on a monthly/quarterly cron and auto-open a PR with the changes to top collections?
If we want to determine whether a collection has a EULA, this example was provided:
```
curl -i -XGET "https://cmr.earthdata.nasa.gov/search/collections.json?concept_id=C1808440897-ASF&pretty=true"
```
The metadata `"eula_identifiers": ["1b454cfb-c298-4072-ae3c-3c133ce810c8"]` is present in the response. We're not 100% sure whether this can be used authoritatively. Discussion in progress: https://nsidc.slack.com/archives/C2LRKMDEV/p1724179804149239
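If `eula_identifiers` does turn out to be usable, checking it programmatically could look like the sketch below. This mirrors the `curl` example above using only the standard library; it assumes the field appears at the top level of each entry in CMR's Atom JSON response, and again, we haven't confirmed the field is authoritative.

```python
# Sketch: query CMR's collection search and check for eula_identifiers.
import json
import urllib.parse
import urllib.request


def entry_has_eula(entry: dict) -> bool:
    """True if a CMR search entry lists any EULA identifiers."""
    return bool(entry.get("eula_identifiers"))


def collection_has_eula(concept_id: str) -> bool:
    """Query CMR for one collection and check its eula_identifiers field."""
    query = urllib.parse.urlencode({"concept_id": concept_id})
    url = f"https://cmr.earthdata.nasa.gov/search/collections.json?{query}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        entries = json.load(resp)["feed"]["entry"]
    return any(entry_has_eula(e) for e in entries)
```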
Looks like part of this issue may be related to work on EULAs in this issue.
:point_left: Launch a binder notebook on this branch for commit 3a09c11667e9d6293268251d19f3114ace1c7e1c
I will automatically update this comment whenever this PR is modified
For some reason, running

```
nox -s integration-tests -- --cov=earthaccess --cov-report=term-missing --capture=no --tb=native --log-cli-level=INFO
```

is failing, when

```
nox -s integration-tests
```

is passing.
Looks like this was a coincidence with intermittent uptime on an external dependency.
Wow, this is a big one! I can start today but I'm not sure I can finish today. Great work @mfisher87 !!
Thanks for taking a look, @betolink ! There are some opportunities for refactoring, but I really tried to keep the scope narrow in this PR to avoid growing even bigger :)
Oh no! The tests are failing :rofl:
@betolink NSIDC doing maintenance?
```
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='n5eil01u.ecs.nsidc.org', port=443): Max retries exceeded with url: /DP5/ATLAS/ATL08.006/2018.10.14/ATL08_20181014070058_02390107_006_02.h5 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fbb54137850>: Failed to establish a new connection: [Errno 111] Connection refused'))
```
I think so, it's Wednesday! 😆
We're so close to merging this behemoth :rofl: