
Refactor integration tests to remove random collection sampling

mfisher87 opened this issue 1 year ago · 9 comments

Resolves #215

cc @betolink just getting started on this. I have some sample code that can generate a list of the 100 most popular collections, in order, for a given provider.
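As a rough illustration (not the actual sample code), here is a minimal sketch of that idea using the public CMR collection search, assuming the standard `provider`, `sort_key`, and `page_size` parameters and the Atom JSON response format:

```python
# Minimal sketch, not the actual generate script: ask CMR for a provider's
# collections sorted by usage score (most popular first).
import requests

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def top_collections(provider: str, n: int = 100) -> list[str]:
    """Return concept IDs of the n most popular collections for a provider."""
    response = requests.get(
        CMR_COLLECTIONS_URL,
        params={
            "provider": provider,        # e.g. "NSIDC_ECS"
            "sort_key": "-usage_score",  # descending popularity
            "page_size": n,
        },
        timeout=30,
    )
    response.raise_for_status()
    return [entry["id"] for entry in response.json()["feed"]["entry"]]


print(top_collections("NSIDC_ECS"))
```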

Before I continue, I would like input!

Next steps may be:

  • Refactor things to be less quick-and-dirty. First consideration for me is the mismatch between "DAACs" in the tests and "providers" in the script that generates the popular lists.
  • Modify the tests to read the new source instead of randomly selecting. Tests can use the first N rows from the file, depending on the test; currently not every test uses the same number of collections. We can create a fixture that provides the data in these files as a mapping of DAACs -> list of collections (see the sketch after this list).
  • A "blocklist" of known-bad collections
  • ?
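A minimal sketch of what that fixture could look like, assuming one popular-collections file per provider in a `popular_collections/` directory (the file names, location, and file contents here are hypothetical):

```python
# Minimal sketch, assuming one plain-text file of collection names per
# provider; the real layout and naming may differ.
from pathlib import Path

import pytest

# Hypothetical location of the generated popular-collections files.
POPULAR_COLLECTIONS_DIR = Path(__file__).parent / "popular_collections"


@pytest.fixture(scope="session")
def popular_collections() -> dict[str, list[str]]:
    """Map each provider (e.g. 'NSIDC_ECS') to its ordered list of popular collections."""
    return {
        path.stem: path.read_text().splitlines()
        for path in POPULAR_COLLECTIONS_DIR.glob("*.txt")
    }


def test_download_most_popular(popular_collections):
    # Take the first N entries instead of sampling randomly.
    for collection in popular_collections["NSIDC_ECS"][:3]:
        ...  # exercise download for this collection
```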

mfisher87 avatar Jul 06 '24 18:07 mfisher87

Worked on this with @itcarroll during hack day. Notes: https://github.com/nsidc/earthaccess/discussions/755

mfisher87 avatar Jul 09 '24 17:07 mfisher87

We considered the usefulness of random-sampling tests and don't think we should be doing this for integration tests, especially since they execute on every PR. We could, for example, run them on a cron job and generate reports, but that seems like overkill when we have a community to help us identify datasets and connect with the right support channel if there's an issue with a provider.

We may still consider a cron job for, for example, recalculating the most popular datasets on a monthly basis.

mfisher87 avatar Jul 09 '24 17:07 mfisher87

We decided we can hardcode a small number and expand the list as we go. Other things like random tests on a cron or updating the list of popular datasets on a cron can be addressed separately.

mfisher87 avatar Jul 09 '24 18:07 mfisher87

@betolink will take on work to update generate.py to generate top N collections for all providers.

@mfisher87 will continue working on test_onprem_download.py for just NSIDC_ECS for now to make it use the new source of collections.

mfisher87 avatar Aug 06 '24 18:08 mfisher87

We will update the .txt files to .csv files and add a boolean field for "does the collection have a EULA?", then use that field to mark the affected tests as xfail.
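A minimal sketch of how that could look, assuming hypothetical `concept_id` and `has_eula` columns in a per-provider CSV:

```python
# Minimal sketch: read a per-provider CSV and xfail collections that require
# a EULA. The CSV path and column names ("concept_id", "has_eula") are
# hypothetical.
import csv
from pathlib import Path

import pytest

CSV_PATH = Path(__file__).parent / "popular_collections" / "NSIDC_ECS.csv"


def collection_params():
    """Yield one pytest param per collection, marking EULA-protected ones xfail."""
    with CSV_PATH.open(newline="") as f:
        for row in csv.DictReader(f):
            marks = (
                [pytest.mark.xfail(reason="collection requires a EULA")]
                if row["has_eula"].lower() == "true"
                else []
            )
            yield pytest.param(row["concept_id"], marks=marks)


@pytest.mark.parametrize("concept_id", list(collection_params()))
def test_download(concept_id):
    ...  # exercise download for this collection
```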

mfisher87 avatar Aug 06 '24 18:08 mfisher87

Two major milestones:

  1. @danielfromearth updated the script which generates the top collection lists to use all providers supported by earthaccess :tada: Still TODO: Make them CSVs with a boolean representing whether the collection has a EULA
  2. We just got the test_onprem_download.py module working without randomization! :tada: Still TODO: Refactor the other 3 integration test modules to share this behavior. Let's try to remove duplicate code while we're at it!

Thanks to @deanhenze and @Sherwin-14 for collaborating on this on today's hackathon!

mfisher87 avatar Aug 21 '24 00:08 mfisher87

Also still TODO: Run generate.py in GHA on a monthly/quarterly cron and auto-open a PR with the changes to top collections?

mfisher87 avatar Aug 21 '24 00:08 mfisher87

If we want to determine whether a collection has a EULA, this example was provided:

curl -i -XGET "https://cmr.earthdata.nasa.gov/search/collections.json?concept_id=C1808440897-ASF&pretty=true"

The field "eula_identifiers" : [ "1b454cfb-c298-4072-ae3c-3c133ce810c8" ] is present in the response metadata. We're not 100% sure whether this can be used authoritatively. Discussion in progress: https://nsidc.slack.com/archives/C2LRKMDEV/p1724179804149239
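A minimal sketch of the same check in Python, using the response field shown above (with the same caveat that we're not sure the field is authoritative):

```python
# Minimal sketch: report whether a collection's CMR metadata includes
# "eula_identifiers". This may or may not be an authoritative signal.
import requests


def has_eula(concept_id: str) -> bool:
    response = requests.get(
        "https://cmr.earthdata.nasa.gov/search/collections.json",
        params={"concept_id": concept_id},
        timeout=30,
    )
    response.raise_for_status()
    entries = response.json()["feed"]["entry"]
    return bool(entries) and "eula_identifiers" in entries[0]


print(has_eula("C1808440897-ASF"))  # the example collection from the curl above
```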

mfisher87 avatar Aug 21 '24 16:08 mfisher87

Looks like part of this issue may be related to the work on EULAs tracked in a separate issue.

danielfromearth avatar Oct 29 '24 17:10 danielfromearth

Binder :point_left: Launch a binder notebook on this branch for commit 3a09c11667e9d6293268251d19f3114ace1c7e1c

I will automatically update this comment whenever this PR is modified

github-actions[bot] avatar Nov 26 '24 18:11 github-actions[bot]

For some reason, running

nox -s integration-tests -- --cov=earthaccess --cov-report=term-missing --capture=no --tb=native --log-cli-level=INFO

is failing, when

nox -s integration-tests

is passing.

mfisher87 avatar Nov 27 '24 00:11 mfisher87

Looks like this was a coincidence with intermittent uptime on an external dependency.

mfisher87 avatar Nov 27 '24 00:11 mfisher87

Wow, this is a big one! I can start today but I'm not sure if I can finish today! Great work @mfisher87 !!

betolink avatar Dec 02 '24 16:12 betolink

Thanks for taking a look, @betolink ! There are some opportunities for refactoring, but I really tried to keep the scope narrow in this PR to avoid growing even bigger :)

mfisher87 avatar Dec 02 '24 18:12 mfisher87

Oh no! The tests are failing :rofl:

@betolink NSIDC doing maintenance?

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='n5eil01u.ecs.nsidc.org', port=443): Max retries exceeded with url: /DP5/ATLAS/ATL08.006/2018.10.14/ATL08_20181014070058_02390107_006_02.h5 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fbb54137850>: Failed to establish a new connection: [Errno 111] Connection refused'))

mfisher87 avatar Jan 22 '25 17:01 mfisher87

Oh no! The tests are failing 🤣

@betolink NSIDC doing maintenance?


I think so, it's Wednesday! 😆

betolink avatar Jan 22 '25 17:01 betolink


We're so close to merging this behemoth :rofl:

mfisher87 avatar Jan 22 '25 19:01 mfisher87