earthaccess icon indicating copy to clipboard operation
earthaccess copied to clipboard

Added parameters and exception behaviour to pqdm

Open Sherwin-14 opened this issue 1 year ago • 27 comments

I did all the steps outlined in the issue #581, we can now build upon this.


📚 Documentation preview 📚: https://earthaccess--792.org.readthedocs.build/en/792/

Sherwin-14 avatar Aug 24 '24 17:08 Sherwin-14

@mfisher87, I see lots of Ruff error messages regarding a number of files that @Sherwin-14 has not touched in this PR. Is there something already on main to address these errors, such that Sherwin simply needs to merge main into his own branch to silence them?

chuckwondo avatar Aug 26 '24 13:08 chuckwondo

Not yet, but that should be an easy fix. I'll take care of it now :) https://github.com/nsidc/earthaccess/issues/788

mfisher87 avatar Aug 26 '24 16:08 mfisher87

https://github.com/nsidc/earthaccess/pull/793

mfisher87 avatar Aug 26 '24 16:08 mfisher87

Not yet, but that should be an easy fix. I'll take care of it now :) #788 #793

Excellent! Thank you for the quick fix.

@Sherwin-14, Matt's fix is now merged into main, so please update your branch with the fix from main, which will eliminate the ruff errors in your build.

chuckwondo avatar Aug 26 '24 16:08 chuckwondo

@Sherwin-14, do you have enough information to proceed with the changes I requested?

chuckwondo avatar Sep 03 '24 10:09 chuckwondo

@Sherwin-14, do you have enough information to proceed with the changes I requested?

Yeah I have the info I needed, I had put this on hold since I was preoccupied with some other things for the last couple of days. I'll resume my work on this asap.

Sherwin-14 avatar Sep 03 '24 11:09 Sherwin-14

Would it be possible and in scope of this PR to use **kwargs and pass them to pqdm for both open() and download() with sane defaults? Users could be in control of the failing behavior but also the number of threads to use and the verbosity (don't display the progress bar). I'm not sure about the style, in Xarray they use **kwargs and an equivalent backend_kwargs={} that in our case could be something like pqdm_opts={}

# code in api.py and something similar in store.py
# in this example we leave threads out to ensure backwards compatibility and will use it in the store class if present
def open(granules, threads, provider, pqdm_opts):
    default_pqdm_opts = {
      n_jobs=threads,
      exception_behaviour='immediate'
      colour="green",
      disable=False # True means no output including the progress bar.
    }

    if not pqdm_opts;
        pqdm_opts = default_pqdm_opts
    earthaccess.__store__.open(granules, treads, pqdm_opts)

and actually we should do the same with fsspec so we let users tune their caching behavior when opening the files, aka a "smart_open". What do you all think?

betolink avatar Sep 03 '24 15:09 betolink

I had put this on hold since I was preoccupied with some other things for the last couple of days. I'll resume my work on this asap.

@Sherwin-14 No rush! We really appreciate any time you can spare, and don't want to ask for more than that :)

mfisher87 avatar Sep 03 '24 16:09 mfisher87

@betolink I really like that idea. Basically "Allow customizing all pqdm behaviors" (and separately, "Allow customizing all fsspec behaviors")?

I'm not sure how I feel about including that as part of this PR. I lean towards separate out of respect for @Sherwin-14 's time, but I don't want to speak for you, Sherwin!

mfisher87 avatar Sep 03 '24 16:09 mfisher87

We decided during hack day today that we want to allow users to customize pqdm behaviors like so:

def open(something, threads=16, *, pqdm_kwargs={...}, ...):
    pqdm_kwargs_defaults = {"exception_behavior": "immediate", "threads": threads}
    pqdm_kwargs = pqdm_kwargs_defaults.update(pqdm_kwargs)
    ...

We considered the threads argument as fundamental to the function of earthaccess, so we think it should remain a top-level parameter. We'll use threads whether we use pqdm or not in 5 years.

mfisher87 avatar Sep 03 '24 17:09 mfisher87

We decided during hack day today that we want to allow users to customize pqdm behaviors like so:

def open(something, threads=16, *, pqdm_kwargs={...}, ...):
    pqdm_kwargs_defaults = {"exception_behavior": "immediate", "threads": threads}
    pqdm_kwargs = pqdm_kwargs_defaults.update(pqdm_kwargs)
    ...

We considered the threads argument as fundamental to the function of earthaccess, so we think it should remain a top-level parameter. We'll use threads whether we use pqdm or not in 5 years.

How should we now proceed with the changes we discussed during the hackday?

Sherwin-14 avatar Sep 05 '24 10:09 Sherwin-14

We decided during hack day today that we want to allow users to customize pqdm behaviors like so:

def open(something, threads=16, *, pqdm_kwargs={...}, ...):
    pqdm_kwargs_defaults = {"exception_behavior": "immediate", "threads": threads}
    pqdm_kwargs = pqdm_kwargs_defaults.update(pqdm_kwargs)
    ...

We considered the threads argument as fundamental to the function of earthaccess, so we think it should remain a top-level parameter. We'll use threads whether we use pqdm or not in 5 years.

How should we now proceed with the changes we discussed during the hackday?

Do the same thing you did in the PR you already started for this, but instead of fail_fast (which is what you pretty much completed) change it to pqdm_kwargs and adjust your code changes similar to what is shown above, although the code above won't work because pqdm has an n_jobs arg, not a threads arg, so I suggest something like so, instead, within the functions/methods that invoke pqdm:

pqdm_kwargs = {
    "exception_behavior": "immediate",
    "n_jobs": threads,
    **pqdm_kwargs,
}

...

pqdm(<other args>, **pqdm_kwargs)

chuckwondo avatar Sep 05 '24 13:09 chuckwondo

Great point! Much clearer with the double splat too ;)

mfisher87 avatar Sep 05 '24 23:09 mfisher87

@Sherwin-14, do you need some help with this?

chuckwondo avatar Sep 14 '24 12:09 chuckwondo

@Sherwin-14, do you need some help with this?

I will try to accomodate the change asap, I was caught up in something else up untill now

Sherwin-14 avatar Sep 14 '24 16:09 Sherwin-14

@Sherwin-14, do you need some help with this?

I will try to accomodate the change asap, I was caught up in something else up untill now

No rush. I just wanted to check in to see if you need any help or if something else was requiring you to postpone your continuation.

chuckwondo avatar Sep 14 '24 16:09 chuckwondo

@chuckwondo

def download(
    granules: Union[DataGranule, List[DataGranule], str, List[str]],
    local_path: Optional[str],
    provider: Optional[str] = None,
    threads: int = 8,
    pqdm_kwargs: dict = {}
) -> List[str]:
    """Retrieves data granules from a remote storage system.

       * If we run this in the cloud, we will be using S3 to move data to `local_path`.
       * If we run it outside AWS (us-west-2 region) and the dataset is cloud hosted,
            we'll use HTTP links.

    Parameters:
        granules: a granule, list of granules, a granule link (HTTP), or a list of granule links (HTTP)
        local_path: local directory to store the remote data granules
        provider: if we download a list of URLs, we need to specify the provider.
        threads: parallel number of threads to use to download the files, adjust as necessary, default = 8

    Returns:
        List of downloaded files

    Raises:
        Exception: A file download failed.
    """
    provider = _normalize_location(provider)
    pqdm_kwargs = {
        "exception_behavior": "immediate",
        "n_jobs": threads,
        **pqdm_kwargs,
    }
    if isinstance(granules, DataGranule):
        granules = [granules]
    elif isinstance(granules, str):
        granules = [granules]
    try:
        results = earthaccess.__store__.get(
            granules, local_path, provider, threads, pqdm_kwargs
        )    
    except AttributeError as err:
        logger.error(
            f"{err}: You must call earthaccess.login() before you can download data"
        )
        return []
    return results

Just for clarification, the public API function download would look something like this. The pqdm_kwargs would be taken in at the user interface and then passed on to the lower level functions in the entire call chain isn't it? or am I missing something?

Sherwin-14 avatar Sep 28 '24 10:09 Sherwin-14

Correct, but you almost never want to use mutable values for default values, and this is one of those cases (among other references, see https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments).

So, do NOT do this:

def download(
    granules: Union[DataGranule, List[DataGranule], str, List[str]],
    local_path: Optional[str],
    provider: Optional[str] = None,
    threads: int = 8,
    pqdm_kwargs: dict = {}
) -> List[str]:
    pqdm_kwargs = {
        "exception_behavior": "immediate",
        "n_jobs": threads,
        **pqdm_kwargs,
    }

Instead, DO THIS:

def download(
    granules: Union[DataGranule, List[DataGranule], str, List[str]],
    local_path: Optional[str],
    provider: Optional[str] = None,
    threads: int = 8,
    pqdm_kwargs: Optional[Mapping[str, Any]] = None,
) -> List[str]:
    pqdm_kwargs = {
        "exception_behavior": "immediate",
        "n_jobs": threads,
        **(pqdm_kwargs or {}),
    }

chuckwondo avatar Sep 28 '24 13:09 chuckwondo

:warning: Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 50.00000% with 27 lines in your changes missing coverage. Please review.

Project coverage is 68.78%. Comparing base (2f49c08) to head (1061743). Report is 69 commits behind head on main.

Files with missing lines Patch % Lines
earthaccess/store.py 0.00% 22 Missing :warning:
earthaccess/api.py 37.50% 5 Missing :warning:

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

:exclamation: There is a different number of reports uploaded between BASE (2f49c08) and HEAD (1061743). Click for more details.

HEAD has 6 uploads less than BASE
Flag BASE (2f49c08) HEAD (1061743)
14 8
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #792      +/-   ##
==========================================
- Coverage   73.88%   68.78%   -5.11%     
==========================================
  Files          31       32       +1     
  Lines        2003     2153     +150     
==========================================
+ Hits         1480     1481       +1     
- Misses        523      672     +149     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Sep 29 '24 14:09 codecov-commenter

@Sherwin-14, several other PRs have landed on main, so please also be sure to merge the changes from main and resolve conflicts.

chuckwondo avatar Oct 06 '24 14:10 chuckwondo

@chuckwondo Hey Chuck, I have made the appropriate changes but I haven't pushed the code yet since I was writing unittest for this one which you suggested earlier. I will re-request for a review once I feel this is ready to be reviewed

Sherwin-14 avatar Oct 06 '24 14:10 Sherwin-14

@chuckwondo Hey Chuck, I have made the appropriate changes but I haven't pushed the code yet since I was writing unittest for this one which you suggested earlier. I will re-request for a review once I feel this is ready to be reviewed

Sounds good, but please wait to mark any suggestions as resolved until you push the corresponding code changes, otherwise it is confusing to others.

chuckwondo avatar Oct 06 '24 15:10 chuckwondo

class TestEarthAccessDownload(unittest.TestCase):
    def test_download(self):

        earthaccess.login()

        results = earthaccess.search_data(
            short_name='ATL06',  # ATLAS/ICESat-2 L3A Land Ice Height
            bounding_box=(-10, 20, 10, 50),  # Only include files in area of interest...
            temporal=("1999-02", "2019-03"),  # ...and time period of interest.
            count=10
        )


        with patch.object(earthaccess, '__store__', return_value= Mock()) as mock_store:
            with patch.object(mock_store, 'get') as mock_get:
    
                mock_get.side_effect = Exception('Download failed')
    
     
                files = earthaccess.download(results, "/home/download-folder")
    
                self.assertEqual(files, [])

if __name__ == '__main__':
    unittest.main()

This is the test function I came up with and this is giving me an error log that looks something like this. Is this the thing we want to achieve at this point? Also if the test is correct then where should I place it should I make a completely new file for this like test_api.py?

Sherwin-14 avatar Oct 27 '24 07:10 Sherwin-14

Hi @Sherwin-14. Thanks for working on the unit tests for this PR.

The first thing I suggest you do is merge the latest from main into your branch, so you're working with the latest code.

Regarding the test you've written, it looks like you're generally heading in the right direction.

I have some suggestions:

  • Sure, if you want to put your new tests in a new tests/unit/test_api.py file, that sounds good to me.
  • Do not use the unittest library. There's no need for the additional "overhead". Simply use top-level test functions and built-in assert statements. We have almost completely moved away from unittest in our integration tests, and we have a mix in our unit tests, but we should continue moving away from unittest (although I suspect this may be up for debate).
  • Regarding your test shown above, that looks good, but since the new behavior is for the call to download to fail immediately upon any single download failure (i.e., pqdm exception_behavior="immediate"), then you will want to put your call to download in a context manager, as by default we expect an exception to be raised (as shown in your linked test failure output), thus we do not expect to be able to assert that we get back an empty list of results:
        with patch.object(earthaccess, '__store__', return_value= Mock()) as mock_store:
            with patch.object(mock_store, 'get') as mock_get:
                mock_get.side_effect = Exception('Download failed')

                with pytest.raises(Exception, match="Download failed"):
                    earthaccess.download(results, "/home/download-folder")

chuckwondo avatar Oct 27 '24 14:10 chuckwondo

I have taken out unittest but it seems that it is necessary to use unittest for importing the likes of Mock. Another option here would be to use the pytest mock library but that means we will need to add an additional dependency.

Sherwin-14 avatar Oct 28 '24 13:10 Sherwin-14

I have taken out unittest but it seems that it is necessary to use unittest for importing the likes of Mock. Another option here would be to use the pytest mock library but that means we will need to add an additional dependency.

You can still use unittest Mocks, but there's no need for creating a subclass of unittest.TestCase.

chuckwondo avatar Oct 28 '24 14:10 chuckwondo

Binder :point_left: Launch a binder notebook on this branch for commit d22f2acfb774c4875da4ec0d1620b9da7fdf2dd6

I will automatically update this comment whenever this PR is modified

Binder :point_left: Launch a binder notebook on this branch for commit 374f3033bc1dc2698e21975a2b59d6161c4effc7

Binder :point_left: Launch a binder notebook on this branch for commit de8df1cea2b460fd58629e5e3584e971273c1434

Binder :point_left: Launch a binder notebook on this branch for commit 767e11982590a1f38bc68f509c53cba18d645f9f

Binder :point_left: Launch a binder notebook on this branch for commit 2d485b3a6ad49bba32a1cba6ec3b9c6d2aae4c73

Binder :point_left: Launch a binder notebook on this branch for commit 1061743d7cbc61ea8f7c2b5be4d6968b727f25d0

github-actions[bot] avatar Oct 28 '24 14:10 github-actions[bot]

User pre-commit-ci[bot] does not have permission to run integration tests. A maintainer must perform a security review of the code changes in this pull request and re-run the failed integration tests jobs, if the code is deemed safe.

github-actions[bot] avatar Nov 05 '24 00:11 github-actions[bot]