earthaccess
earthaccess copied to clipboard
Added parameters and exception behaviour to pqdm
I did all the steps outlined in the issue #581, we can now build upon this.
📚 Documentation preview 📚: https://earthaccess--792.org.readthedocs.build/en/792/
@mfisher87, I see lots of Ruff error messages regarding a number of files that @Sherwin-14 has not touched in this PR. Is there something already on main to address these errors, such that Sherwin simply needs to merge main into his own branch to silence them?
Not yet, but that should be an easy fix. I'll take care of it now :) https://github.com/nsidc/earthaccess/issues/788
https://github.com/nsidc/earthaccess/pull/793
Not yet, but that should be an easy fix. I'll take care of it now :) #788 #793
Excellent! Thank you for the quick fix.
@Sherwin-14, Matt's fix is now merged into main, so please update your branch with the fix from main, which will eliminate the ruff errors in your build.
@Sherwin-14, do you have enough information to proceed with the changes I requested?
@Sherwin-14, do you have enough information to proceed with the changes I requested?
Yeah I have the info I needed, I had put this on hold since I was preoccupied with some other things for the last couple of days. I'll resume my work on this asap.
Would it be possible and in scope of this PR to use **kwargs and pass them to pqdm for both open() and download() with sane defaults? Users could be in control of the failing behavior but also the number of threads to use and the verbosity (don't display the progress bar). I'm not sure about the style, in Xarray they use **kwargs and an equivalent backend_kwargs={} that in our case could be something like pqdm_opts={}
# code in api.py and something similar in store.py
# in this example we leave threads out to ensure backwards compatibility and will use it in the store class if present
def open(granules, threads, provider, pqdm_opts):
default_pqdm_opts = {
n_jobs=threads,
exception_behaviour='immediate'
colour="green",
disable=False # True means no output including the progress bar.
}
if not pqdm_opts;
pqdm_opts = default_pqdm_opts
earthaccess.__store__.open(granules, treads, pqdm_opts)
and actually we should do the same with fsspec so we let users tune their caching behavior when opening the files, aka a "smart_open". What do you all think?
I had put this on hold since I was preoccupied with some other things for the last couple of days. I'll resume my work on this asap.
@Sherwin-14 No rush! We really appreciate any time you can spare, and don't want to ask for more than that :)
@betolink I really like that idea. Basically "Allow customizing all pqdm behaviors" (and separately, "Allow customizing all fsspec behaviors")?
I'm not sure how I feel about including that as part of this PR. I lean towards separate out of respect for @Sherwin-14 's time, but I don't want to speak for you, Sherwin!
We decided during hack day today that we want to allow users to customize pqdm behaviors like so:
def open(something, threads=16, *, pqdm_kwargs={...}, ...):
pqdm_kwargs_defaults = {"exception_behavior": "immediate", "threads": threads}
pqdm_kwargs = pqdm_kwargs_defaults.update(pqdm_kwargs)
...
We considered the threads argument as fundamental to the function of earthaccess, so we think it should remain a top-level parameter. We'll use threads whether we use pqdm or not in 5 years.
We decided during hack day today that we want to allow users to customize pqdm behaviors like so:
def open(something, threads=16, *, pqdm_kwargs={...}, ...): pqdm_kwargs_defaults = {"exception_behavior": "immediate", "threads": threads} pqdm_kwargs = pqdm_kwargs_defaults.update(pqdm_kwargs) ...We considered the
threadsargument as fundamental to the function of earthaccess, so we think it should remain a top-level parameter. We'll use threads whether we use pqdm or not in 5 years.
How should we now proceed with the changes we discussed during the hackday?
We decided during hack day today that we want to allow users to customize pqdm behaviors like so:
def open(something, threads=16, *, pqdm_kwargs={...}, ...): pqdm_kwargs_defaults = {"exception_behavior": "immediate", "threads": threads} pqdm_kwargs = pqdm_kwargs_defaults.update(pqdm_kwargs) ...We considered the
threadsargument as fundamental to the function of earthaccess, so we think it should remain a top-level parameter. We'll use threads whether we use pqdm or not in 5 years.How should we now proceed with the changes we discussed during the hackday?
Do the same thing you did in the PR you already started for this, but instead of fail_fast (which is what you pretty much completed) change it to pqdm_kwargs and adjust your code changes similar to what is shown above, although the code above won't work because pqdm has an n_jobs arg, not a threads arg, so I suggest something like so, instead, within the functions/methods that invoke pqdm:
pqdm_kwargs = {
"exception_behavior": "immediate",
"n_jobs": threads,
**pqdm_kwargs,
}
...
pqdm(<other args>, **pqdm_kwargs)
Great point! Much clearer with the double splat too ;)
@Sherwin-14, do you need some help with this?
@Sherwin-14, do you need some help with this?
I will try to accomodate the change asap, I was caught up in something else up untill now
@Sherwin-14, do you need some help with this?
I will try to accomodate the change asap, I was caught up in something else up untill now
No rush. I just wanted to check in to see if you need any help or if something else was requiring you to postpone your continuation.
@chuckwondo
def download(
granules: Union[DataGranule, List[DataGranule], str, List[str]],
local_path: Optional[str],
provider: Optional[str] = None,
threads: int = 8,
pqdm_kwargs: dict = {}
) -> List[str]:
"""Retrieves data granules from a remote storage system.
* If we run this in the cloud, we will be using S3 to move data to `local_path`.
* If we run it outside AWS (us-west-2 region) and the dataset is cloud hosted,
we'll use HTTP links.
Parameters:
granules: a granule, list of granules, a granule link (HTTP), or a list of granule links (HTTP)
local_path: local directory to store the remote data granules
provider: if we download a list of URLs, we need to specify the provider.
threads: parallel number of threads to use to download the files, adjust as necessary, default = 8
Returns:
List of downloaded files
Raises:
Exception: A file download failed.
"""
provider = _normalize_location(provider)
pqdm_kwargs = {
"exception_behavior": "immediate",
"n_jobs": threads,
**pqdm_kwargs,
}
if isinstance(granules, DataGranule):
granules = [granules]
elif isinstance(granules, str):
granules = [granules]
try:
results = earthaccess.__store__.get(
granules, local_path, provider, threads, pqdm_kwargs
)
except AttributeError as err:
logger.error(
f"{err}: You must call earthaccess.login() before you can download data"
)
return []
return results
Just for clarification, the public API function download would look something like this. The pqdm_kwargs would be taken in at the user interface and then passed on to the lower level functions in the entire call chain isn't it? or am I missing something?
Correct, but you almost never want to use mutable values for default values, and this is one of those cases (among other references, see https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments).
So, do NOT do this:
def download(
granules: Union[DataGranule, List[DataGranule], str, List[str]],
local_path: Optional[str],
provider: Optional[str] = None,
threads: int = 8,
pqdm_kwargs: dict = {}
) -> List[str]:
pqdm_kwargs = {
"exception_behavior": "immediate",
"n_jobs": threads,
**pqdm_kwargs,
}
Instead, DO THIS:
def download(
granules: Union[DataGranule, List[DataGranule], str, List[str]],
local_path: Optional[str],
provider: Optional[str] = None,
threads: int = 8,
pqdm_kwargs: Optional[Mapping[str, Any]] = None,
) -> List[str]:
pqdm_kwargs = {
"exception_behavior": "immediate",
"n_jobs": threads,
**(pqdm_kwargs or {}),
}
:warning: Please install the to ensure uploads and comments are reliably processed by Codecov.
Codecov Report
Attention: Patch coverage is 50.00000% with 27 lines in your changes missing coverage. Please review.
Project coverage is 68.78%. Comparing base (
2f49c08) to head (1061743). Report is 69 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| earthaccess/store.py | 0.00% | 22 Missing :warning: |
| earthaccess/api.py | 37.50% | 5 Missing :warning: |
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
:exclamation: There is a different number of reports uploaded between BASE (2f49c08) and HEAD (1061743). Click for more details.
HEAD has 6 uploads less than BASE
Flag BASE (2f49c08) HEAD (1061743) 14 8
Additional details and impacted files
@@ Coverage Diff @@
## main #792 +/- ##
==========================================
- Coverage 73.88% 68.78% -5.11%
==========================================
Files 31 32 +1
Lines 2003 2153 +150
==========================================
+ Hits 1480 1481 +1
- Misses 523 672 +149
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@Sherwin-14, several other PRs have landed on main, so please also be sure to merge the changes from main and resolve conflicts.
@chuckwondo Hey Chuck, I have made the appropriate changes but I haven't pushed the code yet since I was writing unittest for this one which you suggested earlier. I will re-request for a review once I feel this is ready to be reviewed
@chuckwondo Hey Chuck, I have made the appropriate changes but I haven't pushed the code yet since I was writing unittest for this one which you suggested earlier. I will re-request for a review once I feel this is ready to be reviewed
Sounds good, but please wait to mark any suggestions as resolved until you push the corresponding code changes, otherwise it is confusing to others.
class TestEarthAccessDownload(unittest.TestCase):
def test_download(self):
earthaccess.login()
results = earthaccess.search_data(
short_name='ATL06', # ATLAS/ICESat-2 L3A Land Ice Height
bounding_box=(-10, 20, 10, 50), # Only include files in area of interest...
temporal=("1999-02", "2019-03"), # ...and time period of interest.
count=10
)
with patch.object(earthaccess, '__store__', return_value= Mock()) as mock_store:
with patch.object(mock_store, 'get') as mock_get:
mock_get.side_effect = Exception('Download failed')
files = earthaccess.download(results, "/home/download-folder")
self.assertEqual(files, [])
if __name__ == '__main__':
unittest.main()
This is the test function I came up with and this is giving me an error log that looks something like this. Is this the thing we want to achieve at this point? Also if the test is correct then where should I place it should I make a completely new file for this like test_api.py?
Hi @Sherwin-14. Thanks for working on the unit tests for this PR.
The first thing I suggest you do is merge the latest from main into your branch, so you're working with the latest code.
Regarding the test you've written, it looks like you're generally heading in the right direction.
I have some suggestions:
- Sure, if you want to put your new tests in a new
tests/unit/test_api.pyfile, that sounds good to me. - Do not use the unittest library. There's no need for the additional "overhead". Simply use top-level test functions and built-in
assertstatements. We have almost completely moved away from unittest in our integration tests, and we have a mix in our unit tests, but we should continue moving away from unittest (although I suspect this may be up for debate). - Regarding your test shown above, that looks good, but since the new behavior is for the call to
downloadto fail immediately upon any single download failure (i.e., pqdmexception_behavior="immediate"), then you will want to put your call todownloadin a context manager, as by default we expect an exception to be raised (as shown in your linked test failure output), thus we do not expect to be able to assert that we get back an empty list of results:
with patch.object(earthaccess, '__store__', return_value= Mock()) as mock_store:
with patch.object(mock_store, 'get') as mock_get:
mock_get.side_effect = Exception('Download failed')
with pytest.raises(Exception, match="Download failed"):
earthaccess.download(results, "/home/download-folder")
I have taken out unittest but it seems that it is necessary to use unittest for importing the likes of Mock. Another option here would be to use the pytest mock library but that means we will need to add an additional dependency.
I have taken out unittest but it seems that it is necessary to use unittest for importing the likes of Mock. Another option here would be to use the pytest mock library but that means we will need to add an additional dependency.
You can still use unittest Mocks, but there's no need for creating a subclass of unittest.TestCase.
:point_left: Launch a binder notebook on this branch for commit d22f2acfb774c4875da4ec0d1620b9da7fdf2dd6
I will automatically update this comment whenever this PR is modified
:point_left: Launch a binder notebook on this branch for commit 374f3033bc1dc2698e21975a2b59d6161c4effc7
:point_left: Launch a binder notebook on this branch for commit de8df1cea2b460fd58629e5e3584e971273c1434
:point_left: Launch a binder notebook on this branch for commit 767e11982590a1f38bc68f509c53cba18d645f9f
:point_left: Launch a binder notebook on this branch for commit 2d485b3a6ad49bba32a1cba6ec3b9c6d2aae4c73
:point_left: Launch a binder notebook on this branch for commit 1061743d7cbc61ea8f7c2b5be4d6968b727f25d0
User pre-commit-ci[bot] does not have permission to run integration tests. A maintainer must perform a security review of the code changes in this pull request and re-run the failed integration tests jobs, if the code is deemed safe.