earthaccess icon indicating copy to clipboard operation
earthaccess copied to clipboard

the `earthaccess.EarthAccessFile` wrapper need not subclass anything

Open itcarroll opened this issue 1 year ago • 1 comments

Fixes #610, closes #563.

This PR removes any base class from the definition of earthaccess.EarthAccessFile (EAF). Previously, EAF inherited from fsspec.spec.AbstractBufferedFile (ABF) so was capable of using methods defined on ABF. But EAF held an instance of an ABF at self.f and handed off __getattr__ requests to that object. Under this setup, self.read returns super().read if read is defined on ABF (and read is defined on ABF) else self.read returns self.f.read. That is a bug. It was probably assumed that __getattr__ would catch all method calls, but it only handles what __getattribute__ can't find.

We've scraped by with this setup because self.f is also an ABF and either does not override ABF on a called method or the override does little more than itself call super(). The latter is the case for self.f.read when f is a fsspec.implementations.http.HTTPFile. It is not the case when f is a fsspec.implementations.http.HTTPStreamFile.

This PR also updates some type hints and relevant documentation.

  • the type hint on f was wrong, it is an ABF not a fsspec.AbstractFileSystem
  • type hints previously given as an ABF are now given as EAF (b/c it is no longer an instance of ABF)
  • the EAF is added to mkdocs un Modules/Store

ToDo if integration tests look okay:

  • [x] add to changelog
  • [x] check for any changes needed to docs

📚 Documentation preview 📚: https://earthaccess--620.org.readthedocs.build/en/620/

itcarroll avatar Jun 26 '24 03:06 itcarroll

I need to play with this a little bit to better understand what's going on, but I may not have time until the next hack day.

mfisher87 avatar Jul 09 '24 18:07 mfisher87

I need to play with this a little bit to better understand what's going on, but I may not have time until the next hack day.

@mfisher87, have you had a chance to do this?

@itcarroll or @betolink, I suppose the larger question for me is, what's the point of this class to begin with? Why do we even need it?

chuckwondo avatar Sep 03 '24 10:09 chuckwondo

Thanks for checking on this one @chuckwondo! My guess on the need for this class was something to do with deserializing into a useable object in case authentication timed out. A guess only though, as I don't know when EarthAccessFile.__reduce__ would be called.

itcarroll avatar Sep 03 '24 12:09 itcarroll

@chuckwondo

I suppose the larger question for me is, what's the point of this class to begin with? Why do we even need it?

@jrbourbeau can explain in detail but the gist of it is that this class allows a serialization trick, if we open granules from our laptop but we are offloading an xarray operation to a Dask cluster in us-west-2, it will re-open the files in place using s3://url instead of the https://cloud-front-tea url. I think James mentioned that the speed improvement was ~2x vs only using HTTPS.

betolink avatar Sep 03 '24 13:09 betolink

@jrbourbeau The essential question is whether removing fsspec.spec.AbstractBufferedFile from the MRO of EarthAccessFile will break the usage @betolink describes. I don't think we have a test using coiled. (We still have the file-like instance attached as the f attribute and used in the __getattr__ redirection.)

itcarroll avatar Sep 03 '24 18:09 itcarroll

I suppose the larger question for me is, what's the point of this class to begin with? Why do we even need it?

@jrbourbeau can explain in detail but the gist of it is that this class allows a serialization trick, if we open granules from our laptop but we are offloading an xarray operation to a Dask cluster in us-west-2, it will re-open the files in place using s3://url instead of the https://cloud-front-tea url. I think James mentioned that the speed improvement was ~2x vs only using HTTPS.

I didn't notice the make_instance function earlier, but now that I've looked at it, your description is now more clear to me.

However, I'm questioning the whole idea of pickling the fsspec.spec.AbstractBufferedFile to begin with. Opening files from your laptop, then distributing the open files across a cluster seems like you may be inviting more trouble than it's worth. Pickling potentially oddly complex objects to distribute them to other processes can be more problematic than passing around simpler things, such as the URLs from which the files were opened, rather than the opened files themselves.

I may very well be thinking about this poorly, or simply be too inexperienced with using dask (and the like), but without seeing an example of the types of things you think this would be helpful for, offhand I would ask you why you're not simply distributing the URLs rather than the open files?

Is it so that you don't have to also potentially distribute credentials across such clusters as well?

chuckwondo avatar Sep 03 '24 23:09 chuckwondo

have you had a chance to do this?

I have not yet, sorry :X

mfisher87 avatar Sep 05 '24 23:09 mfisher87

Welcome to Codecov :tada:

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered :open_umbrella:

codecov-commenter avatar Sep 23 '24 15:09 codecov-commenter

I suppose the larger question for me is, what's the point of this class to begin with? Why do we even need it?

@jrbourbeau can explain in detail but the gist of it is that this class allows a serialization trick, if we open granules from our laptop but we are offloading an xarray operation to a Dask cluster in us-west-2, it will re-open the files in place using s3://url instead of the https://cloud-front-tea url. I think James mentioned that the speed improvement was ~2x vs only using HTTPS.

I didn't notice the make_instance function earlier, but now that I've looked at it, your description is now more clear to me.

However, I'm questioning the whole idea of pickling the fsspec.spec.AbstractBufferedFile to begin with. Opening files from your laptop, then distributing the open files across a cluster seems like you may be inviting more trouble than it's worth. Pickling potentially oddly complex objects to distribute them to other processes can be more problematic than passing around simpler things, such as the URLs from which the files were opened, rather than the opened files themselves.

I may very well be thinking about this poorly, or simply be too inexperienced with using dask (and the like), but without seeing an example of the types of things you think this would be helpful for, offhand I would ask you why you're not simply distributing the URLs rather than the open files?

Is it so that you don't have to also potentially distribute credentials across such clusters as well?

Regarding the specific changes in the PR, they look fine to me, but is anybody able to answer my question above?

chuckwondo avatar Sep 23 '24 15:09 chuckwondo

pre-commit.ci autofix

itcarroll avatar Sep 23 '24 16:09 itcarroll

@chuckwondo:

Pickling potentially oddly complex objects to distribute them to other processes can be more problematic than passing around simpler things, such as the URLs from which the files were opened, rather than the opened files themselves.

Some complexity will be needed to handle the s3 <--> https hand-off between laptops and in-region workers spun up by coiled. Here it's a pickled file-like object, but additional in_region checks in __store__._get_urls could be a better approach. It looks like you are okay with making that a separate issue for an enhancement, rather than addressing it in this bug fix.

itcarroll avatar Sep 23 '24 20:09 itcarroll

No worries @jrbourbeau, thanks for the review!

Though not sure exactly what's going on with the integration tests

Someone with write-access needs to trigger them is all.

itcarroll avatar Sep 23 '24 20:09 itcarroll

No worries @jrbourbeau, thanks for the review!

Though not sure exactly what's going on with the integration tests

Someone with write-access needs to trigger them is all.

This is not yet possible. Once #808 lands, this will be possible. If you are willing to wait for #808 to be approved and merged to main, then you could merge main into your PR, and this will retrigger int. tests, which will then fail again, but then a maintainer will be able to re-run the failed int. tests so that they can pass (assuming nothing was broken).

chuckwondo avatar Sep 23 '24 20:09 chuckwondo

Someone with write-access needs to trigger them is all This is not yet possible

Not sure exactly what the state of integrations tests are but FWIW I went ahead and bumped those failing CI jobs and they're passing now

jrbourbeau avatar Sep 23 '24 20:09 jrbourbeau

Someone with write-access needs to trigger them is all This is not yet possible

Not sure exactly what the state of integrations tests are but FWIW I went ahead and bumped those failing CI jobs and they're passing now

LOL! Apologies. I'm getting my PRs mixed up. I forgot that part already landed 🤦

chuckwondo avatar Sep 23 '24 21:09 chuckwondo