astroquery icon indicating copy to clipboard operation
astroquery copied to clipboard

WIP: MAST query result cache support outline

Open orionlee opened this issue 6 years ago β€’ 37 comments
trafficstars

Closes #1577 .

So far this is an outline of the proposed change. I'd like to use the PR to solicit feedback (and gauge the efforts required) before proceeding further.

orionlee avatar Oct 25 '19 19:10 orionlee

Hello @orionlee! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2019-11-05 00:33:35 UTC

pep8speaks avatar Oct 25 '19 19:10 pep8speaks

Proposed changes outline. @ceb8 @barentsen feedback welcome.

  1. mast.Observations.query_criteria_async() : add optional parameters cache=False, cache_opts=None
  2. The cache parameters propagate to a new function, MastClass._request_w_cache(), which wraps around existing MastClass._request() function and implements caching.
  3. Caching can be similarly exposed to other MAST query API functions.

Discussion Points:

  1. The actual caching mechanism: follow the base astropy.BaseQuery's pattern. See its BaseQuery._request() function: https://github.com/astropy/astroquery/blob/ebdf195207a50a0b71efa4ba52556f25aa8f4e67/astroquery/query.py#L239-L248

  2. Caching policy (expiration, etc.): TBD. I don't know astropy API well enough yet, but it should reuse existing astropy convention when possible.

orionlee avatar Oct 25 '19 20:10 orionlee

Codecov Report

Merging #1578 (d949888) into main (df3b6fc) will decrease coverage by 6.08%. The diff coverage is 100.00%.

:exclamation: Current head d949888 differs from pull request most recent head 1aa7078. Consider uploading reports for the commit 1aa7078 to get more accurate results

@@            Coverage Diff             @@
##             main    #1578      +/-   ##
==========================================
- Coverage   69.18%   63.11%   -6.08%     
==========================================
  Files         304      133     -171     
  Lines       22529    17348    -5181     
==========================================
- Hits        15587    10949    -4638     
+ Misses       6942     6399     -543     
Impacted Files Coverage Ξ”
astroquery/mast/discovery_portal.py 69.65% <100.00%> (+3.14%) :arrow_up:
astroquery/mast/observations.py 76.01% <100.00%> (+0.59%) :arrow_up:
astroquery/query.py 58.97% <100.00%> (-3.43%) :arrow_down:

... and 245 files with indirect coverage changes

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov[bot] avatar Oct 25 '19 20:10 codecov[bot]

Caching policy (expiration, etc.): TBD. I don't know astropy API well enough yet, but it should reuse existing astropy convention when possible.

we don't yet have a good solution to that.

bsipocz avatar Oct 25 '19 21:10 bsipocz

Basic caching support (cache=True / False, on par with astroquery) is added to mast.Observations.query_criteria_async() for review.

Please review before I proceed any further. It it's agreed, similar pattern can be applied to other MAST query functions.

Notes from my side:

  • a change is added to the generic AstroQuery.from_cache() function, to accommodate MAST response (which is a list of Response rather than a single one). It might be more desirable to refactor the caching logic so that it can be used independent of AstroQuery, so that MAST-specific behavior can be accommodated more easily without adding to base AstroQuery, or duplicating the caching logic.

orionlee avatar Nov 05 '19 00:11 orionlee

I don't yet see why the tests are failing, it looks very unrelated and something I couldn't yet reproduce locally.

Also, I'm travelling this week, but will try to get back to this and do a reasonable review soon.

bsipocz avatar Nov 06 '19 22:11 bsipocz

For failing tests: for now I assume that they are due to some transient issues with CI, and expect they will recover in the next push.

orionlee avatar Nov 06 '19 23:11 orionlee

(Just wanted to subscribe and say I am super excited about this PR, and will be happy to help test.)

barentsen avatar Nov 07 '19 03:11 barentsen

I'm swamped through Monday, but should have time to test this later next week.

ceb8 avatar Nov 07 '19 13:11 ceb8

This implements a really useful feature. At the moment I frequently query MAST in order to get the same "search result" as part of my pipeline to iterate through a large list of objects, even though the files themselves are cached. Is there anyway I can help get this merged?

christinahedges avatar Jan 29 '20 18:01 christinahedges

@christinahedges I will try to make time next week to look at this PR.

I am also in general concerned with the lack of an expiration on the cache, which has bitten me before when for example a colleague and I queried Ned repeatedly, not understanding why we were getting different results and it turned out that one of us had made the same query a year prior (according to the date on the cache file) and were still accessing that very old search result. This has also come up within astroquery.mast with Tesscut which does use astroquery caching, resulting in users not finding new results (there are TESS releases ~monthly).

ceb8 avatar Jan 29 '20 19:01 ceb8

@ceb8 I think this is a really good point. In particular if users query for e.g. TESS data, there will potentially be new targets and files every month. For the most part, I want this functionality so that when I re-run my scripts 5 times a day, I'm not waiting for a result from MAST. If we had an expiration date of e.g. 7 days, that would totally meet my needs.

christinahedges avatar Jan 29 '20 19:01 christinahedges

@christinahedges Yeah, I think ideally the cache will have a default timeout on the short scale, the user will have the ability to set it to a user specified value, and there will always be a no-cache option.

ceb8 avatar Jan 29 '20 19:01 ceb8

@ceb8 - There is a cascade of issues for caching, so certainly there is a need to get the caching solved for many modules: https://github.com/astropy/astroquery/issues/1511 https://github.com/astropy/astroquery/issues/1531 and even a very old PR, that may or may not be suitable to be thawed up: https://github.com/astropy/astroquery/pull/442

cc @keflavich @jwoillez

bsipocz avatar Jan 29 '20 20:01 bsipocz

@bsipocz I will try to look at the caching situation overall next week then, and gather what needs to be done and make a plan. πŸ˜ƒ

ceb8 avatar Jan 29 '20 21:01 ceb8

@ceb8 and @bsipocz I really appreciate it! This will be such great functionality, and will hopefully stop me pinging MAST 100 times a day...! Let me know if I can do anything to help!

christinahedges avatar Jan 29 '20 21:01 christinahedges

@bsipocz @christinahedges Opened WIP PR https://github.com/astropy/astroquery/pull/1634 to address the general caching situation. Will look at this PR next.

ceb8 avatar Feb 05 '20 15:02 ceb8

That is awesome thank you! I can't wait for this functionality! πŸŽ‰ πŸŽ‰ πŸŽ‰ πŸŽ‰ πŸŽ‰ πŸŽ‰ πŸŽ‰

christinahedges avatar Feb 05 '20 18:02 christinahedges

per @ceb , this PR will wait until #1634 (cache expiration policy, etc.) completes.

orionlee avatar Aug 26 '20 00:08 orionlee

@orionlee @ceb8 - it would be interesting to get this rebased on top of the (likely finished) #1634 as a way to provide some extra testing for that PR.

bsipocz avatar Aug 17 '22 22:08 bsipocz

@bsipocz Excellent Idea, I'll work on this next.

ceb8 avatar Sep 08 '22 14:09 ceb8

This PR predates a large refactor of the MAST code, so I had to do a manual rebase and force push.

ceb8 avatar Sep 27 '22 16:09 ceb8

@ceb8 - can you rebase this on top of https://github.com/astropy/astroquery/pull/1634, so to see it works with the refactored machinery?

bsipocz avatar Sep 27 '22 16:09 bsipocz

(then this PR will contain the changes from that one, etc, but we'll sort that out along the way).

Also, this PR is very much considered WIP/draft as part of testing the other one, so please everyone, no detailed code reviews until we sort out the big picture!

bsipocz avatar Sep 27 '22 16:09 bsipocz

Hmm, I think something went bad during the rebase. did you try to do it interactively, and keep only the one commit from this branch, rather than the ones from main, too?

bsipocz avatar Sep 27 '22 17:09 bsipocz

@bsipocz Found some interesting things:

  • Only one small tweak (latest commit) required and it's caching as promised (yay)
  • Because of the way we determine cache location and the way the MAST module is written, the cache directory is not the name of the top level class the user interacts with, which is definitely a problem. Not sure if this should be solved by the MAST module or if I need to tweak the caching infrastructure. I will explore.

ceb8 avatar Sep 29 '22 16:09 ceb8

Because of the way we determine cache location and the way the MAST module is written, the cache directory is not the name of the top level class the user interacts with, which is definitely a problem. Not sure if this should be solved by the MAST module or if I need to tweak the caching infrastructure. I will explore.

can you elaborate on this with an example? There are problems with the MAST (and gemini) class naming already, I wonder whether this would be a non-issue if that one is solved or are unrelated.

bsipocz avatar Sep 29 '22 17:09 bsipocz

@bsipocz It's unrelated. Basically MAST hands off querying to a difference class depending on what interface is needed, and that class does the caching, rather than the top level class the user queried.

Example:

from astroquery.mast import Observations

obs_table = Observations.query_criteria(dataproduct_type=["image"],
                                        proposal_pi="Osten*",
                                        s_dec=[43.5,45.5], cache=True)

In this example you would expect the cache location to be Observations.cache_location (.astropy/cache/astroquery/Observations) but it actually caches the file in .astropy/cache/astroquery/PortalAPI because the PortalAPI class is the one that handles this type of query.

ceb8 avatar Sep 29 '22 17:09 ceb8

In this example you would expect the cache location to be Observations.cache_location (.astropy/cache/astroquery/Observations) but it actually caches the file in .astropy/cache/astroquery/PortalAPI because the PortalAPI class is the one that handles this type of query.

Yeap, this is pretty bad. At least the issue of having two Observations class is known and in the works to fix it with a very long deprecation, it still won't solve the main issue.

Would you think adding one or more layers help? Or maybe the full namespace would be the solution, as we have substructure in a few modules (though caching everything esa together, and everything ipac, and everything solarsystem, etc doesn't feel that wrong as having just one flat layer.

bsipocz avatar Sep 30 '22 02:09 bsipocz

btw, this is exactly the issue that I hoped to be able to recover before merging the other PR.

bsipocz avatar Sep 30 '22 02:09 bsipocz