astroquery
astroquery copied to clipboard
WIP: MAST query result cache support outline
Closes #1577 .
So far this is an outline of the proposed change. I'd like to use the PR to solicit feedback (and gauge the efforts required) before proceeding further.
Hello @orionlee! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
Comment last updated at 2019-11-05 00:33:35 UTC
Proposed changes outline. @ceb8 @barentsen feedback welcome.
mast.Observations.query_criteria_async(): add optional parameterscache=False,cache_opts=None- The cache parameters propagate to a new function,
MastClass._request_w_cache(), which wraps around existingMastClass._request()function and implements caching. - Caching can be similarly exposed to other MAST query API functions.
Discussion Points:
-
The actual caching mechanism: follow the base
astropy.BaseQuery's pattern. See itsBaseQuery._request()function: https://github.com/astropy/astroquery/blob/ebdf195207a50a0b71efa4ba52556f25aa8f4e67/astroquery/query.py#L239-L248 -
Caching policy (expiration, etc.): TBD. I don't know astropy API well enough yet, but it should reuse existing astropy convention when possible.
Codecov Report
Merging #1578 (d949888) into main (df3b6fc) will decrease coverage by
6.08%. The diff coverage is100.00%.
:exclamation: Current head d949888 differs from pull request most recent head 1aa7078. Consider uploading reports for the commit 1aa7078 to get more accurate results
@@ Coverage Diff @@
## main #1578 +/- ##
==========================================
- Coverage 69.18% 63.11% -6.08%
==========================================
Files 304 133 -171
Lines 22529 17348 -5181
==========================================
- Hits 15587 10949 -4638
+ Misses 6942 6399 -543
| Impacted Files | Coverage Ξ | |
|---|---|---|
| astroquery/mast/discovery_portal.py | 69.65% <100.00%> (+3.14%) |
:arrow_up: |
| astroquery/mast/observations.py | 76.01% <100.00%> (+0.59%) |
:arrow_up: |
| astroquery/query.py | 58.97% <100.00%> (-3.43%) |
:arrow_down: |
... and 245 files with indirect coverage changes
:mega: Weβre building smart automated test selection to slash your CI/CD build times. Learn more
Caching policy (expiration, etc.): TBD. I don't know astropy API well enough yet, but it should reuse existing astropy convention when possible.
we don't yet have a good solution to that.
Basic caching support (cache=True / False, on par with astroquery) is added to
mast.Observations.query_criteria_async() for review.
Please review before I proceed any further. It it's agreed, similar pattern can be applied to other MAST query functions.
Notes from my side:
- a change is added to the generic
AstroQuery.from_cache()function, to accommodate MAST response (which is a list ofResponserather than a single one). It might be more desirable to refactor the caching logic so that it can be used independent of AstroQuery, so that MAST-specific behavior can be accommodated more easily without adding to base AstroQuery, or duplicating the caching logic.
I don't yet see why the tests are failing, it looks very unrelated and something I couldn't yet reproduce locally.
Also, I'm travelling this week, but will try to get back to this and do a reasonable review soon.
For failing tests: for now I assume that they are due to some transient issues with CI, and expect they will recover in the next push.
(Just wanted to subscribe and say I am super excited about this PR, and will be happy to help test.)
I'm swamped through Monday, but should have time to test this later next week.
This implements a really useful feature. At the moment I frequently query MAST in order to get the same "search result" as part of my pipeline to iterate through a large list of objects, even though the files themselves are cached. Is there anyway I can help get this merged?
@christinahedges I will try to make time next week to look at this PR.
I am also in general concerned with the lack of an expiration on the cache, which has bitten me before when for example a colleague and I queried Ned repeatedly, not understanding why we were getting different results and it turned out that one of us had made the same query a year prior (according to the date on the cache file) and were still accessing that very old search result. This has also come up within astroquery.mast with Tesscut which does use astroquery caching, resulting in users not finding new results (there are TESS releases ~monthly).
@ceb8 I think this is a really good point. In particular if users query for e.g. TESS data, there will potentially be new targets and files every month. For the most part, I want this functionality so that when I re-run my scripts 5 times a day, I'm not waiting for a result from MAST. If we had an expiration date of e.g. 7 days, that would totally meet my needs.
@christinahedges Yeah, I think ideally the cache will have a default timeout on the short scale, the user will have the ability to set it to a user specified value, and there will always be a no-cache option.
@ceb8 - There is a cascade of issues for caching, so certainly there is a need to get the caching solved for many modules: https://github.com/astropy/astroquery/issues/1511 https://github.com/astropy/astroquery/issues/1531 and even a very old PR, that may or may not be suitable to be thawed up: https://github.com/astropy/astroquery/pull/442
cc @keflavich @jwoillez
@bsipocz I will try to look at the caching situation overall next week then, and gather what needs to be done and make a plan. π
@ceb8 and @bsipocz I really appreciate it! This will be such great functionality, and will hopefully stop me pinging MAST 100 times a day...! Let me know if I can do anything to help!
@bsipocz @christinahedges Opened WIP PR https://github.com/astropy/astroquery/pull/1634 to address the general caching situation. Will look at this PR next.
That is awesome thank you! I can't wait for this functionality! π π π π π π π
per @ceb , this PR will wait until #1634 (cache expiration policy, etc.) completes.
@orionlee @ceb8 - it would be interesting to get this rebased on top of the (likely finished) #1634 as a way to provide some extra testing for that PR.
@bsipocz Excellent Idea, I'll work on this next.
This PR predates a large refactor of the MAST code, so I had to do a manual rebase and force push.
@ceb8 - can you rebase this on top of https://github.com/astropy/astroquery/pull/1634, so to see it works with the refactored machinery?
(then this PR will contain the changes from that one, etc, but we'll sort that out along the way).
Also, this PR is very much considered WIP/draft as part of testing the other one, so please everyone, no detailed code reviews until we sort out the big picture!
Hmm, I think something went bad during the rebase. did you try to do it interactively, and keep only the one commit from this branch, rather than the ones from main, too?
@bsipocz Found some interesting things:
- Only one small tweak (latest commit) required and it's caching as promised (yay)
- Because of the way we determine cache location and the way the MAST module is written, the cache directory is not the name of the top level class the user interacts with, which is definitely a problem. Not sure if this should be solved by the MAST module or if I need to tweak the caching infrastructure. I will explore.
Because of the way we determine cache location and the way the MAST module is written, the cache directory is not the name of the top level class the user interacts with, which is definitely a problem. Not sure if this should be solved by the MAST module or if I need to tweak the caching infrastructure. I will explore.
can you elaborate on this with an example? There are problems with the MAST (and gemini) class naming already, I wonder whether this would be a non-issue if that one is solved or are unrelated.
@bsipocz It's unrelated. Basically MAST hands off querying to a difference class depending on what interface is needed, and that class does the caching, rather than the top level class the user queried.
Example:
from astroquery.mast import Observations
obs_table = Observations.query_criteria(dataproduct_type=["image"],
proposal_pi="Osten*",
s_dec=[43.5,45.5], cache=True)
In this example you would expect the cache location to be Observations.cache_location (.astropy/cache/astroquery/Observations) but it actually caches the file in .astropy/cache/astroquery/PortalAPI because the PortalAPI class is the one that handles this type of query.
In this example you would expect the cache location to be Observations.cache_location (.astropy/cache/astroquery/Observations) but it actually caches the file in .astropy/cache/astroquery/PortalAPI because the PortalAPI class is the one that handles this type of query.
Yeap, this is pretty bad. At least the issue of having two Observations class is known and in the works to fix it with a very long deprecation, it still won't solve the main issue.
Would you think adding one or more layers help? Or maybe the full namespace would be the solution, as we have substructure in a few modules (though caching everything esa together, and everything ipac, and everything solarsystem, etc doesn't feel that wrong as having just one flat layer.
btw, this is exactly the issue that I hoped to be able to recover before merging the other PR.