
Not overwriting MAST download of incomplete file

Open eas342 opened this issue 8 months ago • 9 comments

I believe astroquery.mast should be able to download and overwrite an incomplete file. However, the file is still corrupted/incomplete afterwards, even though the code recognizes it as incomplete. Here are the steps to reproduce.

from astroquery.mast import Observations
from astropy.io import fits
fileN = 'jw01185103001_02102_00001-seg001_nrcalong_rate.fits'
with open(fileN,'wb') as f:
    f.write(b'junk')
Observations.download_file('mast:jwst/product/'+fileN)
tmpHDU = fits.open(fileN)

I get OSError: Empty or corrupt FITS file whereas I was expecting astroquery to download a new file because it detects that the cached one is incomplete.

I'm using astroquery 0.4.10.dev9927 on Python 3.12.2

eas342 avatar Feb 27 '25 22:02 eas342

I was expecting astroquery to download a new file because it detects that the cached one is incomplete.

That's not how "resume download" generally works. In pretty much every HTTP client (like wget), and also in what astroquery seems to do ( https://github.com/astropy/astroquery/blob/main/astroquery/query.py#L455 ), "resume" downloads only the missing part of the file and appends it to the already existing file. This way, if you have a 10 GB file and your download crashed in the last 1 MB, you only need to download 1 MB and not the whole file again.

Your "test" does not showcase a normal scenario because you put junk into the file instead of an actual prefix of the file.
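
To illustrate the difference, here is a minimal offline sketch of an append-style resume. It is not astroquery's implementation: `resume_download` and `fetch_range` are hypothetical names, and the remote server is replaced by an in-memory byte string so the example runs without network access. It shows that a genuine prefix is completed correctly, while a "junk" file ends up with the right size but the wrong content.

```python
import os
import tempfile


def resume_download(path, fetch_range, total_size):
    """Append the missing tail of a partially downloaded file.

    fetch_range(start, end) is a hypothetical callable standing in for an
    HTTP GET with a "Range: bytes=start-end" header (end is inclusive).
    Returns the number of bytes appended.
    """
    have = os.path.getsize(path) if os.path.exists(path) else 0
    if have >= total_size:
        return 0  # nothing left to fetch
    tail = fetch_range(have, total_size - 1)
    with open(path, "ab") as f:  # append mode keeps the existing bytes
        f.write(tail)
    return len(tail)


# Stand-in for the remote file (assumption: 10240 arbitrary bytes)
remote = bytes(range(256)) * 40


def fake_fetch(start, end):
    # Mimics a server that honours the requested byte range
    return remote[start:end + 1]


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "partial.bin")

    # Case 1: the local file is a real prefix, as left by a crashed download
    with open(target, "wb") as f:
        f.write(remote[:1000])
    fetched = resume_download(target, fake_fetch, len(remote))
    with open(target, "rb") as f:
        good_result = f.read()

    # Case 2: the local file holds junk that is not a prefix of the remote file
    with open(target, "wb") as f:
        f.write(b"junk")
    resume_download(target, fake_fetch, len(remote))
    with open(target, "rb") as f:
        junk_result = f.read()

print(good_result == remote)  # prefix case reproduces the remote file
print(junk_result == remote)  # junk case does not, despite the matching size
```

In the junk case the size check passes after the resume, so a size comparison alone cannot tell that the first four bytes are wrong.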

Pharisaeus avatar Mar 02 '25 12:03 Pharisaeus

Note that if I run the code snippet above, Astroquery says it is downloading the file (100.00%) and shows 83k in the progress bar, but when I look at the file, it still says 0B.

WARNING: Found cached file jw01185103001_02102_00001-seg001_nrcalong_rate.fits with size 4 that is different from expected size 83520 [astroquery.query]
Downloading URL https://mast.stsci.edu/api/v0.1/Download/file?uri=mast:jwst/product/jw01185103001_02102_00001-seg001_nrcalong_rate.fits to jw01185103001_02102_00001-seg001_nrcalong_rate.fits ...
|========================================================================|  83k/ 83k (100.00%)         0s

Sorry I wasn't more clear with my example. Note that MAST observations is hard-coded as continuation=False https://github.com/astropy/astroquery/blob/72952b7408400dd55bfa82e3424a4ecbc4f9c4c1/astroquery/mast/observations.py#L645 so I don't think it can download the missing parts of incomplete files anyway?

Context: I have been trying to download observations from many programs, some of which are large (~10 GB, consisting of several dozen files). I have been getting incomplete read errors, presumably due to some network or server issue. Then, if I try to repeat the download of all the products, the incomplete file is left as is. I could set cache=False, but then I would have to download everything again, and the download error might happen on a different file, so I'd have to start on the 10 GB again. A manual solution is to figure out which file failed and delete it, or re-run just that one with cache=False, but it'd be nice to have an automatic solution by re-running the full download with cache=True. Because it is an intermittent problem, I can't make an easy reproducible code snippet, so that's why I gave my "junk" file example.
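
As a rough sketch of the manual workaround described above, the helper below flags (and optionally deletes) local files whose size differs from the expected size, so that a later cached re-run only fetches those. `find_incomplete` is a hypothetical helper, not part of astroquery, and it assumes the expected sizes are already known, for example from a size column in the product table.

```python
import os
import tempfile


def find_incomplete(expected_sizes, remove=False):
    """Return local paths whose on-disk size differs from the expected size.

    expected_sizes maps local path -> expected size in bytes (assumed to be
    known in advance). Missing files are skipped, since a normal download
    will fetch them anyway.
    """
    bad = []
    for path, expected in expected_sizes.items():
        if not os.path.exists(path):
            continue
        if os.path.getsize(path) != expected:
            bad.append(path)
            if remove:
                os.remove(path)  # force a fresh download on the next run
    return bad


# Offline demonstration with two small stand-in files
with tempfile.TemporaryDirectory() as tmp:
    complete = os.path.join(tmp, "complete.fits")
    partial = os.path.join(tmp, "partial.fits")
    with open(complete, "wb") as f:
        f.write(b"\0" * 2880)
    with open(partial, "wb") as f:
        f.write(b"\0" * 1000)  # truncated copy

    expected = {complete: 2880, partial: 2880}
    flagged = find_incomplete(expected, remove=True)
    flagged_names = [os.path.basename(p) for p in flagged]
    remaining = sorted(os.listdir(tmp))

print(flagged_names)
print(remaining)
```

A size comparison only catches truncated files; it would not notice a file of the right length with wrong content.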

If my example corrupted file is a bad one, perhaps there is a way to force an error while in mid-download from MAST? Or could I make a copy of the file and only write out part of it?

In any case, this tweak to Astroquery query.py seems to fix the problem, because it actually over-writes my "junk" file with a complete FITS file: https://github.com/astropy/astroquery/pull/3232

eas342 avatar Mar 03 '25 19:03 eas342

If my example corrupted file is a bad one, perhaps there is a way to force an error while in mid-download from MAST? Or could I make a copy of the file and only write out part of it?

I believe this is what you wanted to showcase:

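import os

from astroquery.mast import Observations
from astropy.io import fits

fileN = 'jw01185103001_02102_00001-seg001_nrcalong_rate.fits'
# Start from a clean state; the file may not exist yet
if os.path.exists(fileN):
    os.remove(fileN)
# Download a complete copy first
Observations.download_file('mast:jwst/product/' + fileN)
# Keep only the first 1000 bytes to simulate an interrupted download
with open(fileN, 'r+b') as f:
    f.truncate(1000)
# The second call detects the size mismatch
Observations.download_file('mast:jwst/product/' + fileN)
tmpHDU = fits.open(fileN)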

We have the first 1000 bytes of the file, we run the download again, it detects that the file is incomplete, but instead of downloading the rest, it just results in an empty file.

Pharisaeus avatar Mar 03 '25 20:03 Pharisaeus

Yes, that's a better example, thanks!

eas342 avatar Mar 03 '25 20:03 eas342

Thanks @eas342 for opening this issue, and thanks @Pharisaeus for your helpful comments! The hard-coding of continuation=False predates my time at MAST, so I'm not entirely sure why it was set that way. When I remove that condition from the _download_file calls and let it default to True, things seem to work as expected and the tests pass.

@scfleming @dr-rodriguez Is there any particular reason that we would NOT want incomplete file downloads to continue where they left off?

snbianco avatar Mar 04 '25 18:03 snbianco

cc @ceb8

bsipocz avatar Mar 04 '25 18:03 bsipocz

Yeah @ceb8 would have the historical perspective. There's nothing I know of, unless for reasons I would not understand this is important for how the local caching works or used to work.

scfleming avatar Mar 04 '25 19:03 scfleming

I have no knowledge of the MAST data system, but just to throw out some ideas as to why attempts at continuation might be "wrong", depending on how this is implemented:

  • The size of the file might not always be known upfront. This might happen when some streaming process is applied on top of the file (for example re/compression, cutouts, header correction). In such a case the Content-Length header would not be present in the response. I'm not sure how this would be handled, but I guess it should result in downloading a fresh copy of the whole file.
  • The content/size of the file might not be the same over time, so an attempt at appending the "missing part" might result in a corrupted file (e.g. re-processing of the data and substituting the old version with a new one).
  • The backend of the data system might not support the Range header, or might handle it incorrectly (e.g. ignore the header and send back the whole file again).
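
One defensive way to handle these cases can be sketched as a small decision function that inspects the response before anything is written to disk. This is only an illustration under stated assumptions: `plan_write` is a hypothetical name, the headers are passed as a plain dict with exact-case keys, and none of this reflects what astroquery or MAST actually do.

```python
import re


def plan_write(status_code, headers, local_size, expected_total=None):
    """Decide what to do with a response to a ranged request.

    Returns one of:
      "append"    - body is the missing tail; append it to the local file
      "overwrite" - body is the whole file; replace the local file with it
      "restart"   - body is not usable; discard it and request the full file
    """
    if status_code == 200:
        # Server ignored the Range header and sent the whole file
        return "overwrite"
    if status_code != 206:
        # e.g. 416 Range Not Satisfiable: the body is not file content
        return "restart"

    # 206 Partial Content should carry "Content-Range: bytes start-end/total"
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)",
                         headers.get("Content-Range", ""))
    if match is None:
        return "restart"
    start, total = int(match.group(1)), match.group(3)
    if start != local_size:
        # The range does not begin where the local file ends
        return "restart"
    if expected_total is not None and total != "*" and int(total) != expected_total:
        # The remote file changed size since the first attempt
        return "restart"
    return "append"


ok = plan_write(206, {"Content-Range": "bytes 1000-83519/83520"}, 1000, 83520)
ignored = plan_write(200, {"Content-Length": "83520"}, 1000, 83520)
changed = plan_write(206, {"Content-Range": "bytes 1000-90000/90001"}, 1000, 83520)
unsupported = plan_write(416, {}, 1000, 83520)
print(ok, ignored, changed, unsupported)
```

A size check like this cannot detect a file that was re-processed to the same length; HTTP's If-Range header with an ETag or Last-Modified validator exists for that case, provided the server supports it.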

Pharisaeus avatar Mar 04 '25 21:03 Pharisaeus

Hmmmmm, yes, I vaguely remember this. I think it was not always knowing the file size up front, or that sometimes the file size reported was wrong. But I can't remember the exact scenarios.

ceb8 avatar Mar 18 '25 15:03 ceb8