Downloader ignores all but four common data file extensions.

Open jjmcnelis opened this issue 2 years ago • 4 comments

This behavior is confusing to S-MODE users, who were expecting the data-downloader commands given on dataset landing pages to work out-of-the-box, e.g. https://podaac.jpl.nasa.gov/dataset/SMODE_LX_SHIPBOARD_ADCP_V1#capability-modal-download

S-MODE ADCP data files have .nc4 extension, so they're ignored by data-downloader by default. One must use the -e .nc4 option to download the files. That isn't mentioned in the modal on the dataset landing page.

Is there any reason not to download every file listed in the UMM-G RelatedUrls with Subtype="GET DATA"? (That change would render the file extension option obsolete; maybe it could be repurposed to ignore by extension instead.)

@jjmcnelis @ScienceCat18 @torimcd provided the following guidance to T. Farrar and C. Rocha on 2022-11-09:

podaac-data-downloader ignores files with .nc4 extension by default. This command works:

podaac-data-downloader -c SMODE_LX_SHIPBOARD_ADCP_V1 -d ./data/SMODE_LX_SHIPBOARD_ADCP_V1 --start-date 2021-08-01T00:00:00Z --end-date 2022-11-01T00:00:00Z -e .nc4

The defaults are .nc, .h5, .zip, .tar.gz, according to: https://github.com/podaac/data-subscriber/blob/main/Downloader.md
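A minimal sketch of why the default list skips .nc4 files, assuming the filter is a plain suffix match on the file URL. The helper name `matches_extension` and the example URL are hypothetical, not taken from the downloader's code:

```python
# Defaults as listed in Downloader.md.
DEFAULT_EXTENSIONS = (".nc", ".h5", ".zip", ".tar.gz")


def matches_extension(url, extensions=DEFAULT_EXTENSIONS):
    """Return True if the URL ends with one of the extensions.

    Assumption: matching is a case-insensitive suffix test; the real
    downloader may match differently.
    """
    return url.lower().endswith(tuple(extensions))


# Hypothetical S-MODE file URL, for illustration only.
url = "https://example.com/data/SMODE_adcp_20210801.nc4"
print(matches_extension(url))              # ".nc4" is not in the defaults
print(matches_extension(url, (".nc4",)))   # equivalent of passing -e .nc4
```

Under this assumption a name ending in .nc4 does not end in .nc, so the file is dropped unless -e .nc4 is given.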

jjmcnelis avatar Nov 10 '22 21:11 jjmcnelis

Run a regression across multiple collections to see what unintended downloads would look like.

mike-gangl avatar Jan 19 '23 18:01 mike-gangl

I have been thinking about this. The current subscriber behavior is to add all GET DATA and EXTENDED METADATA URLs from CMR into a single list, then iterate over it and filter by extension.

Changing the extension handling would 'break' this, and we'd begin downloading metadata by default (one can argue this is a good thing, but it's a change from how the system has behaved so far).

My thought process:

  • Add another flag, --include-metadata (default false). If specified, download the "EXTENDED METADATA" types.
  • Apply extension filtering only if extensions are provided.

The combination of these two options will (1) get users the data more easily and (2) not change the existing behavior of not including checksums (in the default cases).
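The two options above could be sketched roughly as follows. This is an illustration, not the subscriber's implementation: the function name `select_urls`, the dictionary keys "URL" and "Type", and the example URLs are all assumptions made for the sketch.

```python
def select_urls(related_urls, extensions=None, include_metadata=False):
    """Pick the URLs to download from a granule's related URLs.

    Assumption: each entry is a dict with "URL" and "Type" keys.
    - "GET DATA" entries are always candidates.
    - "EXTENDED METADATA" entries are candidates only with include_metadata.
    - The extension filter applies only when extensions are provided.
    """
    wanted_types = {"GET DATA"}
    if include_metadata:
        wanted_types.add("EXTENDED METADATA")

    selected = []
    for entry in related_urls:
        if entry.get("Type") not in wanted_types:
            continue
        url = entry["URL"]
        if extensions and not url.lower().endswith(tuple(extensions)):
            continue
        selected.append(url)
    return selected


# Hypothetical granule entries, for illustration only.
urls = [
    {"URL": "https://example.com/g1.nc4", "Type": "GET DATA"},
    {"URL": "https://example.com/g1.nc4.md5", "Type": "EXTENDED METADATA"},
]
print(select_urls(urls))                         # data file only
print(select_urls(urls, include_metadata=True))  # data file and checksum file
```

With no -e equivalent supplied, the .nc4 file is selected and the checksum file is still skipped, which matches the intent of keeping the default behavior for metadata.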

mike-gangl avatar Mar 30 '23 13:03 mike-gangl

OK, after starting this work I ran into some other issues that will require more rework internal to the subscriber.

We essentially get all the granule results and then make several lists of items, checksums, data files, metadata files, start/stop dates, and cycles. The problem is with granules that have multiple data files, and with ones where removing the "suffix" doesn't give you the native-id (e.g. S6 granules with a .nc and a bufr.bin data file). We need to refactor this into "downloadable" objects that encapsulate all of the data required for a download (cycle, start time, checksum, etc.) so that we don't have to do this complex lookup. I'm going to move this to a post-2.0 fix so that we can focus on 1.13 and then the Harmony integration, but it should really be done soon, as it will make adding features cleaner in the future.
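One possible shape for such a "downloadable" object, as a sketch only. The class name, field names, the `build_downloadables` helper, and the example granule are hypothetical and not part of the subscriber:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Downloadable:
    """Everything needed to download and verify one file (hypothetical)."""
    url: str
    granule_id: str                       # native-id of the parent granule
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
    cycle: Optional[int] = None
    checksum: Optional[str] = None
    checksum_algorithm: Optional[str] = None


def build_downloadables(granule_id, urls, checksums,
                        start_time=None, end_time=None, cycle=None):
    """Create one Downloadable per file URL of a granule.

    Assumption: `checksums` maps file name -> (value, algorithm).
    Granule-level fields are copied onto every file, so no later lookup
    by stripped suffix is needed.
    """
    items = []
    for url in urls:
        name = url.rsplit("/", 1)[-1]
        value, algorithm = checksums.get(name, (None, None))
        items.append(Downloadable(url, granule_id, start_time, end_time,
                                  cycle, value, algorithm))
    return items


# A granule with two data files, as in the S6 case described above.
files = build_downloadables(
    "S6_EXAMPLE_GRANULE",
    ["https://example.com/S6_example.nc",
     "https://example.com/S6_example.bufr.bin"],
    {"S6_example.nc": ("abc123", "MD5")},
    cycle=42,
)
for f in files:
    print(f.granule_id, f.url, f.cycle, f.checksum)
```

Because each file carries its own granule id, cycle, and checksum, a granule with several data files no longer depends on file names mapping back to the native-id.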

mike-gangl avatar Apr 11 '23 22:04 mike-gangl

Here's another data point. We got this feedback from a user:

Might suggest no defaults when the -e option is not used. For me, no -e option means search for files with any extension, so one can see all available files. I could see users wanting to do this, especially with the --dry-run option.

skorper avatar May 04 '23 23:05 skorper