
Archive (ZIP, 7z, rar, ...) extension

Open m-mohr opened this issue 4 years ago • 16 comments

Taken from https://github.com/radiantearth/stac-spec/issues/921#issuecomment-746398428:

I thought it would be useful to add something like an "archive" extension, which allows specifying individual files from archives in assets, e.g. linking to an XML and a TIFF file in a ZIP so that you can use, for example, eo:bands to describe the TIFF in the archive. Something like archive:path could describe the path of the file in the archive (e.g. ZIP), allowing direct mapping and extraction. It may also be useful for Zarr or similar?

@DanielJDufour agreed in a private message:

I saw your comments here about a possible "archive" extension [..] and just wanted to say that I think it would be a great idea. I often see data producers who don't have any experience with web technologies releasing files as zip files and it could be a good opportunity for software devs to create STAC catalogs of the metadata without having to host the actual files.

m-mohr avatar Jan 27 '21 16:01 m-mohr

I'm very much in favour of an archive extension. We're currently experimenting with using STAC to automate data management (downloading) between server and client applications. We need the client to check if files have already been downloaded and extracted, so it needs to know the contents of the archive. We added the contents and target_folder objects to aid in this:

"assets": {
        "analytic": {
            "href": "https://my.server.com/data/arctic/2020_37_258.zip",
            "roles": ["data"],
            "type": "application/zip",
            "contents": [
                {
                    "name": "2020_37_258.cnt"
                }
            ],
            "target_folder": "./data/artic/"
        },

brentfraser avatar Feb 03 '21 15:02 brentfraser

@brentfraser What exactly is target_folder referring to? I don't understand the purpose... is it the folder to extract to? If yes, isn't that a client option? It doesn't seem to belong in a STAC file. The folder given may not even be writable on my machine, for example.

Just for future reference, here's an example of how I imagined it to be, e.g. for a ZIP file containing an image (data/image.tiff) and a metadata file (meta/iso.xml):

{
  ...
  "assets": {
    "analytic": {
      "href": "https://my.server.com/data/category/example.zip",
      "roles": [
        "data",
        "archive"
      ],
      "type": "application/zip",
      "eo:bands": [
        ...
      ],
      "archive:href": "data/image.tiff",
      "archive:type": "image/tiff"
    },
    "metadata": {
      "href": "https://my.server.com/data/category/example.zip",
      "roles": [
        "metadata",
        "archive"
      ],
      "type": "application/zip",
      "archive:href": "meta/iso.xml",
      "archive:type": "application/xml"
    }
  }
}
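
A minimal sketch of how a client might resolve such an asset, assuming the proposed (not standardized) archive:href field and a ZIP archive that has already been downloaded. The helper name read_archive_member is hypothetical, and the archive is built in memory only to keep the example self-contained:

```python
import io
import zipfile


def read_archive_member(archive_bytes: bytes, asset: dict) -> bytes:
    """Return the bytes of the file that the asset's archive:href points to.

    Assumes a ZIP archive and the proposed `archive:href` field; neither
    this helper nor the field is part of the STAC spec.
    """
    member_path = asset["archive:href"]
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.read(member_path)


# Build a small stand-in for example.zip in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("data/image.tiff", b"fake tiff bytes")
    zf.writestr("meta/iso.xml", b"<metadata/>")
example_zip = buffer.getvalue()

metadata_asset = {
    "href": "https://my.server.com/data/category/example.zip",
    "roles": ["metadata", "archive"],
    "type": "application/zip",
    "archive:href": "meta/iso.xml",
    "archive:type": "application/xml",
}

iso_xml = read_archive_member(example_zip, metadata_asset)
```

A client that does not know the archive:* fields would still see a valid asset pointing at the ZIP file.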

m-mohr avatar Feb 03 '21 15:02 m-mohr

The use case is for automatic downloading by the client, so we added the target_folder as a hint as to where to save the zip file. We could have just had the client app decide, but this adds a little bit of flexibility for the future. It's very application-specific so I wouldn't expect it to be added to the STAC spec.

Since the archive could have more than one file, I added the contents as a list of objects. Using your example as a guide we could have:

        "archive:contents": [
            {
                "href": "2020_37_258.cnt",
                "type": "application/octet-stream"
            },
            {
                "href": "2020_37_258meta.txt",
                "type": "text/plain"
            },

brentfraser avatar Feb 03 '21 15:02 brentfraser

@brentfraser Yeah, my example lists the archive twice (i.e. as often as there are assets). I guess that's a matter of how we understand assets. It could either list the archive once or multiple times. The latter is easier for validation, and adding specific things like eo:bands to assets works better (I updated my example above). I guess my proposal would also work in your case, right?

m-mohr avatar Feb 03 '21 16:02 m-mohr

@m-mohr Ah, I missed that the archive name was the same for both of the contained assets. I see the wisdom of your approach in that it allows for using all of the STAC spec asset objects to describe the content. The little bit of duplication in specifying the href/roles/type and the overhead of the client comparing the href is worth the flexibility when multiple files are in the archive. Thanks!
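
The href comparison mentioned above could look roughly like this: a sketch that groups assets by their archive href so each archive is downloaded only once. The function name group_assets_by_archive is hypothetical, and the archive:href field is the proposal from this thread, not part of the spec:

```python
from collections import defaultdict


def group_assets_by_archive(assets: dict) -> dict:
    """Map each archive href to the (asset key, member path) pairs inside it.

    Assets without the proposed `archive:href` field are skipped, since
    they are taken to describe a whole file rather than an archive member.
    """
    members = defaultdict(list)
    for key, asset in assets.items():
        if "archive:href" in asset:
            members[asset["href"]].append((key, asset["archive:href"]))
    return dict(members)


assets = {
    "analytic": {
        "href": "https://my.server.com/data/category/example.zip",
        "type": "application/zip",
        "archive:href": "data/image.tiff",
    },
    "metadata": {
        "href": "https://my.server.com/data/category/example.zip",
        "type": "application/zip",
        "archive:href": "meta/iso.xml",
    },
}

grouped = group_assets_by_archive(assets)
```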

brentfraser avatar Feb 03 '21 16:02 brentfraser

An archive extension would be useful for the work we're doing on Radiant MLHub, as well. We are generating gzipped tar archives of all assets associated with our collections to make it easier for users to download a dataset in its entirety.

I'm thinking that one way we might use this would be in combination with the Collection-level assets introduced in #800. We would reference the tar archive as an asset on the Collection, but then Items would also have assets that point to the location of those resources within the main archive.

Because our collections can sometimes contain hundreds of thousands of items, I think the implementation that @m-mohr is suggesting would work better for us than the implementation that @brentfraser suggests. A single "contents" list on the collection would become too unwieldy, but having a way to reference an archive within each Item would be reasonable.

duckontheweb avatar Feb 03 '21 18:02 duckontheweb

@m-mohr Just to clarify, the archive:href and archive:type properties would be optional, yes? When referring to the archive itself as an asset it seems like neither of those would be relevant.

duckontheweb avatar Feb 03 '21 21:02 duckontheweb

Yes, I think so. Would anything else be needed except href and type?

m-mohr avatar Feb 03 '21 21:02 m-mohr

It would probably be useful for us to have something like an archive:size property that gives the total byte size of the archive. This might only be relevant on the main archive asset, though.

For the type, it would be good to be able to represent both the type of archive and any compression that was applied. I'm not sure this is possible with MIME types. I know we could use application/gzip to indicate the compression, but I don't think there's anything to indicate something like a tar archive.

duckontheweb avatar Feb 03 '21 22:02 duckontheweb

It would probably be useful for us to have something like an archive:size property that gives the total byte size of the archive.

We have recently added file:size, which would achieve exactly that, I guess. Although one could argue that it's the size of the file in the archive. On the other hand, it's counterintuitive if archive:size refers to the archive size while all other archive:... fields refer to the file in the archive.

For the type, it would be good to be able to represent both the type of archive and any compression that was applied. I'm not sure this is possible with MIME types.

I don't think that is possible with media types.

I know we could use application/gzip to indicate the compression, but I don't think there's anything to indicate something like a tar archive.

Yeah, that it's not just one level could be an issue.

One could think of nesting it so that the first level describes the gzip and the second level describes the tar. But that seems a bit over the top. Maybe we just make archive:type and archive:href arrays so that a client can work through the levels one after another:

href: 'example.tar.gz',
type: 'application/gzip',
archive:type: ['application/tar', 'image/tiff'],
archive:href: ['example.tar', 'image.tiff']
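
As a rough illustration of how a client might walk through such arrays level by level, here is a sketch for the gzip-then-tar case. The helper names unwrap and resolve_nested are hypothetical, the array form of archive:type and archive:href is only the idea floated above, and the archive is built in memory so the example runs on its own:

```python
import gzip
import io
import tarfile


def unwrap(data: bytes, container_type: str, member: str) -> bytes:
    """Open one container level and return the bytes of `member`."""
    if container_type == "application/gzip":
        # gzip wraps a single stream, so the member name is informational.
        return gzip.decompress(data)
    if container_type in ("application/tar", "application/x-tar"):
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tf:
            extracted = tf.extractfile(member)
            if extracted is None:
                raise ValueError(f"{member} is not a regular file")
            return extracted.read()
    raise ValueError(f"unsupported container type: {container_type}")


def resolve_nested(data: bytes, asset: dict) -> bytes:
    """Step through archive:href / archive:type, one level at a time."""
    container_type = asset["type"]
    for member, member_type in zip(asset["archive:href"], asset["archive:type"]):
        data = unwrap(data, container_type, member)
        container_type = member_type
    return data


# Build a stand-in for example.tar.gz in memory.
payload = b"fake tiff bytes"
tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w") as tf:
    info = tarfile.TarInfo(name="image.tiff")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
example_tar_gz = gzip.compress(tar_buffer.getvalue())

asset = {
    "href": "example.tar.gz",
    "type": "application/gzip",
    "archive:type": ["application/tar", "image/tiff"],
    "archive:href": ["example.tar", "image.tiff"],
}

image_bytes = resolve_nested(example_tar_gz, asset)
```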

m-mohr avatar Feb 03 '21 22:02 m-mohr

What would you all think about adding properties to indicate the byte range for the asset inside a zip file? For example, if you had a zip for 4 different crop types (corn, wheat, barley, rice):

{
    "assets": {
        "corn": {
            "href": "https://my.server.com/data/category/example.zip",
            "roles": [
                "data",
                "archive"
            ],
            "type": "application/zip",
            "archive:href": "data/corn.tif",
            "archive:type": "image/tiff",
            "archive:start": 0,
            "archive:end": 73242
        },
        "wheat": {
            "href": "https://my.server.com/data/category/example.zip",
            "roles": [
                "data",
                "archive"
            ],
            "type": "application/zip",
            "archive:href": "data/wheat.tif",
            "archive:type": "image/tiff",
            "archive:start": 73243,
            "archive:end": 132021
        }
    }
}

We could also consider something like "archive:range": "bytes=73243-132021" to mimic HTTP Range Requests (https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests).
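
To make the idea concrete, here is a small sketch that turns the proposed archive:start and archive:end fields into an HTTP Range header. It assumes both offsets are inclusive, as in HTTP byte ranges; the helper names range_header and slice_local are hypothetical, and the fields themselves are only a suggestion in this thread:

```python
def range_header(asset: dict) -> dict:
    """Build a Range request header from the proposed byte-offset fields.

    Assumes `archive:start` and `archive:end` are inclusive byte offsets,
    matching the semantics of HTTP byte ranges.
    """
    start = asset["archive:start"]
    end = asset["archive:end"]
    return {"Range": f"bytes={start}-{end}"}


def slice_local(archive_bytes: bytes, asset: dict) -> bytes:
    """Apply the same inclusive range to an archive already held in memory."""
    return archive_bytes[asset["archive:start"]:asset["archive:end"] + 1]


wheat = {
    "href": "https://my.server.com/data/category/example.zip",
    "type": "application/zip",
    "archive:href": "data/wheat.tif",
    "archive:type": "image/tiff",
    "archive:start": 73243,
    "archive:end": 132021,
}

headers = range_header(wheat)
```

The resulting dictionary could be passed as request headers to any HTTP client. Whether the returned bytes are directly usable depends on how the member is stored in the archive, for example whether it is compressed.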

Here are some complications:

  • The client may need to assume other relevant metadata found in the zip file's central directory.
  • Not all archive formats concatenate files like zip, so this wouldn't apply in many cases.

Here are some benefits:

  • Clients could automatically grab the assets from a zip file without reading the header first
  • You could calculate the file size quite easily (but this is probably covered by the fileinfo extension...)

I'm not strongly opinionated about it, but just curious to hear your thoughts. We could also consider rolling this into a zip-specific extension. Does STAC have a concept of sub-extensions like "archive:zip:start": 73243?

DanielJDufour avatar Feb 04 '21 00:02 DanielJDufour

It would probably be useful for us to have something like an archive:size property that gives the total byte size of the archive.

We have recently added file:size, which would achieve exactly that, I guess. Although one could argue that it's the size of the file in the archive. On the other hand, it's counterintuitive if archive:size refers to the archive size while all other archive:... fields refer to the file in the archive.

I agree that the scope of archive:size would be confusing since the other archive:* properties refer to files in the archive. I think file:size would be clear enough in the context of the other properties. In the absence of archive:href and archive:type properties we would infer that the asset (and the file:size property) refers to the entire archive. If those properties are present, we assume it refers to an individual member of the archive.
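
That inference rule could be sketched as follows. The function name describes_archive_member is hypothetical, and treating the presence of either field as sufficient is an assumption, since the comment above only covers the cases where both are present or both are absent:

```python
def describes_archive_member(asset: dict) -> bool:
    """Return True if the asset points at a file inside the archive.

    Without the proposed `archive:href` / `archive:type` fields, the asset
    (and fields such as file:size) is taken to describe the whole archive.
    """
    return "archive:href" in asset or "archive:type" in asset


whole_archive = {
    "href": "example.tar.gz",
    "roles": ["archive"],
    "type": "application/gzip",
}

member = {
    "href": "example.tar.gz",
    "type": "application/gzip",
    "archive:href": "data/image.tiff",
    "archive:type": "image/tiff",
}

is_whole = not describes_archive_member(whole_archive)
is_member = describes_archive_member(member)
```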

duckontheweb avatar Feb 04 '21 01:02 duckontheweb

I know we could use application/gzip to indicate the compression, but I don't think there's anything to indicate something like a tar archive.

Yeah, that it's not just one level could be an issue.

One could think of nesting it so that the first level describes the gzip and the second level describes the tar. But that seems a bit over the top. Maybe we just make archive:type and archive:href arrays so that a client can work through the levels one after another:

href: 'example.tar.gz',
type: 'application/gzip',
archive:type: ['application/tar', 'image/tiff'],
archive:href: ['example.tar', 'image.tiff']

This would work for members within the archive, but we would also want to have a property that is present for the archive asset itself that contains this info. In the case where an asset refers to the entire tar archive it seems like we wouldn't be using the archive:* properties, so the asset would look something like this...

"archive": {
      "href": "example.tar.gz",
      "roles": [
        "metadata",
        "archive"
      ],
      "type": "application/gzip"
}

...and it would be nice to have something in there that indicates how a client should handle the archive.

I agree that nesting it seems a bit heavy-handed. Here are a few thoughts on how we might handle it:

  • Allow type to be a list and do something similar to what you suggest for archive:type. In this case, our archive might look like:

    {
      ...
      "type": ["application/gzip", "application/x-tar"]
    ...
    }
    
  • Add an archive:format property that is defined at the archive level to handle this info. Then the archive asset might be:

     "archive": {
       "href": "example.tar.gz",
       "roles": [
         "metadata",
         "archive"
       ],
       "type": "application/gzip",
       "archive:format": "application/x-tar"
    }
    

    and a member of the archive might look like this:

    "image": {
      "href": "example.tar.gz",
      "type": "application/gzip",
      "archive:format": "application/x-tar", 
      "archive:type": "image/tiff",
      "archive:href": "data/image.tiff"
    }
    

    As you suggested for archive:size above, this could be counterintuitive if the other archive:* properties apply to archive members, but if it's clearly documented then it might be okay.

duckontheweb avatar Feb 04 '21 01:02 duckontheweb

We could also consider rolling this into a zip-specific extension. Does STAC have a concept of sub-extensions like "archive:zip:start": 73243?

Yes, I think I'd prefer a separate extension as it's very specific. We don't have anything like that where you can have two extension prefixes, but you can simply add zip:start and inherit from archive in the JSON Schema of the ZIP extension.

I agree that the scope of archive:size would be confusing since the other archive:* properties refer to files in the archive.

The way I'd design the extension is that it's fully backward compatible for clients that do not implement the archive extension. Thus all common fields must refer to the archive file. For example, file:size refers to the archive file size. On the other hand, properties like file:bits_per_sample no longer make much sense at the top level. Other properties work better, like eo:bands, which also seems reasonable to have at this level and simply indicates that those bands are in the zip file. Looking at the extensions, the only ones that seem somewhat problematic are file and timestamps. We need to clearly document how to use them in combination with archive and at the same time make sure it's compatible with clients not implementing archive.

@duckontheweb Honestly, I don't understand your second comment (https://github.com/radiantearth/stac-spec/issues/956#issuecomment-772955132). I don't see how it differs from my proposal for "two-level" archives. We won't be able to change type to an array, though.

m-mohr avatar Feb 04 '21 10:02 m-mohr

@duckontheweb Honestly, I don't understand your second comment (#956 (comment)). I don't see how it differs from my proposal for "two-level" archives. We won't be able to change type to an array, though.

Yeah, sorry for the rambling reply on that one... The only difference is that in your proposal the archive:type property would have one value that applies to the archive (application/tar) and one value that applies to the member within the archive (image/tiff). I was proposing that we separate those so that archive:format only ever refers to the archive itself. archive:type and archive:href would only be present for assets that refer to members within the archive. I'm open to pushback on this, though...

duckontheweb avatar Feb 09 '21 15:02 duckontheweb

What would you all think about adding properties to indicate the byte range for the asset inside a zip file?

I'd be +1 on this as long as it's optional. But clearly not all archive formats support random access to a single file.

kylebarron avatar Feb 09 '21 16:02 kylebarron

The extension now lives at https://github.com/constantinius/archive/

m-mohr avatar Apr 04 '23 16:04 m-mohr