Archive (ZIP, 7z, rar, ...) extension
Taken from https://github.com/radiantearth/stac-spec/issues/921#issuecomment-746398428:
What I thought would be useful is to add something like an "archive" extension, which allows specifying individual files from archives in assets, e.g. linking to an XML and a TIFF in a ZIP so that you can use, for example, eo:bands to describe the TIFF in the archive. Something like archive:path could describe the path of the file in the archive (e.g. ZIP) and allow direct mapping and extraction. May also be useful for Zarr or so?
@DanielJDufour agreed in a private message:
I saw your comments here about a possible "archive" extension [..] and just wanted to say that I think it would be a great idea. I often see data producers who don't have any experience with web technologies releasing files as zip files and it could be a good opportunity for software devs to create STAC catalogs of the metadata without having to host the actual files.
I'm very much in favour of an archive extension. We're currently experimenting with using STAC to automate data management (downloading) between server and client applications. We need the client to check if files have already been downloaded and extracted, so it needs to know the contents of the archive. We added the contents and target_folder objects to aid in this:
"assets": {
  "analytic": {
    "href": "https://my.server.com/data/arctic/2020_37_258.zip",
    "roles": ["data"],
    "type": "application/zip",
    "contents": [
      {
        "name": "2020_37_258.cnt"
      }
    ],
    "target_folder": "./data/artic/"
  },
@brentfraser What exactly is target_folder referring to? I don't understand the purpose... is it the folder to extract to? If yes, isn't that a client option? It doesn't seem to belong in a STAC file. The folder given may not even be writable on my machine, for example.
Just for future reference, here's an example of how I imagined it to be, e.g. for a ZIP file containing an image (data/image.tif) and a metadata file (meta/iso.xml):
{
  ...
  "assets": {
    "analytic": {
      "href": "https://my.server.com/data/category/example.zip",
      "roles": [
        "data",
        "archive"
      ],
      "type": "application/zip",
      "eo:bands": [
        ...
      ],
      "archive:href": "data/image.tif",
      "archive:type": "image/tiff"
    },
    "metadata": {
      "href": "https://my.server.com/data/category/example.zip",
      "roles": [
        "metadata",
        "archive"
      ],
      "type": "application/zip",
      "archive:href": "meta/iso.xml",
      "archive:type": "application/xml"
    }
  }
}
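To make the idea concrete, here is a minimal Python sketch of how a client might use archive:href to pull one file out of a ZIP. The function name read_archive_member is hypothetical, the archive is built in memory instead of being downloaded from the href, and archive:href / archive:type are only the field names proposed in this thread, not part of any released extension.

```python
import io
import zipfile


def read_archive_member(archive_bytes: bytes, asset: dict) -> bytes:
    """Return the bytes of the archive member named by asset["archive:href"]."""
    member_path = asset["archive:href"]
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.read(member_path)


# Build a small in-memory ZIP that stands in for example.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("data/image.tif", b"fake tiff bytes")
    zf.writestr("meta/iso.xml", b"<metadata/>")

# Asset as proposed above; a real client would download asset["href"] first.
metadata_asset = {
    "href": "https://my.server.com/data/category/example.zip",
    "roles": ["metadata", "archive"],
    "type": "application/zip",
    "archive:href": "meta/iso.xml",
    "archive:type": "application/xml",
}

metadata_bytes = read_archive_member(buffer.getvalue(), metadata_asset)
print(metadata_bytes)
```

A client that does not know the proposed fields would simply see a ZIP asset and could still download it as a whole.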
The use case is for automatic downloading by the client, so we added the target_folder as a hint as to where to save the zip file. We could have just had the client app decide, but this adds a little bit of flexibility for the future. It's very application-specific so I wouldn't expect it to be added to the STAC spec.
Since the archive could have more than one file, I added the contents as a list of objects. Using your example as a guide we could have:
"archive:contents": [
  {
    "href": "2020_37_258.cnt",
    "type": "application/octet-stream"
  },
  {
    "href": "2020_37_258meta.txt",
    "type": "text/plain"
  },
@brentfraser Yeah, my example lists the archive twice (i.e. as often as there are assets). I guess that's a matter of how we understand assets. It could either list the archive once or multiple times. The latter is easier for validation, and adding specific things like eo:bands to assets works better that way (I updated my example above). I guess my proposal would also work in your case, right?
@m-mohr Ah, I missed that the archive name was the same for both of the contained assets. I see the wisdom of your approach in that it allows for using all of the STAC spec asset objects to describe the content. The little bit of duplication in specifying the href/roles/type and the overhead of the client comparing the href are worth the flexibility when multiple files are in the archive. Thanks!
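The href comparison mentioned here could be sketched roughly as follows, so that a client downloads each archive only once. This is an illustration only: group_assets_by_archive is a hypothetical helper, and it assumes that an asset points into an archive exactly when it carries the proposed archive:href field.

```python
from collections import defaultdict


def group_assets_by_archive(assets: dict) -> dict:
    """Map each archive href to the keys of the assets that point into it."""
    groups = defaultdict(list)
    for key, asset in assets.items():
        # Assumption: only assets with archive:href refer to archive members.
        if "archive:href" in asset:
            groups[asset["href"]].append(key)
    return dict(groups)


assets = {
    "analytic": {
        "href": "https://my.server.com/data/category/example.zip",
        "type": "application/zip",
        "archive:href": "data/image.tif",
        "archive:type": "image/tiff",
    },
    "metadata": {
        "href": "https://my.server.com/data/category/example.zip",
        "type": "application/zip",
        "archive:href": "meta/iso.xml",
        "archive:type": "application/xml",
    },
}

groups = group_assets_by_archive(assets)
print(groups)
```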
An archive extension would be useful for the work we're doing on Radiant MLHub, as well. We are generating gzipped tar archives of all assets associated with our collections to make it easier for users to download a dataset in its entirety.
I'm thinking that one way we might use this would be in combination with the Collection-level assets introduced in #800. We would reference the tar archive as an asset on the Collection, but then Items would also have assets that point to the location of those resources within the main archive.
Because our collections can sometimes contain hundreds of thousands of items, I think the implementation that @m-mohr is suggesting would work better for us than the implementation that @brentfraser suggests. A single "contents" list on the collection would become too unwieldy, but having a way to reference an archive within each Item would be reasonable.
@m-mohr Just to clarify, the archive:href and archive:type properties would be optional, yes? When referring to the archive itself as an asset it seems like neither of those would be relevant.
Yes, I think so. Would anything else be needed except href and type?
It would probably be useful for us to have something like an archive:size property that gives the total byte size of the archive. This might only be relevant on the main archive asset, though.
For the type, it would be good to be able to represent both the type of archive and any compression that was applied. I'm not sure this is possible with MIME types. I know we could use application/gzip to indicate the compression, but I don't think there's anything to indicate something like a tar archive.
It would probably be useful for us to have something like an archive:size property that gives the total byte size of the archive.
We have recently added file:size, which would achieve exactly that, I guess. Although one could argue that it's the size of the file in the archive. On the other hand, it's counterintuitive if archive:size refers to the archive size while all other archive:... fields refer to the file in the archive.
For the type, it would be good to be able to represent both the type of archive and any compression that was applied. I'm not sure this is possible with MIME types.
I don't think that is possible with media types.
I know we could use application/gzip to indicate the compression, but I don't think there's anything to indicate something like a tar archive.
Yeah, that it's not just one level could be an issue.
One could think of nesting it so that the first level describes the gzip and the second level describes the tar. But that seems a bit over the top. Maybe we just make archive:type and archive:href arrays so that a client can work through the levels one after another:
"href": "example.tar.gz",
"type": "application/gzip",
"archive:type": ["application/tar", "image/tiff"],
"archive:href": ["example.tar", "image.tiff"]
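A rough Python sketch of how a client could walk through such arrays level by level is shown below. Everything here is an assumption for illustration: resolve_nested_member and the UNWRAPPERS table are hypothetical, the pairing of the container type at level i with archive:href[i] is one possible reading of the proposal, and application/tar is taken verbatim from the example above.

```python
import gzip
import io
import tarfile


def unwrap_gzip(data: bytes, member: str) -> bytes:
    # A gzip stream holds a single payload, so the member name is not needed.
    return gzip.decompress(data)


def unwrap_tar(data: bytes, member: str) -> bytes:
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tf:
        extracted = tf.extractfile(member)
        if extracted is None:
            raise ValueError(f"{member!r} is not a regular file")
        return extracted.read()


# Hypothetical dispatch table keyed by the media types used in the example.
UNWRAPPERS = {"application/gzip": unwrap_gzip, "application/tar": unwrap_tar}


def resolve_nested_member(data: bytes, asset: dict) -> bytes:
    """Peel one container per level until the innermost file is reached."""
    # The outermost container is described by "type"; every later container
    # is described by the preceding entry of "archive:type".
    container_types = [asset["type"], *asset["archive:type"][:-1]]
    for container_type, member in zip(container_types, asset["archive:href"]):
        data = UNWRAPPERS[container_type](data, member)
    return data


# Build an in-memory example.tar.gz containing image.tiff.
payload = b"fake tiff bytes"
tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w") as tf:
    info = tarfile.TarInfo(name="image.tiff")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
archive_bytes = gzip.compress(tar_buffer.getvalue())

asset = {
    "href": "example.tar.gz",
    "type": "application/gzip",
    "archive:type": ["application/tar", "image/tiff"],
    "archive:href": ["example.tar", "image.tiff"],
}
result = resolve_nested_member(archive_bytes, asset)
```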
What would you all think about adding properties to indicate the byte range for the asset inside a zip file? For example, if you had a zip for 4 different crop types (corn, wheat, barley, rice):
{
  "assets": {
    "corn": {
      "href": "https://my.server.com/data/category/example.zip",
      "roles": [
        "data",
        "archive"
      ],
      "type": "application/zip",
      "archive:href": "data/corn.tif",
      "archive:type": "image/tiff",
      "archive:start": 0,
      "archive:end": 73242
    },
    "wheat": {
      "href": "https://my.server.com/data/category/example.zip",
      "roles": [
        "data",
        "archive"
      ],
      "type": "application/zip",
      "archive:href": "data/wheat.tif",
      "archive:type": "image/tiff",
      "archive:start": 73243,
      "archive:end": 132021
    }
  }
}
We could also consider something like "archive:range": "bytes=73243-132021" to mimic HTTP Range Requests (https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests).
Here are some complications:
- The client may need to assume other relevant metadata found in the zip file's central directory.
- Not all archive formats concatenate files like zip, so this wouldn't apply in many cases.
Here are some benefits:
- Clients could automatically grab the assets from a zip file without reading the header first
- You could calculate the file size quite easily (but this is probably covered by the fileinfo extension...)
I'm not strongly opinionated about it, but just curious to hear your thoughts. We could also consider rolling this into a zip-specific extension. Does STAC have a concept of sub-extensions like "archive:zip:start": 73243?
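As a sketch of how a client might turn either variant into an HTTP Range header, consider the Python below. The helper names range_header and slice_member are hypothetical, and it assumes that archive:end is inclusive, like the last byte position in an HTTP range, which the consecutive values in the example suggest but the thread does not state. Note that in a real ZIP each entry also has a local file header and may be compressed, so the selected bytes are not necessarily the final file contents.

```python
def range_header(asset: dict) -> dict:
    """Build an HTTP Range header from the proposed byte-range fields."""
    if "archive:range" in asset:
        # Variant that already mimics the HTTP syntax, e.g. "bytes=0-99".
        return {"Range": asset["archive:range"]}
    start = asset["archive:start"]
    end = asset["archive:end"]  # assumed to be inclusive
    return {"Range": f"bytes={start}-{end}"}


def slice_member(archive_bytes: bytes, asset: dict) -> bytes:
    """Cut the same inclusive range out of an archive that is already local."""
    return archive_bytes[asset["archive:start"]:asset["archive:end"] + 1]


wheat = {
    "href": "https://my.server.com/data/category/example.zip",
    "type": "application/zip",
    "archive:href": "data/wheat.tif",
    "archive:type": "image/tiff",
    "archive:start": 73243,
    "archive:end": 132021,
}
headers = range_header(wheat)
print(headers)
```

The headers dictionary could then be passed to whichever HTTP client is in use, provided the server supports range requests.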
It would probably be useful for us to have something like an archive:size property that gives the total byte size of the archive.
We have recently added file:size, which would achieve exactly that, I guess. Although one could argue that it's the size of the file in the archive. On the other hand, it's counterintuitive if archive:size refers to the archive size while all other archive:... fields refer to the file in the archive.
I agree that the scope of archive:size would be confusing since the other archive:* properties refer to files in the archive. I think file:size would be clear enough in the context of the other properties. In the absence of archive:href and archive:type properties we would infer that the asset (and the file:size property) refers to the entire archive. If those properties are present, we assume it refers to an individual member of the archive.
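The inference rule described here is small enough to sketch in a few lines of Python. The function name asset_scope is hypothetical, and the rule only reflects the reading proposed in this comment, not an agreed behavior of the extension.

```python
def asset_scope(asset: dict) -> str:
    """Return "member" if the asset points into an archive, else "archive"."""
    if "archive:href" in asset or "archive:type" in asset:
        return "member"
    return "archive"


whole_archive = {"href": "example.tar.gz", "type": "application/gzip", "file:size": 1024}
member = {
    "href": "example.tar.gz",
    "type": "application/gzip",
    "archive:href": "data/image.tiff",
    "archive:type": "image/tiff",
    "file:size": 512,
}
print(asset_scope(whole_archive), asset_scope(member))
```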
I know we could use application/gzip to indicate the compression, but I don't think there's anything to indicate something like a tar archive.
Yeah, that it's not just one level could be an issue.
One could think of nesting it so that the first level describes the gzip and the second level describes the tar. But that seems a bit over the top. Maybe we just make archive:type and archive:href arrays so that a client can work through the levels one after another:
"href": "example.tar.gz",
"type": "application/gzip",
"archive:type": ["application/tar", "image/tiff"],
"archive:href": ["example.tar", "image.tiff"]
This would work for members within the archive, but we would also want to have a property that is present for the archive asset itself that contains this info. In the case where an asset refers to the entire tar archive it seems like we wouldn't be using the archive:* properties, so the asset would look something like this...
"archive": {
  "href": "example.tar.gz",
  "roles": [
    "metadata",
    "archive"
  ],
  "type": "application/gzip"
}
...and it would be nice to have something in there that indicates how a client should handle the archive.
I agree that nesting it seems a bit heavy-handed. Here are a few thoughts on how we might handle it:
- Allow type to be a list and do something similar to what you suggest for archive:type. In this case, our archive might look like:
{ ... "type": ["application/gzip", "application/x-tar"] ... }
- Add an archive:format property that is defined at the archive level to handle this info. Then the archive asset might be:
"archive": {
  "href": "example.tar.gz",
  "roles": [
    "metadata",
    "archive"
  ],
  "type": "application/gzip",
  "archive:format": "application/x-tar"
}
and a member of the archive might look like this:
"image": {
  "href": "example.tar.gz",
  "type": "application/gzip",
  "archive:format": "application/x-tar",
  "archive:type": "image/tiff",
  "archive:href": "data/image.tiff"
}
As you suggested for archive:size above, this could be counterintuitive if the other archive:* properties apply to archive members, but if it's clearly documented then it might be okay.
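For the second option, a client could combine type and archive:format to decide how to open the file. The Python sketch below is only one way to do that: read_tar_member and the TAR_MODES table are hypothetical, only the gzip case from the example is covered, and the archive is built in memory instead of being fetched from the href.

```python
import io
import tarfile

# Hypothetical mapping from the compression media type to a tarfile mode.
TAR_MODES = {"application/gzip": "r:gz"}


def read_tar_member(archive_bytes: bytes, asset: dict) -> bytes:
    """Open a compressed tar as described by type and archive:format."""
    if asset.get("archive:format") != "application/x-tar":
        raise ValueError("this sketch only handles tar archives")
    mode = TAR_MODES[asset["type"]]
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode=mode) as tf:
        extracted = tf.extractfile(asset["archive:href"])
        if extracted is None:
            raise ValueError("archive:href does not name a regular file")
        return extracted.read()


# Build an in-memory example.tar.gz with one member.
payload = b"fake tiff bytes"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tf:
    info = tarfile.TarInfo(name="data/image.tiff")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))

image_asset = {
    "href": "example.tar.gz",
    "type": "application/gzip",
    "archive:format": "application/x-tar",
    "archive:type": "image/tiff",
    "archive:href": "data/image.tiff",
}
image_bytes = read_tar_member(buffer.getvalue(), image_asset)
```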
We could also consider rolling this into a zip-specific extension. Does STAC have a concept of sub-extensions like "archive:zip:start": 73243?
Yes, I think I'd prefer a separate extension as it's very specific. We don't have anything like that where you can have two extension prefixes, but you can simply add zip:start and inherit from archive in the JSON Schema of the ZIP extension.
I agree that the scope of archive:size would be confusing since the other archive:* properties refer to files in the archive.
The way I'd design the extension is that it's fully backward compatible for clients that do not implement the archive extension. Thus all common fields must refer to the archive file. For example, file:size refers to the archive file size. On the other hand, properties like file:bits_per_sample no longer make much sense at the top level. Other properties work better, like eo:bands, which also seems reasonable to have at this level and just tells you that those bands are in the zip file. Looking at the extensions, the only ones that are somewhat problematic seem to be file and timestamps. We need to clearly document how to use them in combination with archive and at the same time make sure it's compatible with clients not implementing archive.
@duckontheweb Honestly, I don't understand your second comment (https://github.com/radiantearth/stac-spec/issues/956#issuecomment-772955132). I don't understand how it differs from my proposal for "two-level" archives?! We won't be able to change the type to an array though.
Yeah, sorry for the rambling reply on that one... The only difference is that in your proposal the archive:type property would have one value that applies to the archive (application/tar) and one value that applies to the member within the archive (image/tiff). I was proposing that we separate those so that archive:format only ever refers to the archive itself. archive:type and archive:href would only be present for assets that refer to members within the archive. I'm open to pushback on this, though...
What would you all think about adding properties to indicate the byte range for the asset inside a zip file?
I'd be +1 on this as long as it's optional. But clearly not all archive formats support random access to a single file.
The extension now lives at https://github.com/constantinius/archive/