Data Package Bundling (and maybe compression)
Updated: 2016-11-17
We want a way to "bundle" a data package into a single file for transmission. In addition, it may be compressed at the same time.
Note also that individual resources can be compressed in themselves - see #290
Desired Features
- Widely supported in client systems
- Ability to access data within the bundle easily and without downloading the entire bundle (e.g. to stream resources from the bundle)
Original Description
As other packaging types use compression for distributing each package (a JAR is a ZIP archive), there should be a section proposing a way to deal with compressed data packages.
@sabas do you have a specific suggestion? I think you are right this is useful.
/cc @paulfitz
I was thinking of a specification which would tell how to interpret a zipped package on the fly, in the same way a JAR is executed by Java. So I could expect:
- the compression algorithm: gzip?
- which files are needed for correct decompression or reading on the fly (like zcat and similar CLI tools)
- how to compress the datapackage
- which file extension or MIME type to use
@sabas I think this makes a lot of sense. Do you want to start speccing something out?
See #198
There was a lot of discussion in the PR. The PR basically suggested tar + gzip. Subsequent discussion in the PR suggested reviewing existing best practice more and using zip. Main excerpts:
@mfenner wrote:
In the spirit of keeping things simple I wouldn't provide two options (.dp and .dpz). And in the spirit of not reinventing the wheel I would look at https://researchobject.github.io/specifications/bundle/, which uses the Universal Container Format (UCF). Or, for a software packaging example, Chrome extensions: https://developer.chrome.com/extensions/packaging
Excerpt from Research Object bundle spec:
A UCF container is based on the ZIP compression file format [ZIP], enforcing additional restrictions. The most important restrictions are:
- Reserved filenames in the root directory: mimetype and META-INF
- Filenames must be encoded in UTF-8
- Compression must be Uncompressed or Flate
- may use Zip64 extensions, but should only do so when required
- The first file must be the uncompressed mimetype and without any extra attributes
UCF says about mimetype:
The first file in the Zip container must be a file with the ASCII name of mimetype, which holds the MIME type for the Zip container (application/epub+zip as an ASCII string; no padding, white-space, or case change).
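As an illustration of the mimetype-first rule, here is a minimal sketch using Python's standard `zipfile` module. The function name is my own placeholder, not anything the UCF or Data Package specs define:

```python
import zipfile

def write_ucf_container(path, mime, files):
    """Write a UCF-style ZIP whose first entry is an uncompressed mimetype file."""
    with zipfile.ZipFile(path, "w") as zf:
        # The mimetype entry must come first and must be stored, not deflated.
        zf.writestr(zipfile.ZipInfo("mimetype"), mime,
                    compress_type=zipfile.ZIP_STORED)
        # Remaining entries may use Deflate compression.
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

A reader can then sniff the container type by reading the first entry. Note that a data package container would need its own registered MIME type; `application/epub+zip` in the quote above is EPUB's.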
@tfmorris wrote:
I'll second @mfenner 's suggestion to exhaust all possible existing alternatives before defining a new format. If you are forced to define something new, I'd strongly consider using zip instead of tar, since every other container format in the world from JAR to EPUB to Research Object Bundle has settled on it. There's an old overview of a bunch of the zip-based formats here: http://broadcast.oreilly.com/2009/01/packaging-formats-of-famous-ap.html
@mfenner would you be interested in taking a bit of editorship here? You were a strong proponent of introducing this (and I'm +1 too). In addition, this should be a very simple and short spec to write once we decide what to do.
Let me think about how to approach this.
@mfenner any further thoughts? /cc @danfowler
I am increasingly thinking that "bundling" a data package into one file (compressed) is an important use case and would love your suggestions here.
@rgrp sorry for not following up on this. I want a standard zip compression, and hadn't found the time to spec out the details.
Bundling a data package into one file is an important use case for me.
For reference (although not directly related to a spec for compression): we went ahead and added zip support to the recently upgraded Python lib for Data Package, based on very clear use cases in the CKAN integration and on it being sensible and reasonable in general :). @vitorbaptista developed and led that initiative.
For reference:
- https://github.com/okfn/datapackage-py/blob/master/datapackage/datapackage.py#L231
- https://github.com/okfn/datapackage-py/blob/master/datapackage/datapackage.py#L277
@mfenner I imagine this can be super simple. Would you be able to start a draft and drop it in an issue here?
@vitorbaptista it would be useful to get an outline of what you did.
The requirements for my ZIP file loading were to be able to load ZIPs that follow the pattern:

```
./datapackage.json
./data/resource.csv
```

and also:

```
./my-datapackage/datapackage.json
./my-datapackage/data/resource.csv
```
This is because we wanted to support the ZIP files generated by GitHub (i.e. https://github.com/datasets/gdp/archive/master.zip), which have all contents inside a folder.
The actual code checks that the ZIP file has one and only one datapackage.json file and loads it. All paths in the data package are then relative to the datapackage.json. This allows any folder structure inside the ZIP file, as long as there is a single datapackage.json. It was easier to code this way :+1:
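That loading rule can be sketched with Python's standard `zipfile` module (the function names here are mine, not the actual datapackage-py API):

```python
import io
import posixpath
import zipfile

def find_descriptor(zf):
    """Locate the single datapackage.json inside an open ZipFile, at any depth.

    Returns (descriptor_name, base_dir); raises if zero or several are found.
    """
    matches = [n for n in zf.namelist()
               if posixpath.basename(n) == "datapackage.json"]
    if len(matches) != 1:
        raise ValueError("expected exactly one datapackage.json, found %d"
                         % len(matches))
    return matches[0], posixpath.dirname(matches[0])

def resolve_resource(base_dir, relative_path):
    # Resource paths in the descriptor are relative to datapackage.json.
    return posixpath.join(base_dir, relative_path)

# Usage: a GitHub-style archive with all contents inside one folder.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("gdp-master/datapackage.json", "{}")
    zf.writestr("gdp-master/data/gdp.csv", "year,value\n")
descriptor, base = find_descriptor(zipfile.ZipFile(buf))
```

With this approach the folder layout inside the ZIP doesn't matter, which is what makes it easy to code.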
+1 Makes a lot of sense.
I just hope you are aware of ZIP filename encoding problems: http://marcosc.com/2008/12/zip-files-and-encoding-i-hate-you/
Lots of users still stick to windows-1251 (Cyrillic) or Shift_JIS (Japanese).
Maybe it would be a good idea to pick an archive format that doesn't have such a design flaw (if such a format exists)?
That blog post is from 2008, is barely coherent, and seems focused more on the tools than the format.
What do you recommend instead of ZIP?
@mfenner are you happy to draft a mini-spec here? I imagine it could be just a few paragraphs saying e.g.
- We use zip
- datapackage.json must be at "base" of the zip
- any issues about "referencing" within the zip
- zip file naming conventions (if any)
I wouldn't limit it to datapackage.json only at the base of the zip, for the reasons I mentioned before (https://github.com/dataprotocols/dataprotocols/issues/132#issuecomment-179949441). I would suggest we either:
- Support datapackage.json either at the base of the zip or in a top-level folder;
- Support datapackage.json only in a top-level folder (i.e. the contents of the ZIP must be inside a single folder);
- Support datapackage.json anywhere inside the ZIP.

I would suggest we follow the 3rd option, as it's both easier to code and to explain.
I think it is better to be explicit in this case and limit the options for people. A single datapackage.json at the base of the zip or in a top-level folder is easy enough to understand and to code, so my vote goes to 1.
Option 1 would enforce the rules used by the datasets data packages.
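For comparison with the option-3 approach, a sketch of what an option-1 check might look like (my own code, not from any of the libraries discussed):

```python
import io
import zipfile

def descriptor_prefix(zf):
    """Return the path prefix of datapackage.json if the ZIP satisfies option 1:
    the descriptor sits either at the base of the zip or directly inside a
    single top-level folder. Raises ValueError otherwise."""
    names = zf.namelist()
    if "datapackage.json" in names:
        return ""  # descriptor at the base of the zip
    top_level = {n.split("/", 1)[0] for n in names}
    if len(top_level) == 1:
        folder = top_level.pop()
        if folder + "/datapackage.json" in names:
            return folder + "/"  # descriptor inside the single top-level folder
    raise ValueError("datapackage.json must be at the base "
                     "or in a single top-level folder")

# Usage: a GitHub-generated archive puts everything under one folder.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("gdp-master/datapackage.json", "{}")
prefix = descriptor_prefix(zipfile.ZipFile(buf))
```

The check is only slightly more code than option 3, but it rejects layouts where the descriptor is buried at an arbitrary depth.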
@tfmorris I propose 7zip, as it's open source, provides a better compression ratio, and supports UTF-8 filenames.
Even though 2008 is long past, the i18n problems with filenames are the same: a ZIP file created on a PC with a Korean locale, containing Korean filenames, will be unreadable gibberish after unzipping on a PC with a different locale. ZIP allows the use of different encodings for filenames, but doesn't record the original locale. It's less about the format than about the tools, but the problem still exists.
For reference, BagIt's serialization specification work doesn't actually mandate a given format, just rules for (de)serializing behavior:
Several rules govern the serialization of a bag and apply equally to all types of archive files
https://tools.ietf.org/html/draft-kunze-bagit-13#section-4
@mfenner are you still interested to work on a mini spec for this?
Having read the BagIt approach I think they got it pretty much right.
My only question would be about step 3 - we could instead serialize from within the data package directory so that the datapackage.json is at the root of the archive file. However, my guess is that the BagIt creators thought about this.
Next steps:
- Create a data-package-identifier draft
- Port the BagIt approach in there with appropriate tweaking (fulsomely acknowledging BagIt)
- Publish - suggest this is an extension rather than a core spec
Serialization
In some scenarios, it may be convenient to serialize the bag's
filesystem hierarchy (i.e., the base directory) into a single-file
archive format such as TAR or ZIP (the serialization) and then later
deserialize the serialization to recreate the filesystem hierarchy.
Several rules govern the serialization of a bag and apply equally to
all types of archive files:
1. The top-level directory of a serialization MUST contain only one
bag.
2. The serialization SHOULD have the same name as the bag's base
directory, but MUST have an extension added to identify the
format. For example, the receiver of "mybag.tar.gz" expects the
corresponding base directory to be created as "mybag".
3. A bag MUST NOT be serialized from within its base directory, but
from the parent of the base directory (where the base directory
appears as an entry). Thus, after a bag is deserialized in an
empty directory, a listing of that directory shows exactly one
entry. For example, deserializing "mybag.zip" in an empty
directory causes the creation of the base directory "mybag" and,
beneath "mybag", the creation of all payload and tag files.
4. The deserialization of a bag MUST produce a single base directory
bag with the top-level structure as described in this
specification without requiring any additional un-archiving step.
For example, after one un-archiving step it would be an error for
the "data/" directory to appear as "data.tar.gz". TAR and ZIP
files may appear inside the payload beneath the "data/"
directory, where they would be treated as any other payload file.
When serializing a bag, care must be taken to ensure that the archive
format's restrictions on file naming, such as allowable characters,
length, or character encoding, will support the requirements of the
systems on which it will be used. See Section 7.2.
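The BagIt rules above (2 and 3 in particular) can be sketched in Python; the function name is mine, and ZIP is chosen arbitrarily among the allowed archive formats:

```python
import os
import tempfile
import zipfile

def serialize_bag(base_dir):
    """Serialize a bag per BagIt rules 2 and 3: archive from the parent of the
    base directory (entries are prefixed with the base directory's name) and
    name the archive after the base directory plus a format extension."""
    parent, name = os.path.split(os.path.abspath(base_dir))
    archive = os.path.join(parent, name + ".zip")  # rule 2: same name + extension
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(base_dir):
            for f in files:
                full = os.path.join(root, f)
                # Rule 3: entry paths are relative to the parent,
                # e.g. "mybag/data/x.csv", so deserializing the archive in an
                # empty directory yields exactly one top-level entry.
                zf.write(full, os.path.relpath(full, parent))
    return archive

# Usage: build a tiny bag in a temporary directory and serialize it.
tmp = tempfile.mkdtemp()
bag = os.path.join(tmp, "mybag")
os.makedirs(os.path.join(bag, "data"))
with open(os.path.join(bag, "data", "x.csv"), "w") as fh:
    fh.write("a\n")
archive_path = serialize_bag(bag)
entries = zipfile.ZipFile(archive_path).namelist()
```

Applied to data packages, "mybag" would be the data package directory, with datapackage.json one level below the archive root.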
@rufuspollock will you work on some wording for this? Maybe better in here until I finish on #337
@pwalsh yes - note this is a patterns item at this stage. It won't be part of the spec at the moment, I think.
tar + zstd are great for this purpose.
Zstd is superior to gzip/zlib.
Tools exist and are available under a permissive (BSD) license.

```
C:\msys64\usr\bin\bsdtar.exe -a -cf - --format pax <files> -C . | zstd.exe - -19 -o R:\data.tar.zst
```
Related topic: https://github.com/frictionlessdata/specs/issues/290#issue-176224908