Data Package Bundling (and maybe compression)
Updated: 2016-11-17
We want a way to "bundle" a data package into a single file for transmission. In addition, it may be compressed at the same time.
Note also that individual resources can be compressed in themselves - see #290
Desired Features
- Widely supported in client systems
- Ability to access data within the bundle easily and without downloading the entire bundle (e.g. to stream resources from the bundle)
Original Description
As other packaging types use compression for distributing each package (a JAR is a ZIP archive), there should be a section proposing a way to deal with compressed data packages.
@sabas do you have a specific suggestion? I think you are right this is useful.
/cc @paulfitz
I was thinking of a specification which would tell how to interpret a zipped package on the fly, in the same way a JAR is executed by Java. So I could expect:
- the compression algorithm: gzip?
- which files are needed for correct decompression or reading on the fly (like zcat and similar CLI tools)
- how to compress the datapackage
- which file extension or MIME type to use
@sabas I think this makes a lot of sense. Do you want to start speccing something out?
See #198
There was a lot of discussion in the PR. The PR basically suggested tar + gzip. Subsequent discussion in the PR suggested reviewing existing best practice more and using zip. Main excerpts:
@mfenner wrote:
In the spirit of keeping things simple I wouldn't provide two options (.dp and .dpz). And in the spirit of not reinventing the wheel I would look at https://researchobject.github.io/specifications/bundle/, which uses the Universal Container Format (UCF). Or, for a software packaging example, Chrome extensions: https://developer.chrome.com/extensions/packaging
Excerpt from Research Object bundle spec:
A UCF container is based on the ZIP compression file format [ZIP], enforcing additional restrictions. The most important restrictions are:
- Reserved filenames in the root directory: mimetype and META-INF
- Filenames must be encoded in UTF-8
- Compression must be Uncompressed or Flate
- may use Zip64 extensions, but should only do so when required
- The first file must be the uncompressed mimetype and without any extra attributes
UCF says about mimetype:
The first file in the Zip container must be a file with the ASCII name of mimetype, which holds the MIME type for the Zip container (application/epub+zip as an ASCII string; no padding, white-space, or case change).
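As an illustration of the mimetype-first rule, here is a minimal sketch using Python's standard `zipfile` module. The function name is my own placeholder, not anything the UCF or Data Package specs define:

```python
import zipfile

def write_ucf_container(path, mime, files):
    """Write a UCF-style ZIP whose first entry is an uncompressed mimetype file."""
    with zipfile.ZipFile(path, "w") as zf:
        # The mimetype entry must come first and must be stored, not deflated.
        zf.writestr(zipfile.ZipInfo("mimetype"), mime,
                    compress_type=zipfile.ZIP_STORED)
        # Remaining entries may use Deflate compression.
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

A reader can then sniff the container type by reading the first entry. Note that a data package container would need its own registered MIME type; `application/epub+zip` in the quote above is EPUB's.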
@tfmorris wrote:
I'll second @mfenner 's suggestion to exhaust all possible existing alternatives before defining a new format. If you are forced to define something new, I'd strongly consider using zip instead of tar, since every other container format in the world from JAR to EPUB to Research Object Bundle has settled on it. There's an old overview of a bunch of the zip-based formats here: http://broadcast.oreilly.com/2009/01/packaging-formats-of-famous-ap.html
@mfenner would you be interested in taking a bit of editorship here? You were a strong proponent of introducing this (and I'm +1 too). In addition, this should be a very simple and short spec to write once we decide what to do.
Let me think about how to approach this.
@mfenner any further thoughts? /cc @danfowler
I am increasingly thinking that "bundling" a data package into one file (compressed) is an important use case and would love your suggestions here.
@rgrp sorry for not following up on this. I want a standard zip compression, and hadn't found the time to spec out the details.
Bundling a data package into one file is an important use case for me.
For reference (although not directly related to a spec for compression): we went ahead and added zip support to the recently upgraded Python lib for Data Package, based on very clear use cases in the CKAN integration and on it being sensible and reasonable in general :). @vitorbaptista developed and led that initiative.
For reference:
- https://github.com/okfn/datapackage-py/blob/master/datapackage/datapackage.py#L231
- https://github.com/okfn/datapackage-py/blob/master/datapackage/datapackage.py#L277
@mfenner I imagine this can be super simple. Would you be able to start a draft and drop it in an issue here?
@vitorbaptista it would be useful to get an outline of what you did.
The requirements for my ZIP file loading were to be able to load ZIPs that follow the pattern:

```
./datapackage.json
./data/resource.csv
```

and also:

```
./my-datapackage/datapackage.json
./my-datapackage/data/resource.csv
```
This is because we wanted to support the ZIP files generated by GitHub (i.e. https://github.com/datasets/gdp/archive/master.zip), which have all contents inside a folder.
The actual code checks that the ZIP file has one and only one datapackage.json file and loads it. All paths in the data package are then relative to the datapackage.json. This allows any folder structure inside the ZIP file, as long as there is a single datapackage.json. It was easier to code this way :+1:
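That loading rule can be sketched with Python's standard `zipfile` module (the function names here are mine, not the actual datapackage-py API):

```python
import io
import posixpath
import zipfile

def find_descriptor(zf):
    """Locate the single datapackage.json inside an open ZipFile, at any depth.

    Returns (descriptor_name, base_dir); raises if zero or several are found.
    """
    matches = [n for n in zf.namelist()
               if posixpath.basename(n) == "datapackage.json"]
    if len(matches) != 1:
        raise ValueError("expected exactly one datapackage.json, found %d"
                         % len(matches))
    return matches[0], posixpath.dirname(matches[0])

def resolve_resource(base_dir, relative_path):
    # Resource paths in the descriptor are relative to datapackage.json.
    return posixpath.join(base_dir, relative_path)

# Usage: a GitHub-style archive with all contents inside one folder.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("gdp-master/datapackage.json", "{}")
    zf.writestr("gdp-master/data/gdp.csv", "year,value\n")
descriptor, base = find_descriptor(zipfile.ZipFile(buf))
```

With this approach the folder layout inside the ZIP doesn't matter, which is what makes it easy to code.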
+1 Makes a lot of sense.
I just hope you are aware of ZIP filename encoding problems: http://marcosc.com/2008/12/zip-files-and-encoding-i-hate-you/
Lots of users still stick to windows-1251 (Cyrillic) or Shift_JIS (Japanese).
Maybe it would be a good idea to pick an archive format that doesn't have such a design flaw (if such a format exists)?
That blog post is from 2008, is barely coherent, and seems focused more on the tools than the format.
What do you recommend instead of ZIP?
@mfenner are you happy to draft a mini-spec here? I imagine it could be just a few paragraphs saying e.g.
- We use zip
- datapackage.json must be at "base" of the zip
- any issues about "referencing" within the zip
- zip file naming conventions (if any)
I wouldn't limit it to datapackage.json only at the base of the zip, for the reasons I mentioned before (https://github.com/dataprotocols/dataprotocols/issues/132#issuecomment-179949441). I would suggest we either:
- Support datapackage.json either at the base of the zip or in a top-level folder;
- Support datapackage.json only in a top-level folder (i.e. the contents of the ZIP must be inside a single folder);
- Support datapackage.json anywhere inside the ZIP.

I would suggest we follow the 3rd option, as it's both easier to code and to explain.
I think it is better to be explicit in this case and limit the options for people. A single datapackage.json at the base of the zip or in a top-level folder is easy enough to understand and to code, so my vote goes to 1.
Option 1 would enforce the rules used by the datasets data packages.
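For comparison with the option-3 approach, a sketch of what an option-1 check might look like (my own code, not from any of the libraries discussed):

```python
import io
import zipfile

def descriptor_prefix(zf):
    """Return the path prefix of datapackage.json if the ZIP satisfies option 1:
    the descriptor sits either at the base of the zip or directly inside a
    single top-level folder. Raises ValueError otherwise."""
    names = zf.namelist()
    if "datapackage.json" in names:
        return ""  # descriptor at the base of the zip
    top_level = {n.split("/", 1)[0] for n in names}
    if len(top_level) == 1:
        folder = top_level.pop()
        if folder + "/datapackage.json" in names:
            return folder + "/"  # descriptor inside the single top-level folder
    raise ValueError("datapackage.json must be at the base "
                     "or in a single top-level folder")

# Usage: a GitHub-generated archive puts everything under one folder.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("gdp-master/datapackage.json", "{}")
prefix = descriptor_prefix(zipfile.ZipFile(buf))
```

The check is only slightly more code than option 3, but it rejects layouts where the descriptor is buried at an arbitrary depth.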
@tfmorris I propose 7zip, as it's open source, provides a better compression ratio, and supports UTF-8 filenames.
Even though 2008 is long past, the i18n problems with filenames are the same: a ZIP file created on a PC with a Korean locale, containing Korean filenames, will be unreadable gibberish after unzipping on a PC with a different locale. ZIP allows the use of different encodings for filenames, but doesn't record the original locale. It's less about the format than about the tools, but the problem still exists.
For reference, BagIt's serialization specification work doesn't actually mandate a given format, just rules for (de)serializing behavior:
Several rules govern the serialization of a bag and apply equally to all types of archive files
https://tools.ietf.org/html/draft-kunze-bagit-13#section-4
@mfenner are you still interested to work on a mini spec for this?
Having read the BagIt approach I think they got it pretty much right.
My only question would be about step 3 - we could instead serialize from within the data package directory so that the datapackage.json is at the root of the archive file. However, my guess is that the BagIt creators thought about this.
Next steps:
- Create a data-package-identifier draft
- Port the BagIt approach in there with appropriate tweaking (fulsomely acknowledging BagIt)
- Publish - suggest this is an extension rather than a core spec
Serialization
In some scenarios, it may be convenient to serialize the bag's
filesystem hierarchy (i.e., the base directory) into a single-file
archive format such as TAR or ZIP (the serialization) and then later
deserialize the serialization to recreate the filesystem hierarchy.
Several rules govern the serialization of a bag and apply equally to
all types of archive files:
1. The top-level directory of a serialization MUST contain only one
bag.
2. The serialization SHOULD have the same name as the bag's base
directory, but MUST have an extension added to identify the
format. For example, the receiver of "mybag.tar.gz" expects the
corresponding base directory to be created as "mybag".
3. A bag MUST NOT be serialized from within its base directory, but
from the parent of the base directory (where the base directory
appears as an entry). Thus, after a bag is deserialized in an
empty directory, a listing of that directory shows exactly one
entry. For example, deserializing "mybag.zip" in an empty
directory causes the creation of the base directory "mybag" and,
beneath "mybag", the creation of all payload and tag files.
4. The deserialization of a bag MUST produce a single base directory
bag with the top-level structure as described in this
specification without requiring any additional un-archiving step.
For example, after one un-archiving step it would be an error for
the "data/" directory to appear as "data.tar.gz". TAR and ZIP
files may appear inside the payload beneath the "data/"
directory, where they would be treated as any other payload file.
When serializing a bag, care must be taken to ensure that the archive
format's restrictions on file naming, such as allowable characters,
length, or character encoding, will support the requirements of the
systems on which it will be used. See Section 7.2.
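The BagIt rules above (2 and 3 in particular) can be sketched in Python; the function name is mine, and ZIP is chosen arbitrarily among the allowed archive formats:

```python
import os
import tempfile
import zipfile

def serialize_bag(base_dir):
    """Serialize a bag per BagIt rules 2 and 3: archive from the parent of the
    base directory (entries are prefixed with the base directory's name) and
    name the archive after the base directory plus a format extension."""
    parent, name = os.path.split(os.path.abspath(base_dir))
    archive = os.path.join(parent, name + ".zip")  # rule 2: same name + extension
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(base_dir):
            for f in files:
                full = os.path.join(root, f)
                # Rule 3: entry paths are relative to the parent,
                # e.g. "mybag/data/x.csv", so deserializing the archive in an
                # empty directory yields exactly one top-level entry.
                zf.write(full, os.path.relpath(full, parent))
    return archive

# Usage: build a tiny bag in a temporary directory and serialize it.
tmp = tempfile.mkdtemp()
bag = os.path.join(tmp, "mybag")
os.makedirs(os.path.join(bag, "data"))
with open(os.path.join(bag, "data", "x.csv"), "w") as fh:
    fh.write("a\n")
archive_path = serialize_bag(bag)
entries = zipfile.ZipFile(archive_path).namelist()
```

Applied to data packages, "mybag" would be the data package directory, with datapackage.json one level below the archive root.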
@rufuspollock will you work on some wording for this? Maybe better in here until I finish on #337
@pwalsh yes - note this is a patterns item at this stage. It won't be part of the spec at the moment, I think.
tar + zstd are great for this purpose.
Zstd is superior to gzip/zlib.
Tools exist and are available under a permissive (BSD) license.

```
C:\msys64\usr\bin\bsdtar.exe -a -cf - --format pax <files> -C . | zstd.exe - -19 -o R:\data.tar.zst
```
Related topic: https://github.com/frictionlessdata/specs/issues/290#issue-176224908