
Draft CEP for `.conda` package format

Open jakirkham opened this issue 3 years ago • 8 comments

It would be good to have a CEP that spells out what is in the .conda format, as this is missing atm, especially as we increasingly rely on the format and depend on a few tools to manage reading and writing it. Currently the info we have that could be used for this CEP is:

  • https://docs.google.com/document/d/1HGKsbg_j69rKXPihhpCb1kNQSE8Iy3yOsUU2x68x8uw
  • https://docs.conda.io/projects/conda/en/stable/user-guide/concepts/packages.html#conda-file-format
  • https://www.anaconda.com/blog/understanding-and-improving-condas-performance

Would be good to pull this together to provide a single point of truth.

Independently, there are some things we might want to consider when amending the specification, like generating/reusing a Zstandard dictionary for faster and more compact compression/decompression, and having per-file-format dictionaries (text files may benefit a lot from this, for example).

jakirkham avatar Nov 18 '22 22:11 jakirkham

It'd be nice to also get this page updated: https://docs.conda.io/projects/conda-build/en/latest/resources/package-spec.html

leofang avatar Nov 20 '22 02:11 leofang

Would suggest raising a new conda-build doc issue

jakirkham avatar Nov 21 '22 20:11 jakirkham

So .conda packages are ZIP-format containers with a metadata.json file containing just the version number, and then an info and pkg file that are always .tar.zst even though some earlier documentation hoped to support "any libarchive filter". The order of metadata, info and pkg inside the ZIP does not matter.

Put together, the pkg- and info- tarballs have exactly the same contents as old-format .tar.bz2 conda packages. Generally the info/ subdirectory of a .tar.bz2 package goes into the info- tarball of a .conda.
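A rough sketch of the container layout described above, using only the standard library. It looks members up by name rather than position, since the order does not matter. The `info-<stem>.tar.zst` / `pkg-<stem>.tar.zst` naming, the example stem, and the version value `2` are assumptions for illustration, and `describe_conda_container` and `make_fake_conda` are hypothetical helpers, not part of any conda tool. The inner payloads are placeholders; nothing here decompresses Zstandard.

```python
import io
import json
import zipfile


def describe_conda_container(fileobj):
    """Return (format_version, info_member, pkg_member) for a .conda ZIP."""
    with zipfile.ZipFile(fileobj) as zf:
        names = zf.namelist()
        # Member order is not significant, so find members by name.
        metadata = json.loads(zf.read("metadata.json"))
        info = [n for n in names if n.startswith("info-") and n.endswith(".tar.zst")]
        pkg = [n for n in names if n.startswith("pkg-") and n.endswith(".tar.zst")]
    if len(info) != 1 or len(pkg) != 1:
        raise ValueError(f"expected one info- and one pkg- member, got {names}")
    return metadata.get("conda_pkg_format_version"), info[0], pkg[0]


def make_fake_conda(stem="example-1.0-0"):
    """Build a stand-in .conda in memory; inner payloads are placeholders."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        # Deliberately written in an unusual order.
        zf.writestr(f"pkg-{stem}.tar.zst", b"placeholder")
        zf.writestr("metadata.json", json.dumps({"conda_pkg_format_version": 2}))
        zf.writestr(f"info-{stem}.tar.zst", b"placeholder")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(describe_conda_container(make_fake_conda()))
```

A real reader would then pass the two inner members through a Zstandard decompressor and a tar reader, which this sketch leaves out.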

conda-package-handling uses a list of regular expressions to determine which files go into info/, but this list excludes some files that obviously belong in info/ - for example info/LICENSE vs info/LICENSE.txt. We should audit the existing packages to see whether we can drop this behavior and simply include info/ wholesale. Do packages include significant application data in info/ (besides test data, which is already intentionally in info/)?

A regular conda install unpacks both inner .tar.zst archives and does not use the "easy to inspect just the metadata" feature provided by the info/pkg split. The format is still a win, because zst is much, much faster to extract than bz2.

We might want to standardize whether info- or pkg- gets extracted first, or enforce that one cannot overwrite the other (that no filename appears in both inner tarballs).
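The "no filename appears in both inner tarballs" rule could be checked at extraction time along these lines. This is only a sketch: it assumes the two inner archives have already been decompressed to plain tar bytes (the Zstandard step is omitted), and `member_names`, `find_clobbers`, and `make_tar` are hypothetical names.

```python
import io
import tarfile


def member_names(tar_bytes):
    """Names of non-directory entries in an uncompressed tar stream."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        return {m.name for m in tf.getmembers() if not m.isdir()}


def find_clobbers(info_tar, pkg_tar):
    """Return the sorted list of paths present in both archives."""
    return sorted(member_names(info_tar) & member_names(pkg_tar))


def make_tar(files):
    """Build an uncompressed tar in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in files.items():
            entry = tarfile.TarInfo(name)
            entry.size = len(data)
            tf.addfile(entry, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    info_tar = make_tar({"info/index.json": b"{}", "info/LICENSE.txt": b"a"})
    pkg_tar = make_tar({"bin/tool": b"b", "info/LICENSE.txt": b"c"})
    print(find_clobbers(info_tar, pkg_tar))
```

An extractor could refuse to proceed when the returned list is non-empty, which makes the extraction order irrelevant.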

Separate from the .conda container is the shared question of what the metadata looks like. This probably has to be a different, longer document.

dholth avatar Nov 22 '22 22:11 dholth

Forget where this was discussed atm, but recall one point of confusion was whether conda_pkg_format_version should be an int or a str. Would be nice to resolve this as part of this work
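Until the int-vs-str question is settled, a reader could accept both spellings and normalize to an int. A minimal sketch, assuming the key lives at the top level of metadata.json; `read_format_version` is a hypothetical helper and the accepted forms are a guess at what a tolerant reader might allow, not a statement of what the spec requires.

```python
import json


def read_format_version(metadata_text):
    """Parse conda_pkg_format_version, accepting an int or a digit string."""
    value = json.loads(metadata_text)["conda_pkg_format_version"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("format version must not be a boolean")
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        text = value.strip()
        if text.isascii() and text.isdigit():
            return int(text)
    raise TypeError(f"unsupported format version: {value!r}")


if __name__ == "__main__":
    print(read_format_version('{"conda_pkg_format_version": 2}'))
    print(read_format_version('{"conda_pkg_format_version": "2"}'))
```

A writer, by contrast, would presumably emit only whichever type the CEP ends up choosing.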

jakirkham avatar Feb 15 '23 18:02 jakirkham

> We might want to standardize whether info- or pkg- gets extracted first, or enforce that one cannot overwrite the other (that no filename appears in both inner tarballs).

Yea, clobbered files in info/ (i.e. package files overwriting conda metadata) should be prevented with an error by conda-build (and the like) before the artifact is generated.

jaimergp avatar May 04 '23 11:05 jaimergp

I don't think the normal way of creating .conda can create clobbered files. It takes a list of filenames and categorizes them into two groups. The check would need to be on extraction.
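To illustrate why a split of this kind cannot produce duplicates: each filename lands in exactly one group. The sketch below uses the simpler "include info/ wholesale" rule floated earlier in the thread, not the regular-expression list that conda-package-handling actually uses; `split_members` is a hypothetical name.

```python
def split_members(filenames):
    """Partition paths into (info, pkg) groups by the info/ prefix."""
    info, pkg = [], []
    for name in filenames:
        # Each name goes to exactly one list, so the groups are disjoint.
        (info if name.startswith("info/") else pkg).append(name)
    return info, pkg


if __name__ == "__main__":
    names = ["info/index.json", "bin/tool", "info/LICENSE.txt", "information.txt"]
    print(split_members(names))
```

Archives produced some other way carry no such guarantee, which is why the check would need to happen on extraction.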

dholth avatar May 04 '23 13:05 dholth

No, but conda-build can infer which files have gotten into info/ and flag those that would result in a clobber error, I think?

jaimergp avatar May 04 '23 13:05 jaimergp