More size-efficient pixi.lock (with care to optimize for the git packfile).

Open jleibs opened this issue 1 year ago • 7 comments

Problem description

The pixi.lock file in our repository is now over 1 megabyte (https://github.com/rerun-io/rerun/blob/main/pixi.lock).

It still compresses reasonably well in git object storage within the packfile (taking up about 3MB of storage across history), but it is fast becoming a meaningful contributor to repository growth.

This is a tricky one to do something about, as we ultimately care more about the contribution to the delta-compressed packfile than about the size of the file on disk. Compression strategies that make the file smaller in a single checkout but harder to delta-compress would still be a net negative.

jleibs avatar Jun 14 '24 14:06 jleibs

Do you have an idea on how we could change the format to achieve this?

I think one of the biggest issues is the presence of SHA hashes, because those compress terribly. We tried to minimize the places where they occur for that reason.
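
A rough way to see the effect is sketched below. It is only an illustration under stated assumptions: the package names are made up, and plain zlib stands in for whatever git actually does when building a packfile, so the numbers say nothing precise about a real pixi.lock.

```python
import hashlib
import zlib


def compressed_ratio(text: str) -> float:
    """Return compressed size divided by raw size, using zlib at level 9."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


# Hypothetical package names, used only to generate sample lines.
names = [f"package-{i}" for i in range(500)]

# Lines that share a long URL prefix: highly redundant text.
url_lines = "\n".join(
    f"- conda: https://conda.anaconda.org/conda-forge/linux-64/{n}-1.0-0.conda"
    for n in names
)

# Lines that each carry a SHA-256 digest: 64 hex characters with no repetition.
hash_lines = "\n".join(
    f"  sha256: {hashlib.sha256(n.encode()).hexdigest()}" for n in names
)

url_ratio = compressed_ratio(url_lines)
hash_ratio = compressed_ratio(hash_lines)
print(f"url lines:  {url_ratio:.2f}")
print(f"hash lines: {hash_ratio:.2f}")
```

The hash lines should come out with a much higher ratio than the URL lines, since each digest is effectively random and cannot be shared between entries.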

baszalmstra avatar Jun 14 '24 14:06 baszalmstra

Not off the top of my head -- I definitely acknowledge it's a hard problem, so I was at least somewhat relieved to find that it still compressed reasonably well in the packfile.

Looking at the file itself, it seems there is still a lot of metadata that is redundant with the package-management metadata from conda/PyPI.

From an information theory perspective, does the lockfile need to have more than a table of (feature, platform, package-name, version-number)?

All the information about the kind of package, where to find it, its own transitive dependencies, etc. seems like it could be re-computed from the pixi.toml file again.

Maybe there need to be two files here? One is a strictly minimal .lock that can be included in the repository, while the other is a materialized file that can be cached somewhere.
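
To make the idea concrete, here is a minimal sketch of what such a stripped-down table could look like. Everything in it is hypothetical: `LockEntry`, `render_minimal_lock`, `parse_minimal_lock`, the tab-separated layout and the sample packages are all invented for illustration and are not part of pixi.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class LockEntry:
    """One row of the hypothetical minimal lock table."""
    feature: str
    platform: str
    name: str
    version: str


def render_minimal_lock(entries):
    """Render entries as sorted, tab-separated lines so diffs stay small and stable."""
    rows = sorted(set(entries))
    return "\n".join(
        "\t".join((e.feature, e.platform, e.name, e.version)) for e in rows
    ) + "\n"


def parse_minimal_lock(text):
    """Inverse of render_minimal_lock: one LockEntry per non-empty line."""
    return [LockEntry(*line.split("\t")) for line in text.splitlines() if line]


# Made-up sample data.
entries = [
    LockEntry("default", "linux-64", "python", "3.12.3"),
    LockEntry("default", "linux-64", "numpy", "1.26.4"),
    LockEntry("default", "osx-arm64", "python", "3.12.3"),
]
text = render_minimal_lock(entries)
print(text, end="")
```

Sorting the rows keeps the output deterministic, so a version bump touches a single line, which is the kind of change that should delta-compress well.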

jleibs avatar Jun 14 '24 15:06 jleibs

There has been some talk in the Python packaging space along these lines, where the lock file would still need to be rendered into a suitable format. Could be food for thought :)

tdejager avatar Jun 18 '24 11:06 tdejager

We are facing much the same issue, as we intend to use pixi to manage our CI/CD.

> From an information theory perspective does the lockfile need to have more than a table of: (feature, platform, package-name, version-number)

In scikit-learn, for the workflow where we would like to use pixi, we currently store conda-lock files (e.g. https://github.com/scikit-learn/scikit-learn/blob/main/build_tools/azure/pylatest_conda_forge_mkl_osx-64_conda.lock) that contain less meta-information.

In skrub, we went with pixi, and the automatic weekly update of some dependencies brings a significant number of changes.

So I think the "environment" part of the pixi.lock plus the checksum would be almost enough in this particular CI setting.

I can clearly understand that in some other cases you might need more information to get a fully reproducible and secure environment.

Maybe there is a way to set the cursor depending on the underlying use case.

> seem like they could be re-computed from the pixi.toml file again

If this is possible, it could be one of the trade-offs: less information is stored, at the small cost of recomputing some of it when the CI is triggered.

glemaitre avatar Jun 27 '24 10:06 glemaitre

As discussed on Discord, I'm also curious what we can gain if we deduplicate some of the common prefixes.

tdejager avatar Jun 28 '24 07:06 tdejager

> Not off the top of my head -- I definitely acknowledge it's a hard problem, so I was at least somewhat relieved to find that it still compressed reasonably well in the packfile.

BTW @jleibs, which commands are you using to measure the size in the packfile?

tdejager avatar Jul 05 '24 13:07 tdejager

Is there any progress on this? I see the patch removing md5 was merged into rattler and then reverted.

Another improvement would be to avoid the repetition of the full URLs for each package. Instead of

    packages:
      linux-64:
      - conda: https://conda.anaconda.org/conda-forge/linux-64/_openmp_mutex-4.5-6_kmp_llvm.conda
      - conda: https://conda.anaconda.org/conda-forge/noarch/_python_abi3_support-1.0-hd8ed1ab_2.conda
      - conda: https://conda.anaconda.org/conda-forge/noarch/adwaita-icon-theme-49.0-unix_0.conda
      - conda: https://conda.anaconda.org/conda-forge/noarch/aiobotocore-2.25.2-pyhcf101f3_0.conda

why not the following? (The base URL is already specified in the channel.)

    packages:
      linux-64:
      - conda:
        - conda-forge: 
          - linux-64/_openmp_mutex-4.5-6_kmp_llvm.conda
          - noarch/_python_abi3_support-1.0-hd8ed1ab_2.conda
          - noarch/adwaita-icon-theme-49.0-unix_0.conda
          - noarch/aiobotocore-2.25.2-pyhcf101f3_0.conda

(I know repetitive text compresses fairly well, but it is still redundant information.)
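
As a rough sketch of the grouping step, assuming the lock writer knows each channel's base URL: the helper `group_by_channel` and the `channels` mapping below are hypothetical names made up for this example, not anything from pixi or rattler.

```python
from collections import defaultdict


def group_by_channel(urls, channels):
    """Group package URLs under their channel name, keeping only 'subdir/filename'.

    `channels` maps a channel name to its base URL. URLs that match no
    known channel are kept in full under the key None.
    """
    grouped = defaultdict(list)
    for url in urls:
        for name, base in channels.items():
            prefix = base.rstrip("/") + "/"
            if url.startswith(prefix):
                grouped[name].append(url[len(prefix):])
                break
        else:
            # No channel matched: keep the full URL so nothing is lost.
            grouped[None].append(url)
    return dict(grouped)


channels = {"conda-forge": "https://conda.anaconda.org/conda-forge"}
urls = [
    "https://conda.anaconda.org/conda-forge/linux-64/_openmp_mutex-4.5-6_kmp_llvm.conda",
    "https://conda.anaconda.org/conda-forge/noarch/_python_abi3_support-1.0-hd8ed1ab_2.conda",
]
grouped = group_by_channel(urls, channels)
print(grouped)
```

Keeping unmatched URLs in full means the transformation stays reversible: the original URL is always the channel base plus the stored suffix, or the stored value itself.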

I also still don't understand why all the package metadata (depends, license, purls, size, ...) needs to be in the lock file.

samtygier-stfc avatar Nov 13 '25 11:11 samtygier-stfc