conda-forge.github.io

RFC: conda-forge epochs for solver accuracy, speed & debuggability?

Open h-vetinari opened this issue 2 years ago • 21 comments

Conda's and mamba's solvers take into account every package ever published when trying to resolve an environment (with some accelerations, e.g. checking first whether things are resolvable with current_repodata.json).

This can sometimes lead the solver astray and force it into very weird contortions, where very old packages are picked just because they seemingly satisfy the constraints (though realistically, this is almost always an error in our metadata). There are many examples of this; here are a few that came up recently:

  • #1528
  • #1597
  • https://github.com/pytorch/vision/issues/4665

While this definitely also has some advantages (fewer rebuilds, old packages stay installable), it can also run into inevitable problems where old packages haven't been rebuilt for modern dependencies (e.g. no run-exports), aren't aware of ABI breaks that were unknown at the time, noarch vs. yesarch, etc.

So it would be nice to give users a way to enforce an option that says "I only want comparatively recent packages" or, in other words, "please don't do unexpected/unintended/crazy things while trying to resolve my environment".

I was thinking about how this could be done in a way that wouldn't require constant rebuilds (i.e. say, if a "conda-forge epoch" were to be defined as equal to a calendar year, nothing would be installable in January until all common packages have been rebuilt).

My current idea looks as follows:

  • There's an empty metapackage __conda-forge-epoch that gets built every day (or week, or month), and versioned accordingly, i.e. 2022.12.19.
  • All outputs gain an automatic run-constraint
    run_constrained:
       {% set epoch = datetime.date.today().strftime('%Y.%m.%d') %}
       - __conda-forge-epoch <={{ epoch }}
    
    • note the <=, which is the other way around from e.g. our usual run-exports.
    • implementing this (without having to modify every recipe) probably needs support from conda-build, but for now I'm assuming this is possible.
  • By default, __conda-forge-epoch does not get installed, and therefore the constraints don't get triggered.
    • This also means we wouldn't have to rebuild stuff more often than we already do, as the proposed default is effectively the same as the status quo.
    • In other words, there are no hard "epoch breaks" (like we had once upon a time for going from the old compilers to the new ones).
  • If a user wants to avoid certain solver errors, or simply enforce recent builds, they can add __conda-forge-epoch>=yyyy.mm.dd to their environment specs (now we have the >=). This would force the solver to only take into account packages built on or after that date.
  • Perhaps even more importantly, it would allow users (& conda-forge members) to more easily debug solver errors, by forcing the solver to only consider a more recent subset of packages, without getting lost in the weeds of the past.
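
To make the interplay of the `<=` on the package side and the `>=` on the user side concrete, here is a minimal sketch of the intended semantics. It is not solver code: the function names are hypothetical, and it assumes that a build of the `__conda-forge-epoch` metapackage exists for every date between the requested minimum and the package's build date.

```python
from datetime import date
from typing import Optional


def parse_epoch(version: str) -> date:
    """Parse a 'YYYY.MM.DD' epoch version string into a date."""
    year, month, day = (int(part) for part in version.split("."))
    return date(year, month, day)


def is_admissible(build_epoch: str, requested_min_epoch: Optional[str]) -> bool:
    """Decide whether a package build survives the epoch constraint.

    build_epoch: the date baked into the package's run_constrained entry,
        i.e. `__conda-forge-epoch <=build_epoch`.
    requested_min_epoch: the user's `__conda-forge-epoch >=...` spec,
        or None if the user did not ask for the metapackage at all.
    """
    if requested_min_epoch is None:
        # The metapackage is not installed, so run_constrained never triggers.
        return True
    # Some epoch version v must satisfy requested_min <= v <= build_epoch.
    return parse_epoch(requested_min_epoch) <= parse_epoch(build_epoch)
```

Under these assumptions, a build from 2021.11.30 is excluded once the user asks for `__conda-forge-epoch>=2022.01.01`, while the same build stays installable when no epoch is requested.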

I think just the debugging capabilities of this would make this worth considering, but maybe I'm just not very good at debugging resolver errors. 😅

Would be interested to hear people's thoughts.

h-vetinari, Dec 19 '22 10:12

Rather than having a metapackage I think this could be an install flag as the build timestamps are already included in the repodata.
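
As a rough illustration of what such a flag could do with the existing timestamps, the sketch below drops every record built before a cutoff from an already-loaded repodata dictionary. The function names are hypothetical, and the seconds-versus-milliseconds threshold is an assumption about how older records were written.

```python
from datetime import datetime, timezone

# Assumption: timestamps larger than this (year 9999 expressed in seconds)
# are stored in milliseconds; smaller values are treated as seconds.
_MAX_SECONDS = 253402300799


def build_time_seconds(record: dict) -> float:
    """Return the record's build time in seconds (0 if it has no timestamp)."""
    ts = record.get("timestamp", 0)
    return ts / 1000 if ts > _MAX_SECONDS else ts


def drop_builds_before(repodata: dict, cutoff: datetime) -> dict:
    """Return a copy of repodata without records built before `cutoff`."""
    cutoff_s = cutoff.timestamp()
    filtered = dict(repodata)  # shallow copy; other keys are left untouched
    for key in ("packages", "packages.conda"):
        filtered[key] = {
            filename: record
            for filename, record in repodata.get(key, {}).items()
            if build_time_seconds(record) >= cutoff_s
        }
    return filtered
```

Records without a timestamp are treated as infinitely old here, which is a design choice of this sketch rather than anything the thread settles.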

chrisburr avatar Dec 19 '22 11:12 chrisburr

I wrote this down in my "plugin ideas" notepad, which seems to be related:

Fetch the unpatched repodata, remove packages that have been uploaded after given date, and apply patches existing in the relevant git repo at given date. The whole thing could be added as a conda subcommand, like:

$ conda fetch-repodata -c channel \
[--patched | --unpatched] \
[--date YYYY-MM-DD-HH-MM-SS UTC] \
[--output repodata.json] \
[--subdirs all|linux-64,noarch]

Or even as a create flag:

$ conda create -c conda-channel [--channels-as-of YYYY-MM-DD-HH-MM-SS UTC]

jaimergp, Dec 21 '22 11:12

(You might have realised, but to be explicit:) I think the idea of @h-vetinari is the opposite: any packages published before a given time should be ignored. I can see the utility in being able to do both.

chrisburr, Dec 21 '22 13:12

I think it is important to not ignore packages that didn't need rebuilding for a long time.

So say I want to ignore packages older than Jan 1st, 2022. If the latest build of a package is Nov 2021, I may want to include it in the solve.
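
One way to read this requirement is "apply the cutoff, but always keep the newest build of each package name". The sketch below shows that variant; the function name is hypothetical and it assumes every record carries a `name` and a millisecond `timestamp`.

```python
def filter_keep_latest(records: dict, cutoff_ms: int) -> dict:
    """Drop builds older than cutoff_ms, except the newest build per name."""
    # First pass: remember the filename of the newest build for each name.
    newest = {}
    for filename, record in records.items():
        name = record["name"]
        current = newest.get(name)
        if current is None or record.get("timestamp", 0) > records[current].get("timestamp", 0):
            newest[name] = filename

    # Second pass: keep recent builds plus each package's newest build.
    return {
        filename: record
        for filename, record in records.items()
        if record.get("timestamp", 0) >= cutoff_ms or newest[record["name"]] == filename
    }
```

Note that this only looks at build time. It does not check whether the surviving old build is actually compatible with its newer neighbours.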

hmaarrfk, Dec 21 '22 13:12

In that case I think the lack of a package would be clear enough to point the user to use an older date. Trying to be clever would reduce the debuggability benefits and probably has no correct solution.

chrisburr, Dec 21 '22 13:12

@ericdill and/or @jakirkham do we have some Google docs notes from AnacondaCon2018 where we discussed this? Looks like the same points are being raised and thought over here and maybe that doc could help.

For some context: we did consider this a long time ago. What came out of that discussion is the current repodata and the metachannel. The former is implemented and working in conda. The latter was abandoned because of mamba's faster solver. However, I can imagine that we could have use for it in the future, especially in CI.

PS: I also recall that @chenghlee considered something along those lines for defaults. Not sure if that evolved into something or not.

ocefpaf, Dec 21 '22 17:12

If we do, it might be in a hackmd or maybe in the conda-forge Google Drive? If it's anywhere else I'm not sure I'd be able to track it down.

ericdill, Dec 21 '22 17:12

@CJ-Wright do you remember anything about this meeting or accompanying notes?

ericdill, Dec 21 '22 18:12

> I think it is important to not ignore packages that didn't need rebuilding for a long time.

This is in fact explicitly what I'd like to be able to do (not by default of course). Packages that haven't been rebuilt in a while are often subtly incompatible (compare the recent libxml2 issues), and figuring out which feedstocks among a given set of dependencies haven't been rebuilt in a while is a useful tool for chasing down resolver errors.
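
The debugging use case could look roughly like the following sketch, which reports which of a given set of dependency names have no build newer than a cutoff. The function name is hypothetical, and the records are assumed to carry a `name` and a millisecond `timestamp`.

```python
def stale_packages(records: dict, names: set, cutoff_ms: int) -> list:
    """Return the names from `names` whose newest build predates cutoff_ms.

    Names that do not appear in `records` at all are not reported.
    """
    newest_build = {}
    for record in records.values():
        name = record["name"]
        if name in names:
            newest_build[name] = max(newest_build.get(name, 0), record.get("timestamp", 0))
    return sorted(name for name, ts in newest_build.items() if ts < cutoff_ms)
```

The resulting list would be a starting point for checking which feedstocks might need a rerender or rebuild.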

h-vetinari, Dec 21 '22 23:12

See also the various CFEP ideas on pinning epochs and LTS. The pinning epochs in particular would solve a variant of the issues raised here but is tied to major changes in our global build pins.

beckermr, Jan 04 '23 14:01

As I recall from the last time we discussed something like this (believe a previous AnacondaCON as Filipe suggested), one concern raised is that any package that is very infrequently rebuilt could end up getting stripped out, which would result in unsolvable environments. This is the same concern as Mark's comment above. Not sure how we would fix this, but it is something we should be aware of.

My recollection is similar to Filipe's. Namely, the underlying issues were addressed by adding repodata patching, the CDN, solver improvements made by mamba (which conda is adopting), and conda-metachannel. conda-metachannel was dropped, though maybe it could be baked into repos.

Would be interested to know why these don't work for the issues above.

jakirkham, Jan 04 '23 18:01

> one concern raised is that any package that is very infrequently rebuilt could end up getting stripped out, which would result in unsolvable environments. This is the same concern as Mark's comment above. Not sure how we would fix this, but it is something we should be aware of.

That's a feature, not a bug (though again, not intended to be on by default; rather for debugging purposes). If you try to solve with packages (say) <12 months old, and you get <infrequently_rebuilt_pkg> not found, you could go check out that feedstock and rerender it if necessary.

> Would be interested to know why these don't work for the issues above.

In a nutshell: most packages have grown constraints over time (conda-forge becoming more consistent on run-exports, virtual packages, MACOS_DEPLOYMENT_TARGET, etc.), and where there's a conflict, the solver might sometimes reach back far in time, where things are seemingly solvable due to the lack of those newer constraints.

It's almost always erroneous metadata (but who wants to write metadata patches for packages >3 years old...?), and so it would be much simpler to just ignore very old builds as an option on the solver side.

h-vetinari, Jan 04 '23 21:01

We can implement this system using repodata patches entirely FWIW since the timestamps are in the repodata and we can add constrains entries.

beckermr, Jan 04 '23 22:01

> We can implement this system using repodata patches entirely FWIW since the timestamps are in the repodata and we can add constrains entries.

I don't understand how that would work (repodata patches are static and not optional, whereas what I'm proposing allows using a user-defined timestamp as cut-off, and is opt-in). Could you explain in a bit more detail?

h-vetinari, Jan 04 '23 22:01

We just add constrains like you said above, using the timestamp in the package (when it was built). We rebuild our repodata patches once a week right now since not all of them are static. So we'd get updates or packages no more than a week old which is basically fine.

beckermr, Jan 04 '23 23:01

We have to make the meta package too of course. We don't need changes to conda build though.
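
A minimal sketch of what such a patch generator might look like is below. The layout of the returned instructions dictionary and the function names are assumptions made for illustration, not the actual conda-forge patching code; only the `__conda-forge-epoch` name and the `<=` constraint come from the proposal above. Timestamps are assumed to be in milliseconds.

```python
from datetime import datetime, timezone

EPOCH_PKG = "__conda-forge-epoch"  # metapackage name from the proposal


def epoch_constraint(timestamp_ms: int) -> str:
    """Build the constrains entry for a package built at timestamp_ms (UTC)."""
    built = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{EPOCH_PKG} <={built:%Y.%m.%d}"


def gen_epoch_patches(repodata: dict) -> dict:
    """Return patch instructions adding an epoch constraint to every record."""
    # Assumed instruction layout; adjust to whatever the patch tooling expects.
    instructions = {
        "patch_instructions_version": 1,
        "packages": {},
        "packages.conda": {},
        "revoke": [],
        "remove": [],
    }
    for key in ("packages", "packages.conda"):
        for filename, record in repodata.get(key, {}).items():
            timestamp = record.get("timestamp")
            if timestamp is None:
                continue  # nothing to derive an epoch from
            # Keep existing constrains, replacing any earlier epoch entry.
            constrains = [
                entry
                for entry in record.get("constrains", [])
                if not entry.startswith(EPOCH_PKG + " ")
            ]
            constrains.append(epoch_constraint(timestamp))
            instructions[key][filename] = {"constrains": constrains}
    return instructions
```

Because the constraint is derived purely from the stored timestamp, regenerating the patches is idempotent, which fits the weekly rebuild described above.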

beckermr, Jan 04 '23 23:01

We could have different repodata filenames, right? e.g. repodata-from-202210.json or repodata-until-202210.json, and then it can be chosen with --repodata-fn. This is more of an Anaconda.org / conda-index feature though.
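
As a sketch of how such views might be derived from the full repodata: the code below splits records at the start of the given month (UTC). Whether "until-202210" should include October itself is not specified here, so the boundary is an assumption, as are the function names and the millisecond timestamps.

```python
from datetime import datetime, timezone


def month_start_ms(tag: str) -> int:
    """Convert a 'YYYYMM' tag into the UTC start of that month, in ms."""
    start = datetime(int(tag[:4]), int(tag[4:6]), 1, tzinfo=timezone.utc)
    return int(start.timestamp() * 1000)


def split_views(repodata: dict, tag: str) -> dict:
    """Return {filename: filtered repodata} for the 'from' and 'until' views."""
    cutoff_ms = month_start_ms(tag)
    selectors = (
        ("from", lambda ts: ts >= cutoff_ms),   # built in or after that month
        ("until", lambda ts: ts < cutoff_ms),   # built before that month
    )
    views = {}
    for label, keep in selectors:
        view = dict(repodata)  # shallow copy; non-package keys are shared
        for key in ("packages", "packages.conda"):
            view[key] = {
                filename: record
                for filename, record in repodata.get(key, {}).items()
                if keep(record.get("timestamp", 0))
            }
        views[f"repodata-{label}-{tag}.json"] = view
    return views
```

Each returned dictionary could then be serialized with `json.dump` under its key as the filename.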

jaimergp, Jan 05 '23 15:01

Yeah this would go beyond and actually cache historical repodata, which we do not do. We really need to separate the concerns/ideas here.

beckermr, Jan 05 '23 15:01

Or on the client side, either:

  • via a conda plugin implementing different index filtering constraints
  • by augmenting the MatchSpec syntax for timestamp operators (and not just string globbing), so you can do *[timestamp>=123456789].

(roughly)
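
To illustrate the second option, here is a toy parser and matcher for a spec of the form `name-glob[timestamp<op>value]`. This is a hypothetical syntax sketch only; it does not use or reflect conda's actual MatchSpec implementation, and all names in it are made up for the example.

```python
import fnmatch
import operator
import re

# Comparison operators supported by this toy syntax.
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}

# Matches e.g. "*[timestamp>=123456789]" or "numpy[timestamp<42]".
_SPEC_RE = re.compile(
    r"^(?P<name>[^\[\]]+)\[timestamp(?P<op>>=|<=|==|>|<)(?P<value>\d+)\]$"
)


def parse_timestamp_spec(spec: str):
    """Split a spec into (name glob, comparison function, integer value)."""
    match = _SPEC_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"not a timestamp spec: {spec!r}")
    return match["name"], _OPS[match["op"]], int(match["value"])


def matches(spec: str, record: dict) -> bool:
    """Check a repodata-like record against the toy timestamp spec."""
    name_glob, compare, value = parse_timestamp_spec(spec)
    return fnmatchcase_name(record["name"], name_glob) and compare(
        record.get("timestamp", 0), value
    )


def fnmatchcase_name(name: str, pattern: str) -> bool:
    """Case-sensitive glob match on the package name."""
    return fnmatch.fnmatchcase(name, pattern)
```

With this sketch, `matches("*[timestamp>=123456789]", record)` is true for any record whose timestamp is at least 123456789, regardless of its name.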

jaimergp, Jan 05 '23 15:01

> Yeah this would go beyond and actually cache historical repodata

I don't intend those to be historical artifacts, but views of the full repodata, filtered by timestamp in different ways.

jaimergp, Jan 05 '23 15:01

Isn't it already possible for someone to download our patched repodata package and use that (if they wish)?

In terms of a metapackage could we just use the repodata patch package? Or do we prefer this to be something else?

jakirkham, Jan 05 '23 19:01