Brainstorming aliases registry or associated tool

Open vsoch opened this issue 2 years ago • 17 comments

From discussion in #557

It would be fun to brainstorm an idea that I’ve been toying with - I really like this idea of providing container transparency (wrt) aliases - it was my initial vision for the scientific filesystem (building the aliases into the entry point) and the interactions here. What I’ve been chewing on is how to move these aliases outside of shpc, meaning making them possible for any container technology to find and use. It could mean something like:

  • A custom container build that allows for customization or automated detection and writing of aliases

  • Saving into some simple metadata file
  • Using Oras OCI registry as storage to upload as a related artifact
  • A custom tool that can find the container and alias file, and “enable” it (and I’m not sure what that means yet)

At the most complex level you could imagine a registry just for aliases, and a tool like shpc would ping it to look up containers. And the simplest level is a command line tool that tries to bring them together in some context.
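To make the "simple metadata file" plus "enable" idea a bit more concrete, here is a minimal sketch of one possible reading. Everything in it is an assumption for illustration: the metadata layout (an `image` key and an `aliases` mapping), the `samtools.sif` image name, and the `shell_aliases` helper are hypothetical, and "enable" is interpreted as nothing more than printing shell `alias` lines that wrap `singularity exec`.

```python
import json
import shlex


def shell_aliases(metadata):
    """Turn a (hypothetical) alias metadata document into shell alias lines."""
    image = metadata["image"]
    lines = []
    for name, path in sorted(metadata["aliases"].items()):
        # Quote each part so unusual image names or paths stay intact
        command = " ".join(
            shlex.quote(part) for part in ("singularity", "exec", image, path)
        )
        lines.append(f"alias {name}={shlex.quote(command)}")
    return lines


# Hypothetical metadata file content; the layout is an assumption
example = json.loads(
    '{"image": "samtools.sif", "aliases": {"samtools": "/usr/local/bin/samtools"}}'
)
print("\n".join(shell_aliases(example)))
# alias samtools='singularity exec samtools.sif /usr/local/bin/samtools'
```

A registry-backed version would only change where the metadata comes from; the step that turns it into something usable could stay this small.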

vsoch avatar Jun 28 '22 05:06 vsoch

Haven't forgotten, will get here! :-)

marcodelapierre avatar Jul 21 '22 01:07 marcodelapierre

I really like this idea of providing container transparency (wrt) aliases

Finally getting to write something here!

So...this is actually the original reason why I ended up on Singularity HPC!! :-D Of course, I get what you mean, with SHPC we need to manually write down the aliases for each recipe; if this could be automated, or provided elsewhere, then most maintenance of SHPC could be automated (just pair it with an automatic scanning of tags in container repos, which it already does for registered ones).

A couple of items in the space of sourcing aliases for packages.

  1. For us at Pawsey the largest chunk of containers (in terms of numbers) is bioinformatics, for which there is the Biocontainers project, which is a layer on top of Bioconda. So I went...wouldn't it be nice if the Bioconda/Biocontainers folks had a requirement where all contributed packages also need to provide a manifest (maybe not the best name) of aliases for that package? That could then be used in many ways, not only for SHPC/containers, but really also for characterising the package within Bioconda or other online package registries. Also note that Bioconda/Biocontainers is backed inside Elixir, the European initiative for Bioinformatics. That was last year: I had a shot at pinging the Bioconda folks on this idea, but they didn't seem much interested.

  2. From what I wrote, you can tell that I like your idea of having a registry of aliases, because at that point it can be used by multiple services. However...

  3. Is there a way to automate the process of generating alias lists for containers? I think that would be the cool enabler: it would allow a rapid, significant growth in scope for the alias registry, and indirectly for SHPC, too. I have neither ideas nor capacity to investigate such a tool at the moment... Last week I found this on Twitter, though, and I was wondering whether there is any overlap with this automation idea: https://github.com/anchore/syft

That's it for now!

marcodelapierre avatar Aug 25 '22 08:08 marcodelapierre

I have seen syft! It wouldn't work for shpc to run it, but perhaps there could be some kind of action that generates recipes for the user based on input metadata... going to give this a shot soon :)

vsoch avatar Aug 25 '22 23:08 vsoch

okay I tried out syft! It looks like it's only going to search for package managers, so (in other words) it will miss a large majority of custom executables. But we have progress! Per #568 I created a new singularityhub/guts set of actions, and the first action loads a container, dumps the filesystem and configs, and then searches paths for binaries. This was @alecbcs's brilliant idea! And then we can create a set of base container "manifests" with these executables to create a diff against! E.g., start with an ubuntu base, and add some stuff - subtract the ubuntu base to get the stuff. I haven't done the latter, but here are the guts for the bases: https://github.com/singularityhub/shpc-guts (that repository uses the action: https://github.com/singularityhub/shpc-guts/blob/main/.github/workflows/generate.yaml) :partying_face:
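The "search paths for binaries, then subtract the base" idea can be sketched roughly as follows. This is not the actual guts code: it assumes the container filesystem has already been dumped to a local directory, and the names `find_executables` and `diff_against_base` are hypothetical helpers made up for this illustration.

```python
import os


def find_executables(rootfs, path_dirs):
    """Map executable name -> in-container path for each PATH directory.

    rootfs is a local directory holding the dumped container filesystem;
    path_dirs are the container's PATH entries, e.g. ["/usr/local/bin"].
    """
    found = {}
    for directory in path_dirs:
        host_dir = os.path.join(rootfs, directory.lstrip("/"))
        if not os.path.isdir(host_dir):
            continue
        for name in sorted(os.listdir(host_dir)):
            host_path = os.path.join(host_dir, name)
            # Symlinks that do not resolve inside the dump are skipped here
            if os.path.isfile(host_path) and os.access(host_path, os.X_OK):
                # The first PATH entry wins, as in a shell lookup
                found.setdefault(name, os.path.join(directory, name))
    return found


def diff_against_base(container, base):
    """Keep only executables whose names do not appear in the base image."""
    return {name: path for name, path in container.items() if name not in base}
```

Running `find_executables` once on a dumped base image and once on the derived container, then passing both results to `diff_against_base`, leaves the "stuff" that was added on top of the base. Comparing by name only is a simplification: a binary that the derived container replaces with a different build of the same name would be dropped.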

vsoch avatar Aug 26 '22 04:08 vsoch

Wow cool! - I will need to have a look :-)

Question: when you say it searches for package managers..I guess this includes conda? That might be massive, as the Biocontainers collection (1000s of containers) are all built on top of Bioconda packages. Conda itself gets stripped out of the container, though, so not sure whether syft can still spring into action.

https://biocontainers.pro https://github.com/BioContainers/containers

marcodelapierre avatar Aug 26 '22 04:08 marcodelapierre

I don’t see conda directly in syft. https://github.com/anchore/syft#supported-ecosystems So I don’t think it’s going to work for our use case. It saw nothing I expected/wanted in spack containers! But the good thing about the guts action that I just made is that it is agnostic to package managers. Anything you put in a PATH will be seen! I need maybe the weekend to put together some more final examples of the metadata we will see - the base images are a bit noisy! If you want to give me some containers you are interested in when I do that I’ll make sure to generate for them and update here when I do.

vsoch avatar Aug 26 '22 04:08 vsoch

Thanks for the clarification! I think it would be great testing with a couple of the containers that are in the SHPC registry, in quay.io/biocontainers/. I think most of them have the same underlying OS, so your approach might scale very well there (although you always get some noise, related to package dependencies such as Perl)

marcodelapierre avatar Aug 26 '22 04:08 marcodelapierre

Gotcha - will add those to my mental list!

vsoch avatar Aug 26 '22 04:08 vsoch

okay - next step is complete - I refactored the action into a library proper https://singularityhub.github.io/guts/.

Will go back to the bases next, and generating a "diff" command for the guts library to derive special executables!

vsoch avatar Aug 27 '22 05:08 vsoch

This sounds amazing! I hope I find the time to give it a look and try soon, for feedback and contribs! :-)

marcodelapierre avatar Aug 27 '22 09:08 marcodelapierre

Forgot to share - here is a quick glimpse of what I found for biocontainers samtools:

[image: screenshot of the executables found for the biocontainers samtools container]

It seems to work pretty well? That top list should be unique executables (to that container) on a PATH.

vsoch avatar Aug 29 '22 22:08 vsoch

okay have a few examples - it's super simple but I think it works pretty well? Will try at an action / some integration with shpc to generate the module automatically.

https://github.com/singularityhub/guts/tree/main/examples/test_diff

I'm thinking maybe just an action on a registry where you can use a workflow to write a container, and then have it generate a PR with the container.yaml for you.

vsoch avatar Aug 29 '22 23:08 vsoch

That looks really promising, wow!

One thing to think about is assumptions/choices to handle binary list vs available versions...

  • overrides for each version may be an overshoot
  • oldest version might miss executables (as the case above, see the yaml currently in the shpc registry)
  • newest version may reduce back-compatibility too much
  • a sweet spot...?

Another one is having a blacklist internal to Guts, with executables never to be added: for instance from the above, conda, linux stuff such as tic/toe/tabs, binaries in /sbin
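A minimal sketch of what such a blacklist could look like, assuming aliases arrive as a name-to-path mapping. The skip list below only contains the examples named above; treating /usr/sbin the same as /sbin is an extra assumption, and `filter_aliases` is a hypothetical helper, not something Guts provides.

```python
import posixpath

# Names taken from the examples above; a real list would need curating
SKIP_NAMES = {"conda", "tic", "toe", "tabs"}
# /sbin is from the comment above; /usr/sbin is an added assumption
SKIP_DIRS = ("/sbin", "/usr/sbin")


def filter_aliases(aliases, skip_names=SKIP_NAMES, skip_dirs=SKIP_DIRS):
    """Drop executables that should never be exposed as aliases."""
    kept = {}
    for name, path in aliases.items():
        if name in skip_names:
            continue
        # Container paths are POSIX paths regardless of the host OS
        if posixpath.dirname(path) in skip_dirs:
            continue
        kept[name] = path
    return kept
```

Keeping the names and directories as arguments means a registry could override them per recipe instead of relying on one global list.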

I am taking some days off (I have overdone it with work), I will be back with some more comments.

marcodelapierre avatar Aug 30 '22 01:08 marcodelapierre

okay sounds good! I'll make a stupid simple action with shpc-registry to get us started and we can discuss further when you are back. I'm actually thinking that if we can automate this, maybe by default there should be an overrides file generated per version, so we always match versions and aliases exactly. And by default when we generate, we just add the latest version. The one caveat is updating - the updater isn't expecting to be writing new overrides files. That would require a bit of tweaking to our current updater, so it won't be a quick fix (but I think it's possible and would be a nice strategy to take!)

Going to have dinner, will report back if/when I have this first example!

vsoch avatar Aug 30 '22 01:08 vsoch

And I'm thinking of this in the scope of shpc, but I just reminded myself that the original idea, to be able to have a service that tells you a container's important binaries, is still cool / worth pursuing, if there is a good way to go about it!

vsoch avatar Aug 30 '22 01:08 vsoch

okay - it's a start! https://github.com/singularityhub/shpc-registry/pull/10 and screenshots tweet: https://twitter.com/vsoch/status/1564458755789467648

I still don't know the answer to the last question, I sort of know intuitively it's useful but I can't write down the compelling use case yet. It's just something that should be available for all containers, imho.

vsoch avatar Aug 30 '22 03:08 vsoch

As far as I could see, it'd be fine limiting all quay.io/biocontainers recipes to the content of /usr/local/bin/. I don't think there's anything in /usr/bin or /usr/sbin that's worth exposing

muffato avatar Aug 30 '22 13:08 muffato

Tried this on a biocontainer of ours: https://github.com/muffato/shpc-registry/pull/1/files Only goat-cli: /usr/local/bin/goat-cli really needs to be there. Everything else is essentially dependencies that can be skipped. @vsoch: I guess there will always have to be some manual editing? How much automated filtering can be done?

muffato avatar Sep 10 '22 11:09 muffato

The way it's derived is via a diff - so it likely won't be perfect for what the person wants (and you can tweak the PR to delete lines you don't want). There isn't currently a mechanism to do automated filtering, as my expectation was the user wouldn't remember what they want (or what's in the container), but let me know if you have ideas. We could indeed add some kind of filter.

vsoch avatar Sep 10 '22 17:09 vsoch

okay - I'm happy with this for now - it met my use case of needing a list of aliases (automatically) from a container, and it can be provided as a service to the extent that someone requests one or forks the repository and runs the workflow themself. Closing here, thanks for the discussion everyone!

vsoch avatar Oct 08 '22 18:10 vsoch