pulp_rpm

Improve modulemds* creation API

Open pedro-psb opened this issue 1 year ago • 9 comments

Problem

The API for creating modulemds, modulemds_defaults and modulemds_obsoletes is a bit redundant:

  • There is a set of fields that we want to keep track of in the db (e.g., module, stream, profile, etc.)
  • There is a snippet field (a yaml file) that "should" contain the same data as the set of fields above (that's not checked).

This allows creating inconsistent records of this type of content.

Proposal

Some ideas (not mutually exclusive) for a more ergonomic design are:

  1. Require only a snippet, with repository and packages (for modulemds) as optional parameters. The additional data that Pulp wants to keep track of should be present in the snippet, or we won't accept it.
  2. For modulemds: If packages are provided (which are pulp_hrefs), validate that they strictly match the rpm->artifacts in the yaml.

The snippet may be a string or a file, I'm not sure what is best.
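A minimal sketch of the two ideas above, assuming the snippet has already been parsed from YAML into a dict that follows the modulemd v2 layout (`data.name`, `data.stream`, `data.artifacts.rpms`, and so on). The function names and the set of required keys are hypothetical, not existing pulp_rpm code, and the provided packages are assumed to have been resolved from hrefs to NEVRA strings beforehand.

```python
# Hypothetical helpers; not part of pulp_rpm.
# Assumption: these are the keys Pulp wants to track in the db.
REQUIRED_KEYS = ("name", "stream", "version", "context", "arch")


def derive_modulemd_fields(doc):
    """Idea 1: derive every tracked field from the parsed snippet, or reject it."""
    data = doc.get("data") or {}
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError("snippet is missing required keys: " + ", ".join(missing))
    fields = {key: data[key] for key in REQUIRED_KEYS}
    # Artifacts are optional in the snippet; default to an empty list.
    fields["artifacts"] = list((data.get("artifacts") or {}).get("rpms") or [])
    return fields


def validate_packages(fields, provided_nevras):
    """Idea 2: explicitly provided packages must match the snippet exactly."""
    listed = set(fields["artifacts"])
    provided = set(provided_nevras)
    if listed != provided:
        diff = sorted(listed ^ provided)
        raise ValueError("packages do not match snippet artifacts: " + ", ".join(diff))
```

With this shape, a request carrying only the snippet is enough to populate the record, and a mismatch between the snippet and any extra parameters is rejected up front instead of being stored.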

Additional Context

Motivated by https://github.com/pulp/pulp_rpm/issues/3427

pedro-psb avatar Apr 19 '24 14:04 pedro-psb

After further discussion w/ rpm team, we've concluded that:

  • source of truth: All the data should be derived from the snippet (including packages), as the snippet is what is going to be used in the generation of the metadata (publication step). This is to avoid inconsistent data state.
  • pulp object linking (Modulemds):
    • The current packages parameter of the API, which today can be filled with package hrefs, should instead be populated using information in the snippet. The snippet provides the packages' NEVRAs (through the name), and that's going to be used internally to look up each NEVRA in a given repository.
    • If the packages can't be found, Pulp should raise an error.
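The lookup described above could be sketched roughly as follows. This is only an illustration: `parse_nevra`, `link_packages`, and the `repo_index` mapping are hypothetical names, the artifact strings are assumed to use the `name-epoch:version-release.arch` form, and the hrefs in the usage example are made up.

```python
from typing import NamedTuple


class Nevra(NamedTuple):
    name: str
    epoch: str
    version: str
    release: str
    arch: str


def parse_nevra(text):
    """Parse 'name-epoch:version-release.arch'; a missing epoch is treated as 0."""
    rest, dot, arch = text.rpartition(".")
    rest, dash1, release = rest.rpartition("-")
    name, dash2, evr = rest.rpartition("-")
    if not (dot and dash1 and dash2 and name):
        raise ValueError(f"not a valid NEVRA: {text!r}")
    epoch, colon, version = evr.partition(":")
    if not colon:
        # No explicit epoch in the string, so the whole part is the version.
        epoch, version = "0", evr
    return Nevra(name, epoch, version, release, arch)


def link_packages(artifacts, repo_index):
    """Return the hrefs for all artifacts, or raise if any is not in the repository.

    repo_index is an assumed mapping of Nevra -> package href for one repository.
    """
    wanted = [parse_nevra(item) for item in artifacts]
    missing = [item for item, key in zip(artifacts, wanted) if key not in repo_index]
    if missing:
        raise LookupError("packages not found in repository: " + ", ".join(missing))
    return [repo_index[key] for key in wanted]
```

For example:

```python
index = {parse_nevra("bar-0:1.23-1.module_deadbeef.x86_64"): "/hypothetical/href/1/"}
link_packages(["bar-0:1.23-1.module_deadbeef.x86_64"], index)  # list with one href
link_packages(["baz-0:2.0-1.noarch"], index)                   # raises LookupError
```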

Open questions

When should the attempt to link snippet packages with Pulp Packages be performed?

  • immediate (on upload processing):
    • This is the sanest choice for the maintainability/predictability of Pulp.
    • Downside is that build systems should adjust their workflow to ensure that modulemds are uploaded as a last stage. How inconvenient is that for a complex build system?
  • lazy:
    • This would allow the build-system workflow of throwing packages and modulemds in any order and only trying the linking when it matters (e.g., on publish or when using the Modify API w/ copy/remove).
    • General downside is that it adds more complexity to Pulp.
    • Variants:
      • on-publish: When publishing, try to link the packages. Possibly also when using the Modify API.
      • on-first-access: On the first attempt to access Modulemd.packages
      • on-explicit-link-request: Have an endpoint to trigger the linking
      • on-repo-association: I guess we shouldn't, but we can have "repositoryless" content. In that case, we could try looking in the repository when it's associated with a Repo, which may or may not be on upload.
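To make the trade-off concrete, here is a rough sketch of the on-first-access variant. `LazyModulemd` and its `resolver` callable are hypothetical stand-ins, not pulp_rpm's model: the record is accepted without checking its artifacts, and linking is only attempted when `packages` is first read.

```python
class LazyModulemd:
    """Hypothetical stand-in for a Modulemd record with lazy package linking."""

    def __init__(self, artifacts, resolver):
        self.artifacts = list(artifacts)
        # Assumed callable: NEVRA string -> package href, or None if unknown.
        self._resolver = resolver
        self._packages = None  # filled in on the first successful access

    @property
    def packages(self):
        if self._packages is None:
            resolved = {nevra: self._resolver(nevra) for nevra in self.artifacts}
            missing = sorted(n for n, href in resolved.items() if href is None)
            if missing:
                # The failure surfaces here, possibly long after the upload.
                raise LookupError("unresolved packages: " + ", ".join(missing))
            self._packages = list(resolved.values())
        return self._packages
```

Upload order no longer matters with this design, but a broken module is only detected when something reads `packages`, which is the extra complexity the immediate option avoids.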

Where should Modulemd look for packages?

  • In the repository it's in (latest RepoVer). If it's not in a repository, then what?
  • In the global set of available packages

pedro-psb avatar May 09 '24 18:05 pedro-psb

@daviddavis Does Microsoft have any interest in or need for uploading their own modules directly into Pulp (without a sync)? If so, do you have any issues with the current modulemd creation API and do you have any feedback on what it should look like?

dralley avatar May 16 '24 15:05 dralley

No, we don't use modules and haven't had a publisher ask to use them. Thanks for checking though.

daviddavis avatar May 16 '24 16:05 daviddavis

@javihernandez, it would be nice to have further feedback on this.

There are two improvements I'm trying to do with this:

  1. Improve user experience of the API
  2. Make Modulemd storage/processing consistent

I want to know how that may affect the inconvenience you reported early on with the distributed upload of modular packages, which was related to the immutability of Modulemds.

I've proposed that, when the Modulemd is added to a Repository, Pulp will try to find the Pulp Packages (matching the NEVRAs listed in the Modulemd) in the context of the RepositoryVersion it's being added to. The main advantage of this is that it ensures Modulemd and Repository consistency*.

For your workflow, that means you could upload the Module before the end of the build, but you would still need to add the Module (via its href) to the final Repository at the end, if the uploads are successful. I'm not sure if that's helpful or not in the context of your workflow. Wdyt?


*We've been discussing how RpmRepository constraints/consistency are well suited for Distribution workflows, but not so much for build-system workflows. We still need to understand build-system requirements better so we can offer first-class support for them.

pedro-psb avatar May 16 '24 18:05 pedro-psb

Also, a more minor question about this API improvement: I'm inclined to make the snippet upload a File rather than a String (as it is currently), because we have similar endpoints which use a File. Any preference here?

pedro-psb avatar May 16 '24 18:05 pedro-psb

Open questions

When should the attempt to link snippet packages with Pulp Packages be performed?

I think it makes sense to expect the module snippet to be uploaded last, meaning that all the rpms it mentions should already be present in pulp; otherwise the module should be considered corrupted. There is a similar workflow in the container registry, where the manifest.json that describes an image composed of layers is uploaded last, and its upload is rejected if not all layers are already present in pulp. This works well for us and makes sure pulp creates all the necessary relations and guarantees 'composite' content integrity.

Where should Modulemd look for packages?

We should always look at the latest repo version; if the package(s) are not available, fail the upload with a meaningful message. The package might be present in pulp, but in another repo, and we do not want to create relations in that case. Moreover, we would still want to fail the module upload into repoB if repoB does not have the necessary packages, even if the same module with its packages is already present in repoA.
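A rough sketch of that repo-scoped rule, under the assumption that the latest version of each repository can be viewed as a set of NEVRA strings. `check_module_upload` and `latest_versions` are hypothetical names used only for illustration.

```python
def check_module_upload(repo_name, artifacts, latest_versions):
    """Fail unless every artifact is in the latest version of the target repo.

    latest_versions is an assumed mapping of repo name -> set of NEVRA strings.
    Packages that exist only in other repositories do not count.
    """
    available = latest_versions.get(repo_name, set())
    missing = sorted(set(artifacts) - available)
    if missing:
        raise ValueError(
            f"cannot add module to {repo_name!r}; packages missing from its "
            f"latest version: {', '.join(missing)}"
        )
```

With `latest_versions = {"repoA": {"bar-0:1.23-1.x86_64"}, "repoB": set()}`, uploading a module that lists `bar-0:1.23-1.x86_64` passes for repoA and fails for repoB, even though the package exists elsewhere in Pulp.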

ipanova avatar May 20 '24 16:05 ipanova

@ipanova There is a detail about linking the packages "on upload", strictly speaking, because Repository is optional. So either we make it required, or we trigger the link "on repover creation", when it's effectively being added to a repository.

pedro-psb avatar May 20 '24 21:05 pedro-psb

@pedro-psb I thought we made the upload always require a repo; otherwise the content without a repo is considered orphaned and the user will not have access to it, because the permissions on the content are scoped off the repo permissions. But yes, I believe we could also do the linking somewhere at the finalize_repo_version step, which should fail if the version-in-progress does not contain the necessary packages.

ipanova avatar May 21 '24 13:05 ipanova

ok, we do not require a repo to be provided for admins https://github.com/pulp/pulpcore/blob/main/pulpcore/app/global_access_conditions.py#L488

ipanova avatar May 21 '24 14:05 ipanova