Publish appstream metadata
Is your feature request related to a problem? Please describe.
We have a request to publish AppStream metadata for our repos. It doesn't look like Pulp allows you to publish AppStream metadata today.
Describe the solution you'd like
Optimally, we could somehow supply the information to Pulp needed and it would publish the AppStream metadata when publishing the repo.
Describe alternatives you've considered
Maybe a hook of some sort that could be called after publish? The trick would be getting the artifacts back into Pulp to be part of the publication. Not sure how feasible this would be.
Additional context
It's worth noting that there's a createrepo feature request too. It was filed in 2017 though.
https://github.com/rpm-software-management/createrepo_c/issues/75
A few sticking points:
- Generating the metadata requires inspecting the full RPMs, which aren't available during the first stage of the sync, and might not be downloaded at all
  - So it's not something we can store in the database alongside the content unit
  - At publish time, we would either have to fail on on_demand repositories or ignore undownloaded packages
- Because it requires decompressing the whole RPM, it's expensive, so it might not be something you want to do for every publish, or for more packages than absolutely necessary
  - Maybe "latest version only"
  - Maybe the data needs to be cached somehow, except as mentioned we can't really put it on the content unit
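To make the two publish-time options for on_demand repositories concrete, here is a minimal sketch. `PackageRef`, `UndownloadedPackageError`, and `packages_for_appstream` are hypothetical names for illustration only and are not part of Pulp; the sketch just assumes each package either has a local file or does not.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PackageRef:
    """Hypothetical stand-in for a package in the repository being published."""
    nevra: str
    local_path: Optional[str]  # None when the RPM was never downloaded (on_demand)


class UndownloadedPackageError(Exception):
    """Raised when the 'fail' policy meets a package without a local file."""


def packages_for_appstream(
    packages: Iterable[PackageRef], on_missing: str = "skip"
) -> List[PackageRef]:
    """Return the packages whose archives can actually be inspected.

    on_missing="skip" ignores undownloaded packages;
    on_missing="fail" aborts on the first one.
    """
    if on_missing not in ("skip", "fail"):
        raise ValueError(f"unknown policy: {on_missing!r}")
    usable = []
    for pkg in packages:
        if pkg.local_path is None:
            if on_missing == "fail":
                raise UndownloadedPackageError(pkg.nevra)
            continue  # "skip": leave this package out of the AppStream data
        usable.append(pkg)
    return usable
```

With "skip", the published AppStream data would silently cover only the downloaded packages, so which default is right probably depends on how the repository is used.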
We don't use on-demand content currently so we're not impacted by the first issue. That said, either option you raise sounds reasonable to me.
~~Regarding the second issue, I guess I don't know enough about it because I assumed it would be stored as a published artifact for the publication?~~
Edit: Actually I think I understand the second issue you raise now. Agreed that it wouldn't be something you'd want to run every time and for every package.
@daviddavis Basically the package has a few different components:
- Headers, with a key-value-store kind of thing that stores metadata about the package
- An archive containing the entire contents of the package
Building the traditional metadata only requires reading data from the header, which is fast. (Although in our case we're just reading it from the database rather than reading the headers).
Building appstream metadata requires unpacking the archive within the package to extract things like icon files. So the package files might be there, but you still have to do a bunch of work with them, and that work might be expensive enough that you don't want to redo it from scratch every single time. We can't store that metadata in the database (it wouldn't make sense for a number of reasons), so it would be a matter of finding somewhere else to cache it.
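One way the "cache it somewhere else" idea could look is a small on-disk cache keyed by the package file's SHA-256, so the expensive extraction runs at most once per distinct RPM. This is only a sketch: `cached_fragment`, the JSON file layout, and the `extract` callback are all assumptions for illustration, not anything Pulp provides.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def file_sha256(path: Path) -> str:
    """Hash the file in chunks so large RPMs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_fragment(
    rpm_path: Path, cache_dir: Path, extract: Callable[[Path], dict]
) -> dict:
    """Return the AppStream fragment for an RPM, extracting only on a cache miss.

    `extract` is a hypothetical callback that unpacks the archive and returns
    JSON-serializable metadata; it is the expensive step being avoided.
    """
    entry = cache_dir / f"{file_sha256(rpm_path)}.json"
    if entry.exists():
        return json.loads(entry.read_text())
    fragment = extract(rpm_path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry.write_text(json.dumps(fragment))
    return fragment
```

Since the key is the content checksum, a package that appears in several repositories or publications would share one cache entry. If the checksum is already known from elsewhere, the hashing step could presumably be skipped.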
One helpful optimization we can do even without caching is to only generate appstream metadata for the latest version of each package available. That's totally fine and safe to do.
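A rough sketch of the "latest version only" filter follows. Note that the comparison here is a deliberately simplified approximation of RPM version ordering (it splits on digit and letter runs and ignores special cases such as `~` and `^`); a real implementation would want to use RPM's own comparison. `Pkg` and `latest_per_name` are hypothetical names, and grouping by name alone (ignoring arch) is also an assumption.

```python
import re
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Pkg(NamedTuple):
    name: str
    epoch: int
    version: str
    release: str


def _segments(value: str) -> List[Tuple[int, int, str]]:
    """Split into digit/letter runs; numeric runs sort above alphabetic ones."""
    return [
        (1, int(s), "") if s.isdigit() else (0, 0, s)
        for s in re.findall(r"[0-9]+|[A-Za-z]+", value)
    ]


def evr_key(pkg: Pkg):
    """Sort key approximating epoch:version-release ordering (simplified)."""
    return (pkg.epoch, _segments(pkg.version), _segments(pkg.release))


def latest_per_name(packages: Iterable[Pkg]) -> Dict[str, Pkg]:
    """Keep only the newest package for each name."""
    latest: Dict[str, Pkg] = {}
    for pkg in packages:
        current = latest.get(pkg.name)
        if current is None or evr_key(pkg) > evr_key(current):
            latest[pkg.name] = pkg
    return latest
```

Only the packages returned by `latest_per_name` would then need their archives unpacked, which cuts the expensive work to one package per name.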
In Copr, we allow people to enable this metadata in repositories. We want to move to Pulp, but without this feature the move would be a feature regression for us. However, this is a low priority and has a lot of rabbit holes. See:
- https://github.com/hughsie/appstream-glib/issues/301
- https://github.com/rpm-software-management/createrepo_c/issues/75
> One helpful optimization we can do even without caching is to only generate appstream metadata for the latest version of each package available.
Totally OK for Copr.
Could AppStream repository data fragments be stored in the database as part of the import process, so that repository generation is faster by working from the database entirely?
For package uploads, definitely. For synced packages, probably not without rearchitecting the entire way the sync pipeline works. The plugin has already handed off control by the time the artifacts get downloaded. And also, as mentioned, there's not even a guarantee of having the package due to on-demand.
We have a similar kind of issue w/r/t a feature request for keeping track of / filtering by the package signing keys. You can't get that info without parsing the package header directly.
It's been a while since I looked at appstream. I don't know if this would be useful or not, but theoretically speaking, if we could strip down the metadata provided to just what we can easily get from the database as it exists currently, that would be a much easier MVP to put together than trying to deal with extracting the package archives, caching expensive metadata, and whatnot.
If your goal is to fully replace appstream-data for Fedora or something though, obviously that wouldn't solve your problem.
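For what it's worth, the stripped-down MVP might look roughly like the sketch below, which builds a catalog XML document from fields that are already known without opening the package archive. Everything here is an assumption for illustration: the function name `build_minimal_catalog`, the dict keys, the `version="0.14"` attribute, and using the bare package name as the component `id` (real component IDs usually follow a reverse-DNS scheme). Whether software centers would find such a reduced catalog useful is not something this sketch establishes.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Mapping


def build_minimal_catalog(packages: Iterable[Mapping[str, str]], origin: str) -> str:
    """Build a bare-bones AppStream-style catalog from header-level fields only.

    Each mapping is assumed to carry "name" and "summary", plus an optional "url".
    No icons, screenshots, or desktop-file data are included.
    """
    root = ET.Element("components", {"version": "0.14", "origin": origin})
    for pkg in packages:
        comp = ET.SubElement(root, "component", {"type": "generic"})
        ET.SubElement(comp, "id").text = pkg["name"]  # placeholder ID, see note above
        ET.SubElement(comp, "pkgname").text = pkg["name"]
        ET.SubElement(comp, "name").text = pkg["name"]
        ET.SubElement(comp, "summary").text = pkg["summary"]
        if pkg.get("url"):
            ET.SubElement(comp, "url", {"type": "homepage"}).text = pkg["url"]
    return ET.tostring(root, encoding="unicode")
```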