Pulp doesn't handle conflicting package filenames properly
Version pulpcore 3.21.0.dev pulp-rpm 3.18.0.dev
Describe the bug If you add two packages with the same filename to a repo, then only one gets published/distributed (good). However, both show up in primary.xml (not so good) and the checksum can potentially not match (bad).
To Reproduce This is a bit of a contrived example but it demonstrates the problem. Basically you add two packages with the same filename to a repo and publish it.
API Commands
wget https://fixtures.pulpproject.org/rpm-unsigned/bear-4.1-1.noarch.rpm
wget https://fixtures.pulpproject.org/rpm-unsigned/camel-0.1-1.noarch.rpm
http --form :24817/pulp/api/v3/content/rpm/packages/ [email protected] relative_path=camel-0.1-1.noarch.rpm
http --form :24817/pulp/api/v3/content/rpm/packages/ [email protected]
http :24817/pulp/api/v3/repositories/rpm/rpm/ name=test
http :24817/pulp/api/v3/distributions/rpm/rpm/ name=test base_path=test repository=/pulp/api/v3/repositories/rpm/rpm/4d390079-60ff-4916-b80f-51b6756f2757/
http :24817/pulp/api/v3/repositories/rpm/rpm/4d390079-60ff-4916-b80f-51b6756f2757/modify/ add_content_units:='["/pulp/api/v3/content/rpm/packages/290113b4-0ca2-4bb4-8b5a-725d471d865a/", "/pulp/api/v3/content/rpm/packages/8df08649-2272-4006-8270-eee1d453edd2/"]'
http :24817/pulp/api/v3/publications/rpm/rpm/ repository=/pulp/api/v3/repositories/rpm/rpm/4d390079-60ff-4916-b80f-51b6756f2757/
Expected behavior When I view the packages I see only one (camel) which is expected:
http :24816/pulp/content/test/Packages/c/
When I get its checksum, I get the checksum for the package named formerly known as bear-4.1-1.noarch.rpm which is also acceptable:
http :24816/pulp/content/test/Packages/c/camel-0.1-1.noarch.rpm | sha256sum
ceb0f0bb58be244393cc565e8ee5ef0ad36884d8ba8eec74542ff47d299a34c1
However, when I download the primary.xml, I see two entries: one for bear at Packages/b/bear-4.1-1.noarch.rpm (which doesn't exist) with a checksum of ceb0f0 and one for camel at Packages/c/camel-0.1-1.noarch.rpm which does exist but it has a checksum of c5c34 (which doesn't match the checksum of the actual package at /pulp/content/test/Packages/c/camel-0.1-1.noarch.rpm).
Here's a copy of the primary.xml:
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://linux.duke.edu/metadata/common" xmlns:rpm="http://linux.duke.edu/metadata/rpm" packages="2">
<package type="rpm">
<name>bear</name>
<arch>noarch</arch>
<version epoch="0" ver="4.1" rel="1"/>
<checksum type="sha256" pkgid="YES">ceb0f0bb58be244393cc565e8ee5ef0ad36884d8ba8eec74542ff47d299a34c1</checksum>
<summary>A dummy package of bear</summary>
<description>A dummy package of bear</description>
<packager></packager>
<url>http://tstrachota.fedorapeople.org</url>
<time file="1659114662" build="1331831374"/>
<size package="1846" installed="42" archive="296"/>
<location href="Packages/b/bear-4.1-1.noarch.rpm"/>
<format>
<rpm:license>GPLv2</rpm:license>
<rpm:vendor></rpm:vendor>
<rpm:group>Internet/Applications</rpm:group>
<rpm:buildhost>smqe-ws15</rpm:buildhost>
<rpm:sourcerpm>bear-4.1-1.src.rpm</rpm:sourcerpm>
<rpm:header-range start="280" end="1697"/>
<rpm:provides>
<rpm:entry name="bear" flags="EQ" epoch="0" ver="4.1" rel="1"/>
</rpm:provides>
</format>
</package>
<package type="rpm">
<name>camel</name>
<arch>noarch</arch>
<version epoch="0" ver="0.1" rel="1"/>
<checksum type="sha256" pkgid="YES">c5c34e1843847990d2c79b936309d6257279e26f907e20f9f58073a14525e1ef</checksum>
<summary>A dummy package of camel</summary>
<description>A dummy package of camel</description>
<packager></packager>
<url>http://tstrachota.fedorapeople.org</url>
<time file="1659114663" build="1331831360"/>
<size package="1847" installed="42" archive="296"/>
<location href="Packages/c/camel-0.1-1.noarch.rpm"/>
<format>
<rpm:license>GPLv2</rpm:license>
<rpm:vendor></rpm:vendor>
<rpm:group>Internet/Applications</rpm:group>
<rpm:buildhost>smqe-ws15</rpm:buildhost>
<rpm:sourcerpm>camel-0.1-1.src.rpm</rpm:sourcerpm>
<rpm:header-range start="280" end="1697"/>
<rpm:provides>
<rpm:entry name="camel" flags="EQ" epoch="0" ver="0.1" rel="1"/>
</rpm:provides>
</format>
</package>
</metadata>
FYI, it sounds like we'll not be allowing users to set relative_path on upload so the package should be named using nevra and thus, this isn't going to be a high priority for us.
https://bugzilla.redhat.com/show_bug.cgi?id=2151657
K so I'm finally looking at this one. Here's the state of things.
Pulp has one repository-level uniqueness filter (not constraint in the database sense, it's enforced by logic). That filter as it currently exists is N + E + V + R + A + "location_href".
location_href is filled in with NEVRA if you don't manually override it using the "relative_path" parameter. But you can set that parameter to anything, which is not great, especially because it isn't part of the global package unique constraint (so whatever value you use during an upload will be used everywhere).
Scenarios where it behaves CORRECTLY:
-
If you add two packages with the same NEVRA and different checksums (one signed one unsigned) to one repository without setting the "relative_path", then the repo-unique-filter will match, and the repo will end up with only one, as the newer one will replace the older one.
-
If you add two packages with the same NEVRA and different checksums to one repository while using the same "relative_path" for each
Scenarios where it behaves KIND OF CORRECTLY:
- If you add two packages with the same NEVRA (one signed one unsigned) to one repository with different "relative_path"s set, then both will be allowed into the same repository, but when publishing only one will appear in the metadata. The metadata will not match the metadata pulp describes for the repository version, though.
- As per the fix we applied in this patch, if you provide a "relative_path" of
bear-1.0-1.noarch.rpmwhen uploading the packagecamel-1.0-1.noarch.rpm, the package will be published under the namebear-1.0-1.noarch.rpm- This totally sucks and is stupid, but some people have done something comparable.
Scenarios where it behaves INCORRECTLY:
- If you take two different-NEVRA packages and give them overlapping "relative_path" on purpose, then they will both be allowed in the same repository, and they will all appear in the metadata, but only one will "win" at content-distribution time.
- So take the last example - if
bear-1.0-1.noarch.rpmexists in the same repository, then the two will silently conflict and you might have issues w/ DNF. - The symptoms are the same as a previous bug we encountered, described here except that was worse because it happened whenever you used
relative_pathwith a value that didn't exactly match the one expected based on the NEVRA
- So take the last example - if
The general TL;DR is that using the "relative_path" parameter on the upload endpoint generally leads to suboptimal outcomes.
How to Fix:
- We could remove the
relative_pathparam from our API entirely or make it no-op. Technically against SEMVER, but if we consider it broken, there's wiggle room. - We could provide two repo uniqueness filters - one for NEVRA and one for location_href. But A) Pulp doesn't allow that currently and B) it doesn't resolve the problem that that package + checksum is already immutably tied to a BS relative_path. i.e. https://github.com/pulp/pulp_rpm/issues/2931
We could do one or both but either is painful and not a complete fix. Fixing data that already exists is a can of worms I'm going to discuss on the BZ not here. That's a separate conversation.
Neither of these feel like great options? But I would probably prefer the former in the short-term.
We don't allow regular users to set the relative_path for packages but I think we want to allow admins to do so--let me double check on this. Would having a config variable that defaults to False (e.g. ALLOW_UNSAFE_SETTING_OF_RPM_NAMES) be an option?
Given Daniels comment and analyses, I agree that providing relative_path on the upload creates more problems than benefits, where the benefits are still unclear.
I would be in favor of removing in the upload api the relative_path or making it no-op. We'd also then attempt to fix already existing data by properly setting the package.location_href and content_artifact.relative_path stored in DB.
@daviddavis everything that has in it's name unsafe implies no guarantee and ask for trouble for us to maintain somehow the corner case as for the user who will get non-deterministic results. Unless there is a justifiable strong usecase to keep relative_path on the upload I'd remove it/make it no-op.
I checked and we don't the ability to set filenames when uploading rpms.
We'd also then attempt to fix already existing data by properly setting the package.location_href and content_artifact.relative_path stored in DB.
This makes me a bit nervous. Would it be possible to let users opt and/or have a dry run option that would list packages that have bad location_hrefs/relative_paths?
@daviddavis That probably won't be a problem for the django management command