libarchive
libarchive can't handle "stripped" RPM archives
I'm using the latest release (3.6.1-1.3) and noticed that the following hello world program:
#include <archive.h>
#include <archive_entry.h>
#include <stdio.h>   /* fprintf, printf */
#include <stdlib.h>  /* exit */

int main(int argc, char **argv)
{
    if (argc == 1)
    {
        fprintf(stderr, "Usage: ./a.out FILE\n");
        exit(0);
    }

    struct archive *a;
    struct archive_entry *entry;
    int r;

    a = archive_read_new();
    archive_read_support_filter_all(a);
    archive_read_support_format_all(a);
    r = archive_read_open_filename(a, argv[1], 10240); // Note 1
    if (r != ARCHIVE_OK)
    {
        fprintf(stderr, "ERR: %s\n", archive_error_string(a));
        exit(1);
    }

    /* List every entry name, skipping over the entry data. */
    while (archive_read_next_header(a, &entry) == ARCHIVE_OK)
    {
        printf("%s\n", archive_entry_pathname(entry));
        archive_read_data_skip(a); // Note 2
    }

    r = archive_read_free(a); // Note 3
    if (r != ARCHIVE_OK)
        exit(1);
}
Does not work for:
wget https://download.opensuse.org/debug/tumbleweed/repo/oss/x86_64/nodejs-electron-debuginfo-19.0.11-1.1.x86_64.rpm
gcc archive.c -larchive && ./a.out nodejs-electron-debuginfo-19.0.11-1.1.x86_64.rpm
ERR: Unrecognized archive format
Note that rpm2cpio also complains about nodejs-electron-debuginfo-19.0.11-1.1.x86_64.rpm, and I can confirm that rpm2archive (from the RPM project) can extract it.
@fche
Why do you think this has to do with files over 4GB?
It looks like this is an RPM package containing a zstd-compressed archive. Libarchive does not seem to have any problems with the RPM wrapper or the zstd compression, but it does not recognize "07070X" as a valid CPIO signature. A quick Google search found a few references to "rpm-style stripped cpio files"; this appears to be a new cpio variant invented by the RPM folks that libarchive does not currently support.
Background: Libarchive's existing RPM support is quite simple: It recognizes and strips off the RPM wrapper and then allows the decompression filters and CPIO format handler to process the contents. This worked well for the original RPM format, where the "body" was a standard self-contained CPIO archive.
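To make the "recognizes the RPM wrapper" step concrete, here is a minimal sketch of spotting the RPM lead at the start of a file. It assumes the classic 96-byte lead beginning with the magic bytes ED AB EE DB; `looks_like_rpm_lead` and `RPM_LEAD_SIZE` are hypothetical names for illustration and are not taken from libarchive's source.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: the classic RPM lead is 96 bytes long. */
#define RPM_LEAD_SIZE 96

/* Assumption: the lead starts with these four magic bytes. */
static const unsigned char RPM_LEAD_MAGIC[4] = { 0xED, 0xAB, 0xEE, 0xDB };

/*
 * Hypothetical helper (not a libarchive function): returns 1 if buf
 * holds at least a full lead that starts with the RPM magic, else 0.
 */
int looks_like_rpm_lead(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len < RPM_LEAD_SIZE)
        return 0;
    return memcmp(buf, RPM_LEAD_MAGIC, sizeof(RPM_LEAD_MAGIC)) == 0;
}
```

A wrapper-stripping filter along these lines would still have to skip the headers that follow the lead before handing the payload to the decompression and CPIO layers; that part is not shown here.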
In the intervening years, RPM seems to have changed its architecture so that this simple design no longer works. The "body" of this particular RPM contains file contents but none of the associated metadata (filenames, types, etc.) that is necessary to properly extract it. Instead, that data is stored in the RPM header. To properly handle this in libarchive, we would probably need a complete "RPM format" handler that combines RPM header parsing, compression identification, and breaking out the contents. This is no more complex than other formats that libarchive supports, but it's a lot more complicated than our existing RPM handling.
In the short term, we could add some code to libarchive's standard CPIO format to identify these "stripped" RPM bodies and emit a more specific error message.
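As a rough illustration of that short-term idea, the sketch below classifies the six-byte ASCII signature at the start of a CPIO header and produces a more specific message for the "07070X" case. The function and enum names are hypothetical and are not part of libarchive; only the ASCII signatures are handled (the binary cpio variant is ignored), and the message wording is just a suggestion.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of ASCII cpio signatures (not libarchive API). */
enum cpio_kind {
    CPIO_UNKNOWN = 0,
    CPIO_ODC,           /* "070707": portable ASCII (odc) */
    CPIO_NEWC,          /* "070701": SVR4 "newc" */
    CPIO_NEWC_CRC,      /* "070702": SVR4 "newc" with checksum */
    CPIO_RPM_STRIPPED   /* "07070X": rpm-style stripped cpio */
};

/* Inspect the first six bytes of a candidate cpio header. */
enum cpio_kind classify_cpio_magic(const char *buf, size_t len)
{
    if (buf == NULL || len < 6)
        return CPIO_UNKNOWN;
    if (memcmp(buf, "070707", 6) == 0)
        return CPIO_ODC;
    if (memcmp(buf, "070701", 6) == 0)
        return CPIO_NEWC;
    if (memcmp(buf, "070702", 6) == 0)
        return CPIO_NEWC_CRC;
    if (memcmp(buf, "07070X", 6) == 0)
        return CPIO_RPM_STRIPPED;
    return CPIO_UNKNOWN;
}

/* Map the classification to a diagnostic string (wording is illustrative). */
const char *cpio_kind_message(enum cpio_kind kind)
{
    switch (kind) {
    case CPIO_ODC:
    case CPIO_NEWC:
    case CPIO_NEWC_CRC:
        return "Recognized cpio signature";
    case CPIO_RPM_STRIPPED:
        return "RPM-style stripped cpio body is not supported "
               "(file metadata is stored in the RPM header)";
    default:
        return "Unrecognized archive format";
    }
}
```

With something like this in the CPIO bid path, the package from the report would yield the stripped-cpio message instead of the generic "Unrecognized archive format".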
Thank you for the analysis and the explanation. Note that my use case is the debuginfod daemon (https://sourceware.org/elfutils/Debuginfod.html), which parses RPM files, and apparently we have at least 2 packages in openSUSE Tumbleweed that exceed 4GB in size after decompression.
I guess this was invented more than 10 years ago to support files larger than 4GB inside RPM packages:
https://github.com/rpm-software-management/rpm/commit/68c7cf80d7b763498d0077daa91f649bc209e7ae