libarchive
libarchive copied to clipboard
Helper function for extracting `struct archive_entry` into `FILE*` akin to `archive_read_data_into_fd(..)`
Maybe I'm mistaken (https://github.com/libarchive/libarchive/issues/146), but I found no simple helpers / recipes on extracting an archive_entry to fd (could be memfd or a network stream), FILE* (could be fmemopen-produced FILE*) or simply a memory buffer.
Existing function for extracting into fd: https://github.com/bramp/libarchive/blob/master/libarchive/archive_read_data_into_fd.c
In libarchive repo I found two examples:
- https://github.com/libarchive/libarchive/wiki/Examples#A_Complete_Extractor
- https://github.com/libarchive/libarchive/blob/master/examples/untar.c
But both use 6 functions archive_write_disk_new / archive_write_header / archive_write_data_block / archive_write_finish_entry / archive_write_close / archive_write_free functions to extract to a directory tree structure
Are functions archive_write_open_fd() / archive_write_open_FILE / archive_write_open_memory achieving such extraction of raw entry bytes to a given fd / FILE* / mem? These function names appear as if they were purposed for archive creation, not raw entry extraction. Even if so, I would propose to simplify it and add 1-function call helpers to achieve this (at least as recipes in https://github.com/libarchive/libarchive/wiki/Examples or to https://github.com/libarchive/libarchive/tree/master/examples).
A very related ask - ideally we'd like to be able to seek to a given archive_entry and extract the entry:
- https://github.com/libarchive/libarchive/issues/1511
The helpers below assume that seek to the needed archive_entry is already done by the outer loop. Can one do the seek to the archive_entry without a loop? (of course for the underlying media which support a seek: disk file, memfd, mem buffer etc)
I rolled something like this which does extraction from a seeked archive in one-shot.
size_t extract_to_fd(struct archive* a, int fd)
{
const void* buff;
size_t len;
int64_t offset;
size_t size = 0;
for(;;)
{
int r = archive_read_data_block(a, &buff, &len, &offset);
if (r == ARCHIVE_EOF || r != ARCHIVE_OK)
break;
assert(r == ARCHIVE_OK);
size += write(fd, buff, len);
}
return size;
}
size_t extract_to_file(struct archive* a, FILE* stream)
{
const void* buff;
size_t len;
int64_t offset;
for(;;)
{
int r = archive_read_data_block(a, &buff, &len, &offset);
if (r == ARCHIVE_EOF || r != ARCHIVE_OK)
break;
assert(r == ARCHIVE_OK);
size += fwrite(buff, 1, size, stream);
}
return size;
}
size_t extract_to_mem(struct archive* a, void* buf, size_t cnt)
{
const void* buff;
size_t len;
int64_t offset;
size_t size = 0;
for(;;)
{
int r = archive_read_data_block(a, &buff, &len, &offset);
if (r == ARCHIVE_EOF || r != ARCHIVE_OK)
break;
assert(r == ARCHIVE_OK);
size_t w = len <= cnt ? len : cnt;
memcpy((char*)buf + size, buff, w);
cnt -= w;
}
return size;
}
Have you looked at archive_read_data_into_fd()? That should already handle the fd case.
For extracting into memory, you'll want to use archive_read_data() and allocate the memory in advance. You should be careful to check archive_entry_size_is_set() to make sure the size of the entry is actually known at that time. There are some unusual cases where the size of the entry can only be found out by reading the entire data.
Ohm nice, I guess then it's just the FILE* helper that's missing (its impl should be very similar to archive_read_data_into_fd). It's good to support it as libc stdio often supports working with custom stream implementations via cookie mechanism: https://www.gnu.org/software/libc/manual/html_node/Streams-and-Cookies.html . One fairly supported impl of this mechanism is fmemopen and open_memstream. E.g. extracting into open_memstream-returned FILE* lets the libc extend the buffer dynamically, which is nice for some simple applications.
I updated this issue name accordingly.
A Pull Request would be appreciated. (Please include tests when you do.)
Should a function like this work? (a simple change lseek->fseek and write->fwrite of https://github.com/bramp/libarchive/blob/master/libarchive/archive_read_data_into_fd.c)
Are there any problems with long being unsufficiently short dtype? or is the cast fine?
Regarding the test, should I just copy https://github.com/libarchive/libarchive/blob/915c9f83a4f2fd63aa82ae006b6bd9661f42a7a1/libarchive/test/test_read_data_large.c to change it to use this archive_read_data_into_FILE?
Do I also need to change the copyright / __FBSDID? Or how does this work?
/*-
* Copyright (c) 2003-2007 Tim Kientzle
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR(S) ``AS IS'' AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
* IN NO EVENT SHALL THE AUTHOR(S) BE LIABLE FOR ANY DIRECT, INDIRECT,
* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
* NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
* THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include "archive_platform.h"
__FBSDID("$FreeBSD: src/lib/libarchive/archive_read_data_into_FILE.c,v 1.16 2008/05/23 05:01:29 cperciva Exp $");
#ifdef HAVE_SYS_TYPES_H
#include <sys/types.h>
#endif
#ifdef HAVE_ERRNO_H
#include <errno.h>
#endif
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
#include "archive.h"
#include "archive_private.h"
/* Maximum amount of data to write at one time. */
#define MAX_WRITE (1024 * 1024)
/*
* This implementation minimizes copying of data and is sparse-file aware.
*/
int
archive_read_data_into_FILE(struct archive *a, FILE* f)
{
int r;
const void *buff;
size_t size, bytes_to_write;
ssize_t bytes_written, total_written;
off_t offset;
off_t output_offset;
__archive_check_magic(a, ARCHIVE_READ_MAGIC, ARCHIVE_STATE_DATA, "archive_read_data_into_FILE");
total_written = 0;
output_offset = 0;
while ((r = archive_read_data_block(a, &buff, &size, &offset)) ==
ARCHIVE_OK) {
const char *p = buff;
if (offset > output_offset) {
if(fseek(f,
(long)(offset - output_offset), SEEK_CUR) != 0)
{
archive_set_error(a, errno, "Seek error");
return (ARCHIVE_FATAL);
}
output_offset = (off_t)ftell(f);
if (output_offset != offset) {
archive_set_error(a, errno, "Seek error: offset does not match the expected value");
return (ARCHIVE_FATAL);
}
}
while (size > 0) {
bytes_to_write = size;
if (bytes_to_write > MAX_WRITE)
bytes_to_write = MAX_WRITE;
bytes_written = fwrite(p, 1, bytes_to_write, f);
if (bytes_written < 0) {
archive_set_error(a, errno, "Write error");
return (ARCHIVE_FATAL);
}
output_offset += bytes_written;
total_written += bytes_written;
p += bytes_written;
size -= bytes_written;
}
}
if (r != ARCHIVE_EOF)
return (r);
return (ARCHIVE_OK);
}
That looks good! Could you put up a PR and we can work through the remaining details.
- We'll need to add a test case.
- Standard C requires
fseekto return 0 on success. So I think yourfseeklogic here is wrong.
- Standard C requires
fseekto return 0 on success. So I think yourfseeklogic here is wrong.
Thanks! Right! Also, a question on data dtypes. fseek/ftell use long to accept and return offset. Should I use instead fseeko/ftello which accept/return off_t (which can be forced to be 64-bits with -D_FILE_OFFSET_BITS=64)? Or what is the stance of libarchive on this issue? (e.g. maybe we need to add support detection for these functions to configure scripts? and also add a macro redirecting these functions to _fseeki64/_ftelli64 on Windows)
Or maybe could we delete the output_offset calculation for the fseek/fwrite/ftell variant? (as far as I can see it is used only for sanity-checks to control that sparse file writes are working as expected) Then we also do not need the ftell call
Note: -D_FILE_OFFSET_BITS=64 does not work on all platforms (only Linux), so we can't rely on it in general. So anything that works with off_t does have to be careful about 32-bit/64-bit problems.
I think we can just delete the output_offset calculation. For the fd form, that's somewhat needed because the seek system call can succeed but have no effect. That shouldn't be possible for FILE * operations.
Recently having gone through a painful audit of glibc/musl/bionic for 32/64 bit off_t and related functions, here are some general tips:
- we should always set
-D_FILE_OFFSET_BITS=64, if we don't already - check if it changes your ABI- already set in libarchive, v3 removed any off_t/etc from the API
- never use the 64 function variants - the C runtime goes through various lengths to preserve ABI, at cost of developer/maintenance sanity
- if you rely on it being certain size - sprinkle
static_assert(sizeof(foo) == bar)in your code, both for build-checking and self-documentation
Hi @kientzle @evelikov,
I created a draft:
- https://github.com/libarchive/libarchive/pull/2504
Could you please take a look to settle the questions remaining.
And if all good, I'll then try to move the new function into a new file, write the docs, tests etc