libarchive icon indicating copy to clipboard operation
libarchive copied to clipboard

Helper function for extracting `struct archive_entry` into `FILE*` akin to `archive_read_data_into_fd(..)`

Open vadimkantorov opened this issue 1 year ago • 9 comments

Maybe I'm mistaken (https://github.com/libarchive/libarchive/issues/146), but I found no simple helpers / recipes on extracting an archive_entry to fd (could be memfd or a network stream), FILE* (could be fmemopen-produced FILE*) or simply a memory buffer.

Existing function for extracting into fd: https://github.com/bramp/libarchive/blob/master/libarchive/archive_read_data_into_fd.c

In libarchive repo I found two examples:

  • https://github.com/libarchive/libarchive/wiki/Examples#A_Complete_Extractor
  • https://github.com/libarchive/libarchive/blob/master/examples/untar.c

But both use 6 functions archive_write_disk_new / archive_write_header / archive_write_data_block / archive_write_finish_entry / archive_write_close / archive_write_free functions to extract to a directory tree structure

Are functions archive_write_open_fd() / archive_write_open_FILE / archive_write_open_memory achieving such extraction of raw entry bytes to a given fd / FILE* / mem? These function names appear as if they were purposed for archive creation, not raw entry extraction. Even if so, I would propose to simplify it and add 1-function call helpers to achieve this (at least as recipes in https://github.com/libarchive/libarchive/wiki/Examples or to https://github.com/libarchive/libarchive/tree/master/examples).

A very related ask - ideally we'd like to be able to seek to a given archive_entry and extract the entry:

  • https://github.com/libarchive/libarchive/issues/1511

The helpers below assume that seek to the needed archive_entry is already done by the outer loop. Can one do the seek to the archive_entry without a loop? (of course for the underlying media which support a seek: disk file, memfd, mem buffer etc)

I rolled something like this which does extraction from a seeked archive in one-shot.

size_t extract_to_fd(struct archive* a, int fd)
{
    const void* buff;
    size_t len;
    int64_t offset;
    size_t size = 0;
    for(;;)
    {
        int r = archive_read_data_block(a, &buff, &len, &offset);
        if (r == ARCHIVE_EOF || r != ARCHIVE_OK)
            break;
        assert(r == ARCHIVE_OK);
        size += write(fd, buff, len);
    }
    return size;
}

size_t extract_to_file(struct archive* a, FILE* stream)
{
    const void* buff;
    size_t len;
    int64_t offset;
    for(;;)
    {
        int r = archive_read_data_block(a, &buff, &len, &offset);
        if (r == ARCHIVE_EOF || r != ARCHIVE_OK)
            break;
        assert(r == ARCHIVE_OK);
        size += fwrite(buff, 1, size, stream);
    }
    return size;
}

size_t extract_to_mem(struct archive* a, void* buf, size_t cnt)
{
    const void* buff;
    size_t len;
    int64_t offset;
    size_t size = 0;
    for(;;)
    {
        int r = archive_read_data_block(a, &buff, &len, &offset);
        if (r == ARCHIVE_EOF || r != ARCHIVE_OK)
            break;
        assert(r == ARCHIVE_OK);
        size_t w = len <= cnt ? len : cnt;
        memcpy((char*)buf + size, buff, w);
        cnt -= w;
    }
    return size;
}

vadimkantorov avatar Aug 29 '24 15:08 vadimkantorov

Have you looked at archive_read_data_into_fd()? That should already handle the fd case.

For extracting into memory, you'll want to use archive_read_data() and allocate the memory in advance. You should be careful to check archive_entry_size_is_set() to make sure the size of the entry is actually known at that time. There are some unusual cases where the size of the entry can only be found out by reading the entire data.

kientzle avatar Aug 31 '24 05:08 kientzle

Ohm nice, I guess then it's just the FILE* helper that's missing (its impl should be very similar to archive_read_data_into_fd). It's good to support it as libc stdio often supports working with custom stream implementations via cookie mechanism: https://www.gnu.org/software/libc/manual/html_node/Streams-and-Cookies.html . One fairly supported impl of this mechanism is fmemopen and open_memstream. E.g. extracting into open_memstream-returned FILE* lets the libc extend the buffer dynamically, which is nice for some simple applications.

I updated this issue name accordingly.

vadimkantorov avatar Aug 31 '24 14:08 vadimkantorov

A Pull Request would be appreciated. (Please include tests when you do.)

kientzle avatar Sep 03 '24 02:09 kientzle

Should a function like this work? (a simple change lseek->fseek and write->fwrite of https://github.com/bramp/libarchive/blob/master/libarchive/archive_read_data_into_fd.c)

Are there any problems with long being unsufficiently short dtype? or is the cast fine?

Regarding the test, should I just copy https://github.com/libarchive/libarchive/blob/915c9f83a4f2fd63aa82ae006b6bd9661f42a7a1/libarchive/test/test_read_data_large.c to change it to use this archive_read_data_into_FILE?

Do I also need to change the copyright / __FBSDID? Or how does this work?

/*-
 * Copyright (c) 2003-2007 Tim Kientzle
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR(S) ``AS IS'' AND ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
 * IN NO EVENT SHALL THE AUTHOR(S) BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
 * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
 * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

#include "archive_platform.h"
__FBSDID("$FreeBSD: src/lib/libarchive/archive_read_data_into_FILE.c,v 1.16 2008/05/23 05:01:29 cperciva Exp $");

#ifdef HAVE_SYS_TYPES_H
#include <sys/types.h>
#endif
#ifdef HAVE_ERRNO_H
#include <errno.h>
#endif
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif

#include "archive.h"
#include "archive_private.h"

/* Maximum amount of data to write at one time. */
#define MAX_WRITE       (1024 * 1024)

/*
 * This implementation minimizes copying of data and is sparse-file aware.
 */
int
archive_read_data_into_FILE(struct archive *a, FILE* f)
{
        int r;
        const void *buff;
        size_t size, bytes_to_write;
        ssize_t bytes_written, total_written;
        off_t offset;
        off_t output_offset;

        __archive_check_magic(a, ARCHIVE_READ_MAGIC, ARCHIVE_STATE_DATA, "archive_read_data_into_FILE");

        total_written = 0;
        output_offset = 0;

        while ((r = archive_read_data_block(a, &buff, &size, &offset)) ==
            ARCHIVE_OK) {
                const char *p = buff;
                if (offset > output_offset) {
                        if(fseek(f,
                              (long)(offset - output_offset), SEEK_CUR) != 0)
                        {
                            archive_set_error(a, errno, "Seek error");
                            return (ARCHIVE_FATAL);
                        }
                        output_offset = (off_t)ftell(f);
                        if (output_offset != offset) {
                                archive_set_error(a, errno, "Seek error: offset does not match the expected value");
                                return (ARCHIVE_FATAL);
                        }
                }
                while (size > 0) {
                        bytes_to_write = size;
                        if (bytes_to_write > MAX_WRITE)
                                bytes_to_write = MAX_WRITE;
                        bytes_written = fwrite(p, 1, bytes_to_write, f);
                        if (bytes_written < 0) {
                                archive_set_error(a, errno, "Write error");
                                return (ARCHIVE_FATAL);
                        }
                        output_offset += bytes_written;
                        total_written += bytes_written;
                        p += bytes_written;
                        size -= bytes_written;
                }
        }

        if (r != ARCHIVE_EOF)
                return (r);
        return (ARCHIVE_OK);
}

vadimkantorov avatar Sep 15 '24 22:09 vadimkantorov

That looks good! Could you put up a PR and we can work through the remaining details.

  • We'll need to add a test case.
  • Standard C requires fseek to return 0 on success. So I think your fseek logic here is wrong.

kientzle avatar Sep 16 '24 02:09 kientzle

  • Standard C requires fseek to return 0 on success. So I think your fseek logic here is wrong.

Thanks! Right! Also, a question on data dtypes. fseek/ftell use long to accept and return offset. Should I use instead fseeko/ftello which accept/return off_t (which can be forced to be 64-bits with -D_FILE_OFFSET_BITS=64)? Or what is the stance of libarchive on this issue? (e.g. maybe we need to add support detection for these functions to configure scripts? and also add a macro redirecting these functions to _fseeki64/_ftelli64 on Windows)

Or maybe could we delete the output_offset calculation for the fseek/fwrite/ftell variant? (as far as I can see it is used only for sanity-checks to control that sparse file writes are working as expected) Then we also do not need the ftell call

vadimkantorov avatar Sep 16 '24 09:09 vadimkantorov

Note: -D_FILE_OFFSET_BITS=64 does not work on all platforms (only Linux), so we can't rely on it in general. So anything that works with off_t does have to be careful about 32-bit/64-bit problems.

I think we can just delete the output_offset calculation. For the fd form, that's somewhat needed because the seek system call can succeed but have no effect. That shouldn't be possible for FILE * operations.

kientzle avatar Sep 16 '24 19:09 kientzle

Recently having gone through a painful audit of glibc/musl/bionic for 32/64 bit off_t and related functions, here are some general tips:

  • we should always set -D_FILE_OFFSET_BITS=64, if we don't already - check if it changes your ABI
    • already set in libarchive, v3 removed any off_t/etc from the API
  • never use the 64 function variants - the C runtime goes through various lengths to preserve ABI, at cost of developer/maintenance sanity
  • if you rely on it being certain size - sprinklestatic_assert(sizeof(foo) == bar) in your code, both for build-checking and self-documentation

evelikov avatar Sep 18 '24 12:09 evelikov

Hi @kientzle @evelikov,

I created a draft:

  • https://github.com/libarchive/libarchive/pull/2504

Could you please take a look to settle the questions remaining.

And if all good, I'll then try to move the new function into a new file, write the docs, tests etc

vadimkantorov avatar Feb 06 '25 23:02 vadimkantorov