borg HashIndex header size not a multiple of 8

As just discussed in IRC: The current HashIndex (on disk!) data structure has a header whose size is not a multiple of 4 or 8. This makes it a bit cumbersome (and on some platforms also probably slower) to use when mmap'd.

I understand that mmap is not used within borg to access the hashindex-based structures at this point in time, so it may not be worth fixing. Yet it seems like a small enough change to squeeze into Borg 2.0 which may avoid some pain further down the road (if borg ever goes to mmap'ing the indices).

Aug 09 '22 19:08 horazont

Notes:

master branch currently needs to support old and new variants of the misc hashindexes and their on-disk formats so borg rcreate and borg transfer can read from old repos via their --other-repo param.
see Repository.open_index about how it uses hashindex_variant() to determine the on-disk variant and then uses the corresponding NSIndex (new) or NSIndex1 (old, borg 1.x) classes.

If we could have some simple / small changes that would achieve the alignment, I guess we should go for it.

Aug 10 '22 12:08 ThomasWaldmann

We'd have to change this struct (from _hashindex.c):

BORG_PACKED(
typedef struct {
    char magic[MAGIC_LEN];
    int32_t num_entries;
    int32_t num_buckets;
    int8_t  key_size;
    int8_t  value_size;
}) HashHeader;

It is currently 8+4+4+2 = 18 bytes long. It would be good to pad it to 20 bytes (multiple of 4) or even better 24 bytes (multiple of 8).

This makes it a bit complex and would likely require a new magic number to detect properly.

That would then also be a splendid opportunity to get rid of the signed numbers, they never made sense to me and signed overflow is dangerous in C.

Aug 10 '22 16:08 horazont

Instead of making the alignment magically using c magic, how about allocating something like 512b or 1kb for arbitrary header data, then making use of sparse

Aug 10 '22 17:08 RonnyPfannschmidt

@RonnyPfannschmidt sparse as in sparse files?

Aug 14 '22 10:08 horazont

he may have meant something like:

HL_MAX = 512  # or even 4096?
hl = len(header)
assert hl <= HL_MAX
with open(...) as f:
    f.write(header)
    f.seek(HL_MAX - hl, SEEK_CURR)
    f.write(anything_else)

Aug 14 '22 11:08 ThomasWaldmann

Ah, right, that's feasible (and equivalent to adapting the C header struct accordingly). Still probably requires a new MAGIC though, doesn't it?

Aug 14 '22 12:08 horazont

Something like that

Aug 14 '22 12:08 RonnyPfannschmidt

Guess BORG2IDX could be the new magic.

Aug 14 '22 17:08 ThomasWaldmann

@horazont in case you want to work on that, now (as b2 is out) would be a good time.

Sep 11 '22 19:09 ThomasWaldmann

Working on this...

Bit of a pain to work with that code:

C code
needs to still be able to read the old hashindex file format,
while also supporting the new file format.
the hash computed while reading the file causes additional problems because it expects all places in the file get read exactly once and in sequential order. I solved this by separately opening the file in the python part of the code and checking for the magic. BORG_IDX means the legacy file format and legacy layout of the hashtable, BORG2IDX means the new file format and the new layout of the hashtable.

Done:

all stuff int32 now (except the magic and reserved)
added a version int32 directly after the magic and set it to 2 (like borg 2). the old header had no version info, but could be denoted as version 1 in case we ever need it (currently it decides based on the magic).
added num_empty as indicated by a code comment, so it does not need a full hashtable scan to determine the amount of empty buckets.
to keep it simpler, I just filled the HashHeader struct with a char reserved[1024 - 32]; - 1024 being the desired overall header size and 32 being the currently used size.

Sep 30 '22 17:09 ThomasWaldmann

@horazont can you review #7064 ?

Sep 30 '22 17:09 ThomasWaldmann