HashIndex header size not a multiple of 8
As just discussed in IRC: The current HashIndex (on disk!) data structure has a header whose size is not a multiple of 4 or 8. This makes it a bit cumbersome (and on some platforms also probably slower) to use when mmap'd.
I understand that mmap is not used within borg to access the hashindex-based structures at this point in time, so it may not be worth fixing. Yet it seems like a small enough change to squeeze into Borg 2.0 which may avoid some pain further down the road (if borg ever goes to mmap'ing the indices).
Notes:
- master branch currently needs to support old and new variants of the misc hashindexes and their on-disk formats so
borg rcreateandborg transfercan read from old repos via their--other-repoparam. - see
Repository.open_indexabout how it useshashindex_variant()to determine the on-disk variant and then uses the correspondingNSIndex(new) orNSIndex1(old, borg 1.x) classes.
If we could have some simple / small changes that would achieve the alignment, I guess we should go for it.
We'd have to change this struct (from _hashindex.c):
BORG_PACKED(
typedef struct {
char magic[MAGIC_LEN];
int32_t num_entries;
int32_t num_buckets;
int8_t key_size;
int8_t value_size;
}) HashHeader;
It is currently 8+4+4+2 = 18 bytes long. It would be good to pad it to 20 bytes (multiple of 4) or even better 24 bytes (multiple of 8).
This makes it a bit complex and would likely require a new magic number to detect properly.
That would then also be a splendid opportunity to get rid of the signed numbers, they never made sense to me and signed overflow is dangerous in C.
Instead of making the alignment magically using c magic, how about allocating something like 512b or 1kb for arbitrary header data, then making use of sparse
@RonnyPfannschmidt sparse as in sparse files?
he may have meant something like:
HL_MAX = 512 # or even 4096?
hl = len(header)
assert hl <= HL_MAX
with open(...) as f:
f.write(header)
f.seek(HL_MAX - hl, SEEK_CURR)
f.write(anything_else)
Ah, right, that's feasible (and equivalent to adapting the C header struct accordingly). Still probably requires a new MAGIC though, doesn't it?
Something like that
Guess BORG2IDX could be the new magic.
@horazont in case you want to work on that, now (as b2 is out) would be a good time.
Working on this...
Bit of a pain to work with that code:
- C code
- needs to still be able to read the old hashindex file format,
- while also supporting the new file format.
- the hash computed while reading the file causes additional problems because it expects all places in the file get read exactly once and in sequential order. I solved this by separately opening the file in the python part of the code and checking for the magic. BORG_IDX means the legacy file format and legacy layout of the hashtable, BORG2IDX means the new file format and the new layout of the hashtable.
Done:
- all stuff int32 now (except the magic and reserved)
- added a version int32 directly after the magic and set it to 2 (like borg 2). the old header had no version info, but could be denoted as version 1 in case we ever need it (currently it decides based on the magic).
- added num_empty as indicated by a code comment, so it does not need a full hashtable scan to determine the amount of empty buckets.
- to keep it simpler, I just filled the HashHeader struct with a
char reserved[1024 - 32];- 1024 being the desired overall header size and 32 being the currently used size.
@horazont can you review #7064 ?