sist2 icon indicating copy to clipboard operation
sist2 copied to clipboard

Additive vs. Subtractive incremental lmdb building?

Open yatli opened this issue 2 years ago • 3 comments

When copy_table is far larger than new_table, we may as well first file copy the store and then add/remove entries from it. The db file copy should be done before scanner kicks off, so that the parse process will write/overwrite entries.

yatli avatar Jan 20 '22 19:01 yatli

Current incremental behaviour:

  1. Start with an empty store
  2. During parse, only consider new/updated files and write to the store
  3. for each file in the original index that was marked for copy, load from the original store and copy it to the new store (could be a little slow)

Desired behaviour:

  1. Copy the entire original store (this is really fast/efficient)
  2. During parse, only consider new/updated files and write (or overwrite!) to the store
  3. For each file in the original index that was marked for deletion, remove from the new store

Is that correct?

One thing I'm worried about is that LMDB will not re-compact the DB, and that deleted entries will just be zeroed out without reclaiming any new space. So if there are many deleted files, the new store will not be smaller than the original store. There might be a way to ask LMDB to force de-fragmentation but I think that this is functionally equivalent to re-writing every key, which means that we're back to square one in terms of performance.

simon987 avatar Mar 05 '22 16:03 simon987

Yes exactly this.

FYI: https://www.openldap.org/lists/openldap-technical/201407/msg00134.html (note, this message is from 2014)

Fragmentation is a problem for larger-than-4k entries, but fine for smaller ones (probably automatically defrag'ed)

But this problem may be irrelevant if we're talking about repeatedly creating new db files (rather than bending a single mdb instance):

http://www.lmdb.tech/doc/group__mdb.html#ga3bf50d7793b36aaddf6b481a44e24244

During env copy we can do compaction and despite performance lose (vs. a plain copy) it's still much faster than inserting the keys one by one to a new store -- no need to sort the keys, no need for internal maintenance etc.

yatli avatar Mar 06 '22 11:03 yatli

Okay, got it. That's really useful information, thank you

simon987 avatar Mar 06 '22 14:03 simon987