geneweb icon indicating copy to clipboard operation
geneweb copied to clipboard

Legacy format of indexes

Open Halbaroth opened this issue 8 months ago • 0 comments

The format used for indexes has been changed in commit https://github.com/geneweb/geneweb/commit/bc07bc1a.

Previously, a custom version of AVL trees from the standard library was employed for indexes. The module is named btree, but they are not B-trees at all :/ The new format uses arrays and binary search in files.

The justification of this change was:

The result is an implementation performing the same
when browsing, but building the surname and first name indexes
for 2000000+ persons takes less than 8s while btree
implementation takes about 30s.

When designing data structures for indexes, we seek a balance between:

  • Fast lookup operations.
  • The amount of memory consumed by the index in main memory.

There are two extreme scenarios:

  • Small indexes: they can be loaded entirely in memory and, in such cases, hash tables often provide excellent performances.
  • Large indexes: They must reside on disk, and hash tables or balanced binary trees are not suitable for fast lookups. This is because the primary bottleneck becomes the time to load data from disk to main memory. B-trees (and their variants, like B+-trees) are the most common data structures used in this scenario.

Based on the portion of the Roglo databases I have, the size of the largest index is about 230 Mo. While it was a large amount of memory twenty years ago, it is now relatively insignifiant.

I suggest the following:

  1. For in-memory databases, all the indexes should also be loaded into main memory.
  2. The old format with fake B-trees should be deprecated in the next release. Users should convert the old format into the new one.
  3. A new format with real B+-trees should be implemented.

Halbaroth avatar Apr 22 '25 09:04 Halbaroth