SearchIndex: Rearranging the Index class structure

Open • splitbrain opened this issue 4 years ago • 1 comment

Incremental Improvement for #3556

This is a first step toward restructuring the indexing classes a bit more.

Some background:

We basically have two different kinds of index files:

a) RowIndex (like page.idx)

Each line in the index contains a single value. The line number is used as the primary ID. These files can be very large, so an index like that should never be read into memory completely if it can be avoided.
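To illustrate the access pattern described here, this is a minimal sketch of a line-number lookup that streams the file instead of loading it. It is written in Python for brevity and is not the PHP code from this PR; the function names `retrieve_row`, `find_row` and `make_demo_index` are hypothetical, and zero-based line numbering is an assumption.

```python
import os
import tempfile
from itertools import islice


def retrieve_row(path, rid):
    """Return the value on zero-based line `rid`, or None if the file is shorter."""
    with open(path, encoding="utf-8") as fh:
        # islice reads lazily, so only one line is held in memory at a time
        line = next(islice(fh, rid, rid + 1), None)
    return None if line is None else line.rstrip("\n")


def find_row(path, value):
    """Return the line number (the row's ID) holding `value`, or None."""
    with open(path, encoding="utf-8") as fh:
        for rid, line in enumerate(fh):
            if line.rstrip("\n") == value:
                return rid
    return None


def make_demo_index():
    """Write a tiny stand-in for page.idx and return its path (demo data only)."""
    fd, path = tempfile.mkstemp(suffix=".idx")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write("start\nwiki:syntax\nwiki:welcome\n")
    return path
```

Both helpers touch one line at a time, so memory use stays flat regardless of how large the index grows. The price is a linear scan per lookup.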

b) TupleIndex (like i12.idx)

Each line contains a list of tuples. The files tend to be smaller so loading them completely for search and replace is easier.
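As a rough illustration of what a tuple row involves, the sketch below parses and updates a single line. It assumes each tuple is written as `key*count` and tuples are joined by `:`. That format is an assumption made for this example, the helper names are hypothetical, and the code is Python rather than the PHP used in this PR.

```python
def parse_tuples(line):
    """Parse a row such as '3*2:7*1' into {'3': 2, '7': 1} (assumed format)."""
    result = {}
    for part in line.strip().split(":"):
        if not part:
            continue  # skip empty rows and stray separators
        key, _, count = part.partition("*")
        result[key] = int(count or 0)
    return result


def format_tuples(tuples):
    """Serialize a dict back into a row, dropping entries whose count is zero."""
    return ":".join(f"{key}*{count}" for key, count in tuples.items() if count > 0)


def update_tuple(line, key, count):
    """Return `line` with the tuple for `key` set to `count` (0 removes it)."""
    tuples = parse_tuples(line)
    tuples[str(key)] = count
    return format_tuples(tuples)
```

For example, `update_tuple("3*2:7*1", 7, 4)` yields `"3*2:7*4"`, and setting a count to `0` removes that tuple from the row.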

Since the access patterns are so completely different, I tried to model that in two different classes, basically moving the methods from \dokuwiki\Search\AbstractIndex to the new classes.

While doing so, I tried to make the doc blocks, variable names and interface easier to understand. I also added tests for each of the methods.

The old code has not been touched yet, so these classes currently do not do anything outside of tests.

I also think that it might be useful to have a \dokuwiki\Search\Index\PageIndex inheriting from RowIndex providing a few more page-specific methods.

The next step would be to try removing \dokuwiki\Search\AbstractIndex and to model the Fulltext and Metadata Indexes as Collections.

Dec 04 '21 15:12 splitbrain

After working some more on this, I noticed that the distinction between the two types of index files might not be so clear-cut. The "reverse" pageword.idx index is a TupleIndex by content, but might actually be the largest index we have and should not be loaded into memory.

I think we may need to actually split this into MemoryIndex and FileIndex but have tuple operations available on both. I will update this PR when I have made up my mind ;-)
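One possible shape for that split, sketched in Python rather than PHP and not taken from this PR: a shared base class implements the tuple operations on top of two row primitives, and the two subclasses differ only in how they store rows. The method names and the `key*count` tuple format joined by `:` are assumptions made for this example.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class BaseIndex(ABC):
    """Tuple helpers built on two storage-specific row primitives."""

    def __init__(self, path):
        self.path = path

    @abstractmethod
    def retrieve_row(self, rid):
        """Return the row at zero-based line `rid`, or '' if it does not exist."""

    @abstractmethod
    def change_row(self, rid, value):
        """Store `value` at line `rid`, padding with empty rows if needed."""

    def get_tuples(self, rid):
        parts = (p.partition("*") for p in self.retrieve_row(rid).split(":") if p)
        return {key: int(count or 0) for key, _, count in parts}

    def set_tuple(self, rid, key, count):
        tuples = self.get_tuples(rid)
        tuples[str(key)] = count
        row = ":".join(f"{k}*{c}" for k, c in tuples.items() if c > 0)
        self.change_row(rid, row)


class MemoryIndex(BaseIndex):
    """Loads every row up front; suited to small index files."""

    def __init__(self, path):
        super().__init__(path)
        self.rows = []
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.rows = fh.read().splitlines()

    def retrieve_row(self, rid):
        return self.rows[rid] if rid < len(self.rows) else ""

    def change_row(self, rid, value):
        self.rows.extend([""] * (rid + 1 - len(self.rows)))
        self.rows[rid] = value

    def save(self):
        with open(self.path, "w", encoding="utf-8") as fh:
            fh.writelines(row + "\n" for row in self.rows)


class FileIndex(BaseIndex):
    """Streams the file line by line; suited to large index files."""

    def retrieve_row(self, rid):
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as fh:
                for number, line in enumerate(fh):
                    if number == rid:
                        return line.rstrip("\n")
        return ""

    def change_row(self, rid, value):
        # Copy to a temporary file in the same directory, then swap it in
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        written = 0
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            if os.path.exists(self.path):
                with open(self.path, encoding="utf-8") as fh:
                    for line in fh:
                        row = value if written == rid else line.rstrip("\n")
                        out.write(row + "\n")
                        written += 1
            while written <= rid:  # pad up to a row beyond the end of the file
                out.write((value if written == rid else "") + "\n")
                written += 1
        os.replace(tmp, self.path)
```

With this layout, an index like pageword.idx could use the streaming class while still getting the same `get_tuples` and `set_tuple` calls as the small word indexes. The trade-off is that every `FileIndex.change_row` rewrites the whole file, so batching changes would matter in practice.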

Dec 04 '21 19:12 splitbrain