
About online index updating

Open justmao945 opened this issue 4 years ago • 4 comments

Dear friends, first of all, thank you all for the great project! This is the most impressive search engine I've found on GitHub. In our case, we will have streaming data that needs to be added to the existing index while queries are being served, but I can't find any description of this feature in the docs. Is there any plan to support it?

justmao945 avatar Dec 09 '20 08:12 justmao945

Hi there,

Unfortunately, we do not currently support online index updates, but this is something we may be willing to support in the future. Depending on the size of the index and the frequency of updates, different approaches may be preferable. Perhaps you could share some more details, and we might be able to determine whether this is likely to be worked on in the future. Of course, we welcome collaborators, so you are welcome to work on this yourself too.

@amallia and @elshize - any thoughts?

JMMackenzie avatar Jan 04 '21 05:01 JMMackenzie

I haven't really given any thought to this feature, and I don't know what approach would be best. But I've been thinking about what the easiest way to support it would be, and here are some thoughts:

As you may know from the documentation, PISA has a rather unique indexing pipeline, split into separate stages: parsing, inverting, and compression. This diagram could help to visualize it. (Btw @JMMackenzie, this image doesn't render in the docs; we should probably fix that.)

I don't see a way to quickly update the index with a single document without serious refactoring and possibly even structural changes (but maybe I'm missing something?). However, I can see us supporting batch updates that are reasonably fast. This is not to say a batch couldn't be a single document, but updating one document won't be much faster than updating a thousand.

Parsing

Parsing is largely independent except for one crucial piece of functionality: documents and terms are assigned IDs at this stage. The way it works now is that parsing is done in batches, which are then merged together, including the ID mappings. During an update, merging the forward index could be made optional: if someone doesn't care about maintaining a forward index, only the mappings would be merged. We would need to make sure the old document IDs stay the same.
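To make the ID-stability requirement concrete, here is a minimal sketch (plain Python, not PISA's actual API; `merge_term_mapping`, `old_terms`, and `batch_terms` are hypothetical names) of merging a new batch's term lexicon into an existing term-to-ID mapping so that all old IDs survive:

```python
def merge_term_mapping(old_terms, batch_terms):
    """Merge a batch's term lexicon into an existing term -> ID mapping.

    Existing terms keep their IDs; previously unseen terms get fresh
    IDs appended at the end, so all old postings remain valid.
    Returns the merged mapping and a remap table from batch-local
    term IDs to merged term IDs.
    """
    merged = dict(old_terms)  # term -> term ID; old IDs stay stable
    remap = {}                # batch-local term ID -> merged term ID
    next_id = len(merged)
    for term, batch_id in batch_terms.items():
        if term in merged:
            remap[batch_id] = merged[term]
        else:
            merged[term] = next_id
            remap[batch_id] = next_id
            next_id += 1
    return merged, remap

# Document IDs are simpler under this scheme: new documents are
# appended after the old collection, so a batch-local doc ID d
# becomes d + old_num_docs.
```

The key invariant is that merging only ever appends IDs, never reassigns them, which is what keeps the existing inverted index valid.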

Inverting

We could use the newly parsed forward index to build a small inverted index and merge it with the old one. I believe we already have the mechanism to do so in our code, since we invert in batches as well.
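Assuming term IDs have already been remapped to the merged lexicon and new documents are appended after the old collection, merging the batch's small inverted index into the old one reduces to concatenating postings. A rough sketch (illustrative Python, not PISA code; all names are hypothetical):

```python
def merge_inverted(old_index, batch_index, old_num_docs):
    """Merge a batch's inverted index into an existing one.

    old_index / batch_index: term ID -> sorted list of doc IDs.
    Batch doc IDs are local to the batch, so they are shifted by
    old_num_docs before appending. Postings stay sorted because
    every shifted ID exceeds every old doc ID.
    """
    merged = {term: list(postings) for term, postings in old_index.items()}
    for term, postings in batch_index.items():
        shifted = [doc + old_num_docs for doc in postings]
        merged.setdefault(term, []).extend(shifted)
    return merged
```

Because old postings are never modified, this is a pure append, which is why batching is cheap while in-place single-document updates are not.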

Compression

The compressed index should probably just be rebuilt from scratch; avoiding that would require significantly more work (I think).

Caveats

Note that the above approach means that you need to keep your uncompressed index and keep merging to it. This might or might not be a problem depending on the size.

This wouldn't be super fast, but it might be acceptable, depending on your update pattern.

Also, note that most of the "merging" I refer to means taking a number of files and producing a new one. The old files can be removed right after, but you still temporarily need roughly twice as much storage as your index occupies.

Advantages

The advantage of such a "hacky" approach is that it would be doable in a reasonable amount of time and without overhauling the entire indexing pipeline. It also shouldn't be too difficult.

elshize avatar Jan 04 '21 13:01 elshize

Hi, is there any progress on this? I just found this awesome project and am looking for some info about this feature.

troycheng avatar Jul 21 '21 08:07 troycheng

Unfortunately, I haven't been able to look into this. I just graduated and started a new job, and have also been busy in my personal life. I can't currently say when I'd have time to work on it, but if someone else took the lead, I'd probably be able to help with review and discussion.

elshize avatar Jul 21 '21 22:07 elshize