About online index updating
Dear friends, first of all, thank you all for this great project! This is the fanciest search engine I've found on GitHub! In our case, we will have streaming data that needs to be added to the existing index while queries are being served, but I can't find any description of this feature in the docs. Is there any plan to support it?
Hi there,
Unfortunately, we do not currently support online index updates, but this is something we may be willing to support in the future. Depending on the size of the index and the frequency of the updates, different approaches might work best. Perhaps you could share some more details, and we can try to determine whether this is likely to be worked on. Of course, we welcome collaborators, so you are also welcome to work on this yourself.
@amallia and @elshize - any thoughts?
I haven't really given this feature much thought, and I don't know which approach would be best. But I've been thinking about what the easiest way to support it would be, and here are some thoughts:
As you may know from the documentation, PISA has a rather unique indexing pipeline, split into distinct stages: parsing, inverting, and compression. This diagram should help visualize it. (By the way, @JMMackenzie, this image doesn't render in the docs; we should probably fix that.)
I don't see a way to quickly update the index with a single document without serious refactoring and possibly even structural changes (but maybe I'm missing something?). However, I can see us supporting updates in batches that are reasonably fast. This is not to say a batch couldn't be a single document, but it wouldn't be much faster than a thousand documents.
Parsing
Parsing is largely independent except for one crucial piece of functionality: documents and terms are assigned IDs at this stage. The way it works now is that parsing is done in batches, which are then merged together, including the ID mappings. During an update, merging the forward index could be optional: if someone doesn't care about maintaining the forward index, only the mappings would be merged. We would need to make sure that the old document IDs stay the same.
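To make the idea concrete, here is a minimal sketch of merging a batch's ID mappings into the existing ones so that old IDs never change. This is not PISA's actual API or on-disk format; the types and function names are hypothetical and only illustrate the approach.

```cpp
// Hypothetical sketch: merge a newly parsed batch's document and term ID
// mappings into the existing ones, keeping all old IDs stable.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct IdMappings {
    std::vector<std::string> documents;                    // document ID -> title
    std::unordered_map<std::string, std::uint32_t> terms;  // term -> term ID
};

// Appends the batch's documents and terms to the existing mappings. Returns,
// for each term ID used inside the batch, the ID it maps to in the merged index.
// Document IDs in the batch's postings would likewise need to be shifted by the
// old document count, since new documents continue the existing sequence.
std::vector<std::uint32_t> merge_mappings(IdMappings& index, IdMappings const& batch)
{
    // Old document IDs are never touched; new documents get the next free IDs.
    for (auto const& title : batch.documents) {
        index.documents.push_back(title);
    }
    // Terms already present keep their old IDs; unseen terms get fresh ones.
    std::vector<std::uint32_t> remap(batch.terms.size());
    for (auto const& [term, batch_id] : batch.terms) {
        auto it = index.terms
                      .try_emplace(term, static_cast<std::uint32_t>(index.terms.size()))
                      .first;
        remap[batch_id] = it->second;
    }
    return remap;
}
```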
Inverting
We could use the newly parsed forward index to build a small inverted index, and merge it with the old one. I believe we already have the mechanism to do so in our code, since we invert in batches as well.
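As a rough illustration of why this merge is cheap, here is a sketch assuming a simple uncompressed in-memory representation rather than PISA's on-disk formats: because every document in the batch gets an ID larger than any existing one, merging mostly amounts to concatenating posting lists.

```cpp
// Hypothetical sketch: merge a small inverted batch into the existing
// uncompressed inverted index. Batch document IDs are assumed to be already
// remapped and strictly larger than any existing ID.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Posting {
    std::uint32_t docid;
    std::uint32_t frequency;
};

using InvertedIndex = std::vector<std::vector<Posting>>;  // term ID -> posting list

void merge_inverted(InvertedIndex& index, InvertedIndex const& batch)
{
    if (batch.size() > index.size()) {
        index.resize(batch.size());  // make room for terms first seen in the batch
    }
    for (std::size_t term = 0; term < batch.size(); ++term) {
        // Appending keeps each posting list sorted by document ID, because all
        // batch IDs come after the existing ones.
        index[term].insert(index[term].end(), batch[term].begin(), batch[term].end());
    }
}
```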
Compression
The compressed index should probably be rebuilt from scratch; anything else would require significantly more work (I think).
Caveats
Note that the above approach means you would need to keep your uncompressed index around and keep merging into it. This might or might not be a problem depending on its size.
This wouldn't be super fast, but it might still be acceptable, depending on your update pattern.
Also, note that most of the "merging" I refer to means taking a number of files and producing a new one. The old files could be removed right after, but you would still temporarily need roughly twice as much storage as your index takes.
Advantages
The advantage of such a "hacky" approach is that it would be doable in a reasonable amount of time and without overhauling the entire indexing pipeline. It also shouldn't be too difficult.
Hi, is there any progress on this? I just found this awesome project and I'm looking for some info about this feature.
Unfortunately, I haven't been able to look into that. I just graduated and started a new job, and have been busy with personal life as well. Right now I can't tell when I'd have time to work on it, but if someone else took the lead, I'd probably be able to help with review and discussion.