lunr.js Adding/Updating/Removing documents from a built Index

I'm a little annoyed at the assumption that lunr is only being used for static content, as we have quite the opposite situation. We're using lunr to provide search on a list that might grow, shrink, or have it's items modified at any time.

I was looking forward to updating to version 2 but this change completely breaks our use case. My current workaround is to completely rebuild the entire index on every change. This is would be incredibly inefficient.

I understand the benefits of immutability, but my feeling is that it could have been implemented internally without breaking the API.

Is there any way they could be added back to v2?

Jul 11 '17 13:07 ncphillips

I'm sorry that the move to immutable indexes has annoyed you.

I was looking forward to updating to version 2

Out of interest, which features added in version 2 were you looking forward to using?

Lunr 1.x is still available and supported, i.e. bugs will be fixed if they are reported. I'm not planning on adding any major features there, mostly because that architecture has been pushed as far as possible from what I can see, though pull requests will be accepted.

I understand the benefits of immutability, but my feeling is that it could have been implemented internally without breaking the API.

By making indexes immutable the internal data structures that Lunr uses were able to be made significantly more space efficient, also indexes no longer need to store as much data. Specifically lunr.TokenStore is much smaller than the trie used in Lunr 1.x. We also no longer need to store all the tokens for each document. In turn this allowed adding features like more advanced wildcard matching, field search and query time boosts.

Is there any way they could be added back to v2?

Not without increasing the size of the index and the complexity of the implementation.

I'm aware that these changes came at a cost of being able to update an index, but, on balance I still think it was the right decision.

Jul 12 '17 16:07 olivernn

Very interesting. I might have to do some benchmarking to see what the performance effect would be for our use case. Primarily I'm wondering if the cost of having to completely re-build the index every time there is an update to the dataset would offset the benefit of the improvements around size and speed.

On that note, if you are planning to continue supporting 1.x then I am not quite as worried about upgrading to 2.x

Jul 13 '17 02:07 ncphillips

Do the benchmarking, having data to base the decision on will help. I would imagine that re-building the entire index every time there is new data to add would be quite expensive though. If you can arrange to do that in a web worker it might not be too bad, but its still not ideal.

Yes, 1.x will be receiving bug fixes for the foreseeable future.

I'm going to close this now, because, in the short term anyway, Lunr 2.x is not going to allow mutating indexes. Feel free to re-open if there is something else you want to discuss.

Jul 13 '17 18:07 olivernn

As per the conversation on Gitter, please re-open this issue.

Sep 03 '17 19:09 abastardi

You mention a smaller index, field search, and query time boosting as some advantages of the new architecture. Elasticlunr.js claims the same benefits, though it has a mutable index. What tradeoffs has Elasticlunr made to achieve this?

Sep 03 '17 19:09 abastardi

Quoting @abastardi from the Gitter channel:

I haven't reviewed the code in detail, but it looks like after all the documents are added, the build() method of Builder is called to generate the index. Would it be possible to rebuild the index with a new document without adding all the documents again -- by adding just a single new document via the add() method and then re-running the build() method? If so, would that be a lot more efficient than a complete rebuild from scratch, or is most of the work in the build() method rather than in adding all the documents? Finally, are there docs for v1 somewhere?

Sep 04 '17 19:09 olivernn

@abastardi re Elasticlunr, I don't know, I'm not familiar with development on that library, though specifically regarding index size, this claim was probably made against pre 2.x versions of Lunr.

Would it be possible to rebuild the index with a new document without adding all the documents again

This is what I was sort of talking about with merging indexes, i.e. when adding a new document(s) you create a new index, and then somehow merge them together. Updating could work in a similar way, not sure about removing.

If you look at an instance of lunr.Index you will see the three main data structures, fieldVectors, invertedIndex and tokenSet.

Of these the invertedIndex is the easiest to update, its just a map from token to document reference, plus some metadata.

fieldVectors is a map of 'field/document_ref' to an instance of lunr.Vector. At first glance this might appear easy to just mutate (insert, delete etc), but the dimensions of the vectors are calculated using the inverse document frequency, that is, the inverse of the number of documents that contain the term for that dimension. As soon as you change the number of documents that contain a term it will effectively invalidate all vectors for document fields that contain that term.

Finally there is the tokenSet, this is just a space efficient trie that contains the set of all tokens. In order to keep the size of this data structure small it is built in a way that compresses prefixes and suffixes of strings, to do this tokens must be inserted in order. I think this data structure is probably the hardest to update, and maybe the most expensive to build :(

So, thats probably a long way of saying "its complicated". As mentioned before, I'm not against having the index be mutable, I just don't want it to happen at the expense of index size / search performance / available search features.

Sep 04 '17 19:09 olivernn

@olivernn Thanks for the breakdown. Maybe a start would be to at least allow updates to the invertedIndex -- so if there are a large number of documents, changing a single document doesn't require recreating the entire invertedIndex.

Regarding the tokenSet, I wonder if it would be feasible to have an option for a non-compressed version that would be easier to update (at the cost of taking extra space). That way you can still have the benefit of space efficiency when mutability is not needed but also cover use cases that require mutability.

Sep 04 '17 21:09 abastardi

@olivernn Hi there, and thanks for all your hard work on lunr.js! I wanted to chime in with my use case for mutable indexes - I wrote a TiddlyWiki plugin to enable full text search a while ago (https://github.com/hoelzro/tw-full-text-search/) using lunr.js 1.0.0. I use a strategy of updating the index every time a document changes - building a fresh index on my personal with several thousand documents takes a few minutes otherwise. I would like to forward port my plugin to lunr.js 2.0 so that I can use the newer pipeline metadata feature, but it sounds like I'm stuck on 1.0.0 for the time being. I understand that you had to make a difficult decision about which tradeoffs lunr.js needed to make, but I wanted to provide a real-world use case in case you're keeping track of these for future development.

Nov 10 '17 20:11 hoelzro

I just want to say, this is strange that you no longer can update current index in v2. There are so many use cases where you want to do that, especially on server with index of your app documents.

I needed a query by field functionality but it is only available in v2, so after after upgrade to v2 I was a very surprise that you no longer can use update method.

Nov 20 '17 18:11 kurdin

If anyone's interested, I started a PR (https://github.com/olivernn/lunr.js/pull/315) adding a lunr.MutableBuilder and lunr.MutableIndex type for, well, mutable indexes. I'm going to be using them for my plugin, and feedback on the work would be helpful!

Nov 20 '17 20:11 hoelzro

@ncphillips could you share your method to rebuild the index?

I have a hundred documents (Italian and English) and each time the user requests a full text search I rebuild the index if the set of documents has changed. The only small problem is that on subsequent requests I receive the warning: Overwriting existing registered function: lunr-multi-trimmer-en-it. Seems I forgot to delete (or preserve) something across the index rebuild. Here is the code:

"use strict";

const lunr = require("lunr");
require("lunr-languages/lunr.stemmer.support")(lunr);
require("lunr-languages/lunr.multi")(lunr);
require("lunr-languages/lunr.it")(lunr);

// > Access to the index for full text search
let fullTextIndex;

// > Generate the index for full text search
exports.regenerateFullTextIndex = function(documents) {
	fullTextIndex = lunr(function() {

		this.use(lunr.multiLanguage("en", "it"));

		this.ref("id");
		this.field("title");
		this.field("body");

		const len = documents.length;
		for(let i = 0; i < len; ++i) {
			this.add({id: documents[i].id, title: documents[i].title, body: documents[i].text});
		}
	});
};

The calling function is very simple (hasFileListChanged is set true every time I modify the set of documents:

if(hasFileListChanged) {
        search.regenerateFullTextIndex(documents);
	hasFileListChanged = false;
}
// Do the full text search

Thanks for the help!

Apr 18 '18 10:04 crystalfp

@crystalfp as of now, we have opted to not upgrade lunr.js to version 2, so we just add/remove individual indices in response to webhook events

Apr 18 '18 15:04 ncphillips

@olivernn

[...] 1.x will be receiving bug fixes for the foreseeable future.

By making indexes immutable the internal data structures that Lunr uses were able to be made significantly more space efficient, also indexes no longer need to store as much data.

I'm aware that these changes came at a cost of being able to update an index, but, on balance I still think it was the right decision.

I disagree, from an architectural point of view.

Space efficiency vs. mutability: Lets say I want to integrate a search feature into a desktop application that is based on Electron. This application is designed to work on it's own, that means: there is no server backend. The size of managed data will grow over time. As a result the index will grow as well. Due to the business and usability requirements of that application, the user will expect that a search result is always based on the latest data, even right after a data change. That means I need to update the search index after each change as fast as possible. The expected data change rate is tricky. During activity peaks we can expect a lot of small changes. After that we can also expect that nothing will be changed for a while. So, for the dry periods an immutable index will be acceptable, but for the work periods, I need a flexible and mutable index.

Every data change will force me to rebuild the entire index because of the immutability. Since the data will grow, this is going to slow down over time. From my perspective this is a good way to waste a lot of CPU time, for no good reason. But who cares, it's just the client's CPU, right?! Even a pull request like #315 is just a cosmetic/convenient API extension: "The index is completely rebuilt when a document is added/updated/removed". This API extension provides an easy way to "update" an index, but in reality the index is completely rebuilt. I won't gain anything from it.

In this particular case (I would also say: a real-world-scenario), mutability is more important than space efficiency. I literally don't care that the index is 20% bigger, if I can change the index efficiently.
v1 vs. v2: You suggest to use version 1 in cases when mutability is required. But this is not really an option. It's a dead end. Even if you promise to add bugfixes to that version (which is already more than 1 year old), that particular version has no future. That means, there won't be performance improvements or feature enhancements. I won't be able to enhance the search feature of my application over time, because the feature set of the underlying index library is fixed. As a result the search feature of my application would be limited by design. I can't choose a library/framework with no future for an application with an estimated lifetime of 5 years. This would be a technology risk right from the start. Even if I assume that there will be alternatives in the future, that's just hope as a tactic.

Sadly, there are no alternatives. lunr.js seems to be the only JavaScript based index library that is in active development. That means, you're holding all the strings here, just sayin' ;)

Jun 01 '18 09:06 ghost

@taxidriver61, these projects are not highly active, but you might also consider js-search, elasticlunr, ndx, and search-index (some require you to add your own stemmer).

Jun 01 '18 21:06 abastardi

@taxidriver61 @abastardi You might also want to check out my lunr extension, lunr-mutable-indexes, which adds mutable index support to 2.x!

Jun 02 '18 15:06 hoelzro

That means, you're holding all the strings here

That means, there won't be performance improvements or feature enhancements

Well, its open source with a permissive license, so...

@taxidriver61 I don't mean to trivialise your comment, clearly rebuilding the index each time a single document is modified isn't as efficient as being able to just (re)index the modified document.

IIRC one of the large changes between v1 and v2 is that v2 pre-computes the vector space representation of the documents. The weights in that vector space are calculated based on the entire collection of documents in the index. Specifically the IDF and the average field length. Any additions/removals would have to update vectors based on these. Getting the updated IDF is probably doable after the index is built, but the average field lengths currently aren't available in the index. They could be added to the index, but then you'd have to figure out how to incrementally update them based on adding/removing documents.

Additionally the token set would need to be maintained, currently it doesn't expose any methods for updating the set. The data structure is reasonably complex, adding mutability would be involved but doable. If all else fails multiple immutable token sets could be kept and combined at search time.

Finally, the invertedIndex is probably the easiest part to add mutability support to, its just a JavaScript object of objects pretending to be a Map.

Jun 02 '18 21:06 olivernn

@abastardi @ncphillips @crystalfp if you are ok with a smaller feature set, you can also check out my library https://github.com/lucaong/minisearch

I fully understand the Lunr design decision, it's a matter of trade-offs. Lunr offers an impressive range of features, and it's reasonable to use an immutable index to keep complexity manageable. MiniSearch chooses a different trade-off, is focused on small index size and flexible API, and can afford a mutable index at the cost of smaller feature set.

Oct 04 '18 21:10 lucaong

@lucaong, minisearch looks very useful. Thanks for mentioning it.

Oct 05 '18 01:10 abastardi