Split Indexes
(If this is a dup of #29, my apologies, but I think it's a different goal...)
Is it possible in theory to break up a lunr.js index into separate .js files loaded on demand rather than having to load the entire index to do the search? I have some very large static websites (Wikipedia For Schools, distributed in the developing world) that could benefit from lunr.js, but which would need an inordinately large index, and thus an unacceptable delay, when loading the search page. I was thinking that if the index was split into parts -- maybe by first letter -- the parts could be selectively loaded depending on the search term, speeding up individual searches. Does this sound feasible? If so I may poke around and see if I can figure it out.
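For illustration, the kind of on-demand loading I have in mind might look something like this rough sketch (the per-letter file names and the helper function are made up; only lunr.Index.load and search come from lunr):

// Hypothetical sketch: pick an index chunk by the first letter of the search
// term and load only that file before searching.
function searchInChunk(term, callback) {
  var letter = term.charAt(0).toLowerCase();
  var url = 'search/index-' + letter + '.json'; // e.g. index-a.json, index-b.json, ...

  var xhr = new XMLHttpRequest();
  xhr.open('GET', url);
  xhr.onload = function () {
    // lunr.Index.load rebuilds an index from its serialised (JSON) form.
    var idx = lunr.Index.load(JSON.parse(xhr.responseText));
    callback(idx.search(term));
  };
  xhr.send();
}

searchInChunk('wikipedia', function (results) {
  console.log(results);
});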
Also, much thanks for lunr.js, I'm already using it on our static Khan Academy distribution: http://rachel.worldpossible.org/ka/ - it's wonderful!
It sounds like you're suggesting progressively loading an index from several dumps, or splitting a serialised index into smaller chunks.
This isn't something I've considered before; I think it raises similar issues to merging indexes together. Whilst I'm quite interested in this feature, I haven't spent a huge amount of time thinking about how best to implement it. From what I've heard, simply merging serialised indexes, especially large ones, can be quite slow - http://stackoverflow.com/questions/22528104/can-fullproof-update-its-full-text-index-without-a-complete-re-index/22556707?noredirect=1#comment34609015_22556707
As mentioned in that comment there probably needs to be a smarter way of merging the serialised index for this approach to work at any reasonable scale. Any solution is also very dependent on the data structures that lunr uses for its index, and I'm actually playing around with different implementations for the token store at the moment. Nothing is set in stone yet and I don't want to break any solution that you come up with.
Perhaps a simpler approach is to have multiple indexes, or shards, each containing a subset of the documents you need to search. You could load each shard as required, and even dispose of shards when they are no longer necessary. You would then have to make sure that each query you issue is sent to all the relevant indexes. There might be some degradation in search result accuracy/relevancy, as certain aspects of the scoring, specifically the IDF calculation, currently assume access to the whole corpus. You would have to test how this affects search result accuracy in practice.
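As a rough sketch of that multi-shard approach (the shard variables and the merging step here are purely illustrative, and as noted the scores are only approximately comparable because each shard computes IDF from its own subset of documents):

// Query every loaded shard and merge the results by score.
function searchShards(shards, term) {
  var merged = [];

  shards.forEach(function (idx) {
    idx.search(term).forEach(function (result) {
      merged.push(result); // each result is { ref: ..., score: ... }
    });
  });

  return merged.sort(function (a, b) {
    return b.score - a.score;
  });
}

// shards would be an array of lunr index instances, e.g. one per subset of documents.
var results = searchShards([shardA, shardB], 'education');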
That is a very long-winded way of saying "I don't know", which probably isn't a huge amount of help to you! Perhaps give multiple index instances a try; I don't think it should be too much effort to implement. Do keep me updated on your progress with this and I'll definitely consider ways to make index merging more efficient.
As an aside, your project sounds very interesting and like it is having a real positive impact on learning in developing parts of the world - kudos to you! I'm glad lunr is a small part of helping you achieve your goals.
Thanks very much for the reply. I am not sure of my terminology, but your description of index shards sounds like what I was thinking. I'll poke around when I get a chance and let you know if I come up with anything useful.
Indeed, lunr.js has been a great help. Thanks again.
We have this issue with lunr.js currently. I wasn't expecting such a large index file (yet) with fewer than 200 articles being indexed, but it is over 10MB, which actually makes it faster to build the index at runtime when using a computer. I will need to split the files, but I was planning to load them all once the page is created, asynchronously, so it is faster. The one index per letter is an interesting idea.
I am thinking of a way to have a small index and a large one: the small index searches articles from the last x days, with something to indicate that you want to search further back using the bigger one.
We have sites with thousands of articles being merged over into this new system which uses lunr.js, but if I can't find a solution to these large index files we will have to look elsewhere - and I don't want to, because lunr.js fits our needs best apart from this one issue.
So with that said, any other ideas? I'm going to strip html to save some space (I probably should be doing this anyway?).
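For what it's worth, a naive sketch of the stripping step (a regex approach; a proper HTML parser would be more robust, and the article and idx objects here are placeholders):

// Strip tags and collapse whitespace before handing the text to lunr.
function stripHtml(html) {
  return html.replace(/<[^>]*>/g, ' ')  // drop tags
             .replace(/\s+/g, ' ')      // collapse whitespace
             .trim();
}

idx.add({
  id: article.id,
  title: article.title,
  body: stripHtml(article.body)
});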
Okay, I just gzipped it and dropped the size to less than 500KB, which is fine. So it will probably be fine with the one file for a little while yet.
Can I ask how you implemented gzip? I was not able to work with gzip files when serving static content from a local filesystem. Is there a trick I don't know about?
(Whoops, didn't mean to close)
@robclancy I'm interested too!
Well, I am using Middleman, so I just use their extension, and during the build I call a Node command to run a script which builds the index. All this happens before the Middleman extension goes through and gzips everything.
However, I assume your question is more about how to get the server to use the gzip file. Basically you just have a .gz version beside your normal file, so in my case I have search_index.json and search_index.json.gz. Now I just have to tell the server to use this file. Usually a server will be set up to do some gzipping itself, but it will use a low compression level and skip large files, as it has to be able to gzip really fast - otherwise the size benefit is negated by the time the server takes to compress the response. However, when you have static content you can manually gzip everything (or have a build script do it) and just tell the server to send the gzipped version with the response. I am currently using nginx, so to do this I just have:
gzip off;
gzip_static on;
gzip_disable "msie6";
... other gzip settings unrelated ...
So gzip off turns off the runtime gzipping, gzip_static tells it to look for a .gz version of a file, and gzip_disable tells it not to bother with any of it for IE6.
Since you are most likely using Node to build your index, you could use Grunt or Gulp to build the index and then gzip it whenever your files are updated in some way.
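For example, a build step along these lines could write both the plain and gzipped files (the file names and documents source are placeholders; this uses Node's built-in zlib and the lunr 0.x style of adding documents after construction):

// Build the index, then write both the plain and gzipped versions so the
// server can serve the .gz file via gzip_static.
var fs = require('fs');
var zlib = require('zlib');
var lunr = require('lunr');

var documents = JSON.parse(fs.readFileSync('documents.json', 'utf8')); // placeholder source

var idx = lunr(function () {
  this.ref('id');
  this.field('title');
  this.field('body');
});

documents.forEach(function (doc) {
  idx.add(doc);
});

var serialised = JSON.stringify(idx);

fs.writeFileSync('search_index.json', serialised);
fs.writeFileSync('search_index.json.gz', zlib.gzipSync(serialised));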
Just google how to serve gzipped files with {insert http server here}.
Thanks for the detailed reply. I was actually wondering about the case of locally stored static files accessed through the file system without any type of server software. I am not sure it's possible to use gzip in those cases - but I will keep poking around. Cheers!
Oh, I forgot that part of your question. The way I handle it is that the page looks for the search_index.json file and, if found, loads the index; if not, it builds the index there and then. So during development my search page builds the index at runtime, which usually takes around 2 seconds (with my timeouts to stop the page from freezing). When I build out the website the index is created, so it exists in live environments and the page doesn't have to build the index at runtime.
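Roughly, the check looks like this sketch (the URL and the buildIndexFromDocuments fallback are placeholders for my actual code):

// Try the pre-built index first, fall back to building it at runtime.
var xhr = new XMLHttpRequest();
xhr.open('GET', 'search_index.json');
xhr.onload = function () {
  if (xhr.status === 200) {
    // Pre-built index found: load the serialised form directly.
    window.searchIndex = lunr.Index.load(JSON.parse(xhr.responseText));
  } else {
    // No pre-built index (e.g. during development): build it in the page.
    window.searchIndex = buildIndexFromDocuments();
  }
};
xhr.onerror = function () {
  window.searchIndex = buildIndexFromDocuments();
};
xhr.send();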
I think that gzip/deflate compression when loading scripts etc. in the browser is controlled by HTTP, so it isn't used when loading static files without HTTP - probably worth a test though.
You could do the gzip decompression in JavaScript. I'm not sure how well it would perform (probably badly), but I guess the trade-off between index size and time to decompress is something you have to figure out for your use case.
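If anyone wants to experiment with that, a rough sketch using a gzip library such as pako might look like the following (untested, and it still assumes the .gz bytes can be fetched at all, which is the sticking point with file:// URLs):

// Fetch the gzipped index as binary, decompress it client-side with pako,
// then load it into lunr as usual.
var xhr = new XMLHttpRequest();
xhr.open('GET', 'search_index.json.gz');
xhr.responseType = 'arraybuffer';
xhr.onload = function () {
  var json = pako.ungzip(new Uint8Array(xhr.response), { to: 'string' });
  var idx = lunr.Index.load(JSON.parse(json));
  // ... run searches against idx ...
};
xhr.send();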
The server sends gzip over HTTP. If the browser supports it, it will download the gzipped version and then decompress it, which is faster than downloading the uncompressed file. I don't see a use case where you would be loading the file without HTTP, and if you were, you could simply do the same thing as the browser does and fetch the smaller version.
Not sure what you are getting at tbh. This works.
Thanks for the feedback. Yes, my use case is not common - I was trying to serve the files without http through the file:// protocol. This was for people in remote areas who are not connected to the internet, and I am installing a large set of browsable offline materials. The other option is to run an http server locally, and it seems that is the way I'll have to go. It makes setup slightly more complicated and slightly more brittle (most of these people don't have access to computer support for a year at a time), but I think in the end that's what will work. Thanks again.
It sounds like a custom solution so just send the gzipped version in the first place?
This is an issue with lunr.js imo. Gzipping helps me for now, but when I move one of the large sites (a LOT of posts) over, this problem will be back and I will need to do something about it. Being able to use chunks would be nice.
Actually the issue I was having wasn't the size during delivery (we do gzip), it's the speed of loading the search index for the end user at run time. I doubt gzipping will help much in that case. I am still more interested in splitting the indexes somehow so that lunr.js doesn't have to load/parse the whole file (over 1MB in my case) to perform each search. My follow-up question on gzip was mostly unrelated to the original question - your comment got me thinking and I was just wondering if it was possible to browse local files in the .html.gz format as that would save lots of space on the end user's machine post install. It's all good food for thought, though!
This is another thing that will become a performance issue for me as well then. I'm starting to think running node and doing fast little http requests will be better for me.
@robclancy slightly off-topic, but for that use case, you could also do an optional single-file python web server. Even just a one-liner in the web directory.
python -m SimpleHTTPServer 8080
Of course, Windows would be more complicated. Of course ;). But a start.cmd could probably do it.
@needlestack, did you ever figure out a good way to "merge" the indexes? The problem I run into is maintaining a single, consistent score.
I'm also looking into how to shard a lunr index; ideally so that only a small % of the total number of shards ever needs to be used for a given (limited) query.
Has anyone made any progress with this; in thought or code?