elasticlunr.js
Searching partial words
Hello, I'm not sure how to configure elasticlunr (if it's even possible) to search for partial words in the provided documents.
For example I have:
function createSearchIndex(index) {
  var reverseIndex = elasticlunr(function() {
    this.addField("name");
    this.addField("url");
    this.addField("topics");
    this.addField("description");
    this.setRef("id");
    this.saveDocument(false);
  });
  index.forEach(function(value, id) {
    if (value.verified) {
      reverseIndex.addDoc({
        name: value.name,
        url: value.url,
        topics: value.topics,
        description: value.description,
        id: id
      });
    }
  });
  return reverseIndex;
}
My search function is:
function(phrase, index, reverseIndex) {
  return new Promise(function(resolve) {
    resolve(reverseIndex.search(phrase, {
      fields: {
        name: { boost: 3 },
        url: { boost: 2 },
        topics: { boost: 3 },
        description: { boost: 1 }
      }
    }));
  }).then(function(results) {
    return results.map(function(result) {
      var entry = index[result.ref];
      return {
        url: entry.url,
        name: entry.name,
        topics: entry.topics,
        description: entry.description,
        popularity: entry.popularity
      };
    });
  });
};
example index content:
[
  {
    "url": "https://gitlab.com/rili/service/calltrace.git",
    "name": "rili/service/calltrace",
    "topics": ["c++", "rili", "service"],
    "description": "Allow to collect stacktraces/calltraces/backtraces",
    "popularity": 1,
    "verified": true,
    "invalid": false
  },
  {
    "url": "https://gitlab.com/rili/compatibility.git",
    "name": "rili/compatibility",
    "topics": ["c++", "rili"],
    "description": "Rili macros, type definitions etc. which help to care about compatibility between language standards and compiler versions.",
    "popularity": 1,
    "verified": true,
    "invalid": false
  }
]
I would like to use the phrase "stacktrace" to find "https://gitlab.com/rili/service/calltrace.git", or the phrase "rili" to find both "https://gitlab.com/rili/service/calltrace.git" and "https://gitlab.com/rili/compatibility.git".
How can I achieve this with elasticlunr?
Try using token expansion: in your search() call arguments, include expand: true.
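For reference, expand widens each query term to the indexed tokens that begin with it (a prefix walk of the index). Here is a minimal stand-alone sketch of that matching rule; expandTerm and the token list are illustrative, not elasticlunr internals:

```javascript
// expand: true matches any indexed token that *starts with* the query term.
function expandTerm(term, tokens) {
  return tokens.filter(function (token) {
    return token.indexOf(term) === 0; // prefix match only
  });
}

var tokens = ["lorem", "ipsum", "stacktraces", "calltraces"];
expandTerm("stack", tokens); // ["stacktraces"]
expandTerm("orem", tokens);  // [] because "orem" is not a prefix of any token
```

In your case that would be reverseIndex.search(phrase, { expand: true, fields: { ... } }), keeping your existing field boosts.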
For me that's not working. Let's say I have a title "Lorem ipsum"; searching for "orem" doesn't give any results.
@wintercounter @paambaati It would be great to have this supported. Currently, I'm in the same boat with this issue. Another example is URLs: when you have www.google.com, searching for "google" doesn't return any result. The search has to be "www.g" for the result to come back. I've played with adding "www." to the stop-words list, but my guess is the tokenizer is treating the entire URL as a single word. The only workaround I've come up with is a dual-search system: if elasticlunr returns nothing, I run a regex over the entire content.
var text = 'goog';
var pattern = new RegExp(text, 'gi');
var matches = content.match(pattern);
if (matches && matches.length > 0) {
  // add to search results, but the score will need to be adjusted for this
}
@mtycholaz @paambaati Two things, addressing two separate concerns:
Expansion is done left-to-right. Under the hood, expand walks down the inverted index. That index is currently an arbitrarily deep JS object representing a tree where each level is one character. For instance, in your case, the index would look like this:
- L
   \ o
      \ r
         \ e
            \ m
Due to this structure, in 0.9.5 it would be highly unrealistic to expand right-to-left, as this would require visiting every deepest branch of the index, traveling upwards, then looking at the (N-1)th level and repeating.
I've made changes in the next version to turn this index into an independent, customizable internal storage format, where the default implementation is a straight-up K-V map. For cases like this one, the keys can then be scanned with an arbitrary function. However, this is still at the prototype stage.
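To sketch why a flat K-V store helps here (this models the prototype described above, not any released elasticlunr API): once tokens are plain map keys, expansion is just a scan with whatever predicate you like, including infix or suffix matching.

```javascript
// Hypothetical flat token store: token -> list of document refs.
var tokenStore = {
  "lorem": [1],
  "ipsum": [1],
  "www.google.com": [2]
};

// Expansion becomes a linear scan with an arbitrary predicate.
function expandWith(predicate, store) {
  return Object.keys(store).filter(predicate);
}

// Infix match: finds "google" anywhere inside a token,
// which the character-trie layout cannot do cheaply.
var hits = expandWith(function (key) {
  return key.indexOf("google") !== -1;
}, tokenStore);
// hits: ["www.google.com"]
```

The trade-off is that each query term now costs a pass over every distinct token, instead of the trie's descent proportional to the term's length.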
Regarding URLs: the currently released version of elasticlunr does not allow you to do this constructively, although it does allow you to do it destructively by defining your own tokenizing pipeline. The problem is that the current version does not allow multiple tokens to simultaneously reference the same item in a string. To see why this feature matters, suppose you are storing URLs and you have stored foo.bar/baz#1 and foo.bar/baz#2. Because the hash marks a client-side-only parameter, you could stem both of those URLs to foo.bar/baz; however, this prevents a client who searches foo.bar from getting back all pages referencing that host without a full sweep of the inverted index (and the previously mentioned expand issue prevents you from mapping www.foo.bar as foo.bar, which, while valid behavior, is still an issue for most).
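As a sketch of the "destructive" tokenizer route mentioned above (the splitting rules here are an assumption for illustration, not elasticlunr's built-in behavior): splitting URLs on punctuation makes each piece its own token, so a query for google can hit www.google.com.

```javascript
// Illustrative URL-aware tokenizer: split on whitespace and URL punctuation,
// then drop the empty fragments left by consecutive separators.
function urlTokenizer(text) {
  return String(text)
    .toLowerCase()
    .split(/[\s./:#?&=_-]+/)
    .filter(function (token) {
      return token.length > 0;
    });
}

urlTokenizer("https://www.google.com"); // ["https", "www", "google", "com"]
```

It is destructive in the sense described above: the original URL is no longer recoverable as one token. elasticlunr 0.9.x also exposes elasticlunr.tokenizer.setSeperator (the library's spelling) to change the separator globally, but check that against your installed version, since it affects every field, not just URLs.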
My current fork allows this, but it is not as straightforward an endeavour as I would like. This is definitely an area for improvement, and I'd be interested to hear a bit more from both of you about your requirements so we can get this right :-)