elasticlunr.js

Searching partial words

Open ksanderon opened this issue 7 years ago • 4 comments

Hello, I'm not sure how to configure elasticlunr to search for partial words in the provided documents, or whether that is possible at all.

For example, I have:

function createSearchIndex(index) {
    var reverseIndex = elasticlunr(function() {
        this.addField("name");
        this.addField("url");
        this.addField("topics");
        this.addField("description");
        this.setRef("id");
        this.saveDocument(false);
    });

    index.forEach(function(value, id) {
        if (value.verified) {
            reverseIndex.addDoc({
                name: value.name,
                url: value.url,
                topics: value.topics,
                description: value.description,
                id: id
            });
        }
    });

    return reverseIndex;
}

My search function is:

function search(phrase, index, reverseIndex) {
    return new Promise(function(resolve) {
        resolve(reverseIndex.search(phrase, {
            fields: {
                name: {
                    boost: 3
                },
                url: {
                    boost: 2
                },
                topics: {
                    boost: 3
                },
                description: {
                    boost: 1
                }
            }
        }));
    }).then(function(results) {
        return results.map(function(result) {
            var entry = index[result.ref];
            return {
                url: entry.url,
                name: entry.name,
                topics: entry.topics,
                description: entry.description,
                popularity: entry.popularity
            };
        });
    });
};

Example index content:

[
  {
    "url": "https:\/\/gitlab.com\/rili\/service\/calltrace.git",
    "name": "rili\/service\/calltrace",
    "topics": ["c++", "rili", "service"],
    "description": "Allow to collect stacktraces\/calltraces\/backtraces",
    "popularity": 1,
    "verified": true,
    "invalid": false
  },
  {
    "url": "https:\/\/gitlab.com\/rili\/compatibility.git",
    "name": "rili\/compatibility",
    "topics": ["c++", "rili"],
    "description": "Rili macros, type definitions etc. which help to care about compatibility between language standards and compiler versions.",
    "popularity": 1,
    "verified": true,
    "invalid": false
  }
]

I would like the phrase "stacktrace" to find "https://gitlab.com/rili/service/calltrace.git", and the phrase "rili" to find both "https://gitlab.com/rili/service/calltrace.git" and "https://gitlab.com/rili/compatibility.git".

How can I achieve this with elasticlunr?

ksanderon avatar Oct 05 '17 09:10 ksanderon

Try using Token Expand.

In your search() call arguments, include expand: true.
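
A minimal sketch of what that could look like with the field boosts from the original question. `buildSearchConfig` is a hypothetical helper name, not part of elasticlunr; it only builds the second argument passed to `search()`. Note that expansion extends a query token to indexed tokens that begin with it, so it helps with prefixes rather than arbitrary substrings.

```javascript
// Hypothetical helper: builds the configuration object for reverseIndex.search().
function buildSearchConfig(expand) {
    return {
        fields: {
            name: { boost: 3 },
            url: { boost: 2 },
            topics: { boost: 3 },
            description: { boost: 1 }
        },
        // Token expansion: a query token may match indexed tokens that start with it.
        expand: expand
    };
}

var searchConfig = buildSearchConfig(true);
// Usage with an index built as in createSearchIndex above:
// reverseIndex.search("stack", searchConfig);
```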

paambaati avatar Nov 07 '17 18:11 paambaati

That's not working for me. Let's say I have the title Lorem ipsum: searching for orem doesn't return any results.

wintercounter avatar Jun 12 '19 11:06 wintercounter

@wintercounter @paambaati It would be great to have this supported; I'm in the same boat. Another example is URLs. When you have www.google.com, searching for google doesn't return any results. The search term has to be www.g for the result to come back. I've played with adding www. to the stop words list, but my guess is that the tokenizer is treating the entire URL as a single word. The only workaround I've come up with is a dual-search system: if elasticlunr returns nothing, I run a regex over the entire content:

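var text = 'goog';
// Escape regex metacharacters so the query is matched literally.
var pattern = new RegExp(text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), 'gi');
var matches = content.match(pattern);
if (matches && matches.length > 0) {
    // add to search results, but score will need to be adjusted for this
}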
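
The dual-search workaround above could be wrapped roughly as follows. This is only a sketch: `searchWithFallback`, `primarySearch`, and the `docs` map of ref to raw text are hypothetical names, and the fallback score of 0 is a placeholder, since how to score regex hits is left open.

```javascript
// Escape regex metacharacters so the query is matched literally.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// primarySearch: function(query) returning an array of { ref, score },
//                for example a thin wrapper around the index's search().
// docs: object mapping ref -> raw text content of that document.
function searchWithFallback(query, primarySearch, docs) {
    var results = primarySearch(query);
    if (results.length > 0) {
        return results;
    }
    // No 'g' flag here: test() on a global regex keeps state between calls.
    var pattern = new RegExp(escapeRegExp(query), 'i');
    return Object.keys(docs)
        .filter(function(ref) { return pattern.test(docs[ref]); })
        // Placeholder score; regex hits carry no relevance information.
        .map(function(ref) { return { ref: ref, score: 0 }; });
}
```

The regex pass only runs when the primary search comes back empty, so its cost (a scan over every document) is paid only for queries the index cannot answer.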

mtycholaz avatar Dec 12 '19 18:12 mtycholaz

@mtycholaz @paambaati Two things, with two concerns:

Expansion is done left-to-right. Under the hood, expand travels down the inverted index, which is currently a potentially infinitely deep JS object representing a tree where each level represents a character. For instance, for the token Lorem, the index would look like this:

   -       L
    \      o
     \     r
      \    e
       \   m 

Due to this structure, in 0.9.5 it would be highly unrealistic to expand right-to-left: this would require going to every deepest branch of the index, travelling upwards, then looking at the (N-1)th level and repeating.
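
To illustrate why a per-character tree favours prefixes, here is a simplified sketch. It is not elasticlunr's actual index code; the node shape (`children`, `docs`) and the function names are assumptions made for the example.

```javascript
// Insert a token into a nested-object tree, one level per character.
// A non-empty `docs` array marks the end of a complete token.
function insertToken(root, token, ref) {
    var node = root;
    for (var i = 0; i < token.length; i++) {
        var ch = token[i];
        if (!node.children[ch]) {
            node.children[ch] = { children: {}, docs: [] };
        }
        node = node.children[ch];
    }
    node.docs.push(ref);
}

// Walk down along the prefix, then collect every token in that subtree.
function expandPrefix(root, prefix) {
    var node = root;
    for (var i = 0; i < prefix.length; i++) {
        node = node.children[prefix[i]];
        if (!node) {
            return [];
        }
    }
    var found = [];
    (function collect(current, token) {
        if (current.docs.length > 0) {
            found.push(token);
        }
        Object.keys(current.children).forEach(function(ch) {
            collect(current.children[ch], token + ch);
        });
    })(node, prefix);
    return found;
}

var root = { children: {}, docs: [] };
insertToken(root, "lorem", 1);
insertToken(root, "lord", 2);
insertToken(root, "ipsum", 1);

expandPrefix(root, "lor");  // finds "lorem" and "lord"
expandPrefix(root, "orem"); // finds nothing: no top-level branch starts with "o"
```

The walk for "orem" fails at the first character, which is why a suffix or infix match would need a visit to every branch instead of a single descent.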

I've made changes in the next version to turn this index into an independent and customizable internal storage format, where the default implementation is a straight-up K-V map, for those cases like this one, where the keys can be scanned through using an arbitrary function. However, this is still in prototype stage.
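
As a rough illustration of the idea (not the prototype's actual code), a flat token-to-documents map can be scanned with any predicate. The `store` contents, `scanTokens`, and `contains` are hypothetical names for this sketch.

```javascript
// Flat token -> document refs store (a plain object standing in for a K-V map).
var store = {
    "lorem": [1],
    "ipsum": [1],
    "www.google.com": [2]
};

// Return the refs of every document with at least one token satisfying `matches`.
function scanTokens(kv, matches) {
    var refs = [];
    Object.keys(kv).forEach(function(token) {
        if (matches(token)) {
            kv[token].forEach(function(ref) {
                if (refs.indexOf(ref) === -1) {
                    refs.push(ref);
                }
            });
        }
    });
    return refs;
}

// Predicate factory: true when the token contains `part` anywhere.
function contains(part) {
    return function(token) { return token.indexOf(part) !== -1; };
}

scanTokens(store, contains("orem"));   // matches the token "lorem"
scanTokens(store, contains("google")); // matches the token "www.google.com"
```

The trade-off is that every scan is linear in the number of distinct tokens, whereas the tree walk only touches the branch for the given prefix.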


Regarding URLs: the currently released version of elasticlunr does not allow you to do this constructively, although it does allow you to do it destructively by defining your own tokenizing pipeline. The problem is that the current version does not allow multiple tokens to simultaneously reference the same item in a string. To see why this feature matters, suppose you are storing URLs and you stored foo.bar/baz#1 and foo.bar/baz#2. Because the hash marks a client-side-only parameter, you could stem both of those URLs to foo.bar/baz. However, this prevents a client from searching for foo.bar and getting back all pages under it without a full sweep of the inverted index. (The expand issue mentioned above also prevents you from mapping www.foo.bar to foo.bar, which, while valid behavior, is still an issue for most.)
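
To make "multiple tokens referencing the same item" concrete, here is a standalone sketch that derives several tokens from one URL. It is a preprocessing idea outside elasticlunr, not the fork's implementation; `urlTokens` is a hypothetical name, and how such tokens would be fed into the index is left open.

```javascript
// Derive several search tokens from one URL-like string.
function urlTokens(url) {
    var withoutHash = url.split('#')[0];  // drop the client-side fragment
    var withoutScheme = withoutHash.replace(/^[a-z][a-z0-9+.-]*:\/\//i, '');
    var host = withoutScheme.split('/')[0];
    var bareHost = host.replace(/^www\./i, '');
    var tokens = [withoutScheme, host, bareHost].concat(bareHost.split('.'));
    // Lowercase, drop empty strings, and de-duplicate in first-seen order.
    return tokens
        .map(function(t) { return t.toLowerCase(); })
        .filter(function(t, i, all) { return t.length > 0 && all.indexOf(t) === i; });
}

urlTokens("foo.bar/baz#1");  // ["foo.bar/baz", "foo.bar", "foo", "bar"]
urlTokens("www.google.com"); // ["www.google.com", "google.com", "google", "com"]
```

Here foo.bar/baz#1 and foo.bar/baz#2 yield the same token list, and both the full path and the bare host remain searchable as exact tokens.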

My fork currently allows this, but it is not as straightforward an endeavour as I would like. This is definitely an area for improvement, and I'd be interested to hear a bit more from both of you about your requirements so we can get this right :-)

srenauld avatar Dec 16 '19 11:12 srenauld