lunr.js

Not all tokens are being indexed

Open npearson72 opened this issue 4 years ago • 2 comments

I have the following setup:

    const records = [
      {
        id: 1,
        title: 'Test 1',
        description: 'It is her couch',
        url: 'www.example.com',
        tags: 'a,b,c'
      },
      {
        id: 2,
        title: 'Test 2',
        description: "The couch is her's",
        url: 'www.sample.com',
        tags: 'x,y,z'
      }
    ];

    const idx = lunr(function () {
      this.ref('id');
      this.field('title', { boost: 100 });
      this.field('description');
      this.field('tags');
      this.field('url');

      this.pipeline.remove(lunr.stemmer);
      this.searchPipeline.remove(lunr.stemmer);

      for (const record of records) {
        const tunedRecord = {
          ...record,
          description: record.description,
          tags: record.tags.split(','),
          url: record.url.split(/\W+/)
        };

        this.add(tunedRecord);
      }
    });

The resulting invertedIndex is:

[Screenshot of the invertedIndex contents, taken 2021-01-17]

Notice that her's (from record 2) was indexed, but her (from record 1) was not.

The same happens if I remove record 2. A single record with the word her does not get indexed.

Is there a trick to this or is this a bug?

FYI: I have found other similar occurrences.

npearson72, Jan 17 '21 01:01

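I ran into this too and figured out what was going on.

See #480.

Basically, the default indexing pipeline includes a "stopWordFilter" step that filters out a set of common small words (including "her" or, in my case, "get"). If you want those words indexed, just remove that step. Note that by default it is registered on pipeline, not searchPipeline, so it has to be removed from pipeline; removing it from searchPipeline has no effect on what gets indexed. Here config is the builder passed to the lunr setup function, the same object as this in the setup above:

    config.pipeline.remove(lunr.stopWordFilter)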
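
To make the effect concrete, here is a minimal, self-contained sketch of why "her" disappears while "her's" survives. It does not call lunr at all: STOP_WORDS, trim, dropStopWords and indexTokens are hypothetical names, the word list is only a tiny excerpt, and the tokenizing rules are an assumption meant to approximate the behaviour described above, not lunr's actual implementation.

```javascript
// Simplified model of an indexing pipeline; NOT lunr's real code.
// Assumption: a small excerpt of stop words, including the two named in this thread.
const STOP_WORDS = new Set(['her', 'get', 'is', 'it', 'the']);

// Strip leading and trailing non-word characters; inner apostrophes survive.
const trim = (token) => token.replace(/^\W+/, '').replace(/\W+$/, '');

// Return undefined to drop a token, like a filter-style pipeline step.
const dropStopWords = (token) => (STOP_WORDS.has(token) ? undefined : token);

// Lowercase, split on whitespace and hyphens, then run every token through the pipeline.
function indexTokens(text, pipeline) {
  return text
    .toLowerCase()
    .split(/[\s-]+/)
    .map((token) => pipeline.reduce((t, fn) => (t === undefined ? t : fn(t)), token))
    .filter((token) => token !== undefined && token !== '');
}

const withFilter = [trim, dropStopWords];
// Equivalent in spirit to removing the stop word step from the indexing pipeline.
const withoutFilter = withFilter.filter((fn) => fn !== dropStopWords);

console.log(indexTokens('It is her couch', withFilter));    // only "couch" is kept
console.log(indexTokens("The couch is her's", withFilter)); // "couch" and "her's" are kept
console.log(indexTokens('It is her couch', withoutFilter)); // all four words are kept
```

In this model "her's" is kept only because it is not an exact match for any entry in the list, which is consistent with what the original post observed in the invertedIndex.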

cheers!

ramirezmike, Feb 11 '21 01:02

I think my point is that the stopWordFilter is where the bug is.

npearson72, Feb 11 '21 05:02