lunr.js
Not all tokens are being indexed
I have the following setup:
const records = [
  {
    id: 1,
    title: 'Test 1',
    description: 'It is her couch',
    url: 'www.example.com',
    tags: 'a,b,c'
  },
  {
    id: 2,
    title: 'Test 2',
    description: "The couch is her's",
    url: 'www.sample.com',
    tags: 'x,y,z'
  }
];
const idx = lunr(function () {
  this.ref('id');
  this.field('title', { boost: 100 });
  this.field('description');
  this.field('tags');
  this.field('url');

  // Disable stemming for both indexing and searching.
  this.pipeline.remove(lunr.stemmer);
  this.searchPipeline.remove(lunr.stemmer);

  for (const record of records) {
    // Pre-split tags and url into arrays of tokens before indexing.
    const tunedRecord = {
      ...record,
      description: record.description,
      tags: record.tags.split(','),
      url: record.url.split(/\W+/)
    };
    this.add(tunedRecord);
  }
});
The resulting invertedIndex is:
[Screenshot: Screen Shot 2021-01-17 at 3 27 39 AM]
Notice that her's (from record 2) was indexed, but her (from record 1) was not.
The same happens if I remove record 2. A single record with the word her does not get indexed.
Is there a trick to this or is this a bug?
FYI: I have found other similar occurrences.
I ran into this too and realized what was going on; see #480.
Basically, the default pipeline includes a "stopWordFilter" function that filters out a set of common short words (including "her" or, in my case, "get"). If you want to include those, just remove that function from the pipeline:
config.pipeline.remove(lunr.stopWordFilter)
cheers!
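To make the mechanism concrete, here is a minimal sketch of how a stop-word step in a token pipeline ends up dropping her while keeping her's. This is a simplified model and not lunr's actual source: STOP_WORDS, trimmer, stopWordFilter, tokenize and runPipeline below are hypothetical stand-ins written for this example, and the word list is only a small illustrative subset.

```javascript
// Simplified model of an indexing pipeline; NOT lunr's actual code.
// STOP_WORDS is a tiny illustrative subset, not lunr's full list.
const STOP_WORDS = new Set(['her', 'get', 'is', 'it', 'the']);

// Strip non-word characters from both ends of a token (trimmer-like step).
const trimmer = (token) => token.replace(/^\W+/, '').replace(/\W+$/, '');

// Drop a token by returning undefined when it is in the stop word set.
const stopWordFilter = (token) => (STOP_WORDS.has(token) ? undefined : token);

// Run all tokens through each pipeline function in order, discarding
// tokens for which a function returns undefined or an empty string.
function runPipeline(tokens, pipeline) {
  return pipeline.reduce(
    (acc, fn) => acc.map(fn).filter((t) => t !== undefined && t !== ''),
    tokens
  );
}

// Lowercase and split on whitespace (a rough stand-in for a tokenizer).
const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);

const withFilter = [trimmer, stopWordFilter];
const withoutFilter = [trimmer];

// Record 1: "her" matches the stop word set exactly and is dropped.
const a = runPipeline(tokenize('It is her couch'), withFilter);
// Record 2: "her's" is a different token, so it passes the filter.
const b = runPipeline(tokenize("The couch is her's"), withFilter);
// Without the filter, every token from record 1 is kept.
const c = runPipeline(tokenize('It is her couch'), withoutFilter);

console.log(a, b, c);
```

In this sketch, a contains only couch, b contains couch and her's, and c keeps all four tokens. As far as I can tell from the lunr 2.x source, the real stop word filter is registered on the default indexing pipeline (this.pipeline in the setup above), so that is where it would have to be removed for her to show up in the index.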
I think my point is that the stopWordFilter is where the bug is