orama
Count term occurrences in document
Is your feature request related to a problem? Please describe. Right now, we're not considering how many times a term appears in a document.
For instance, given the following strings:
- "It's alive! It's alive!"
- "I saw them alive"

When searching for "alive", the first string should have priority, as the term "alive" appears twice.
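As a minimal sketch of the idea (a hypothetical helper, not Lyra's actual API), counting whole-word, case-insensitive matches of a term could look like this:

```typescript
// Hypothetical helper: count whole-word, case-insensitive matches of a
// term in a string. Assumes `term` contains no regex metacharacters.
function countOccurrences(text: string, term: string): number {
  const pattern = new RegExp(`\\b${term}\\b`, "gi");
  return (text.match(pattern) ?? []).length;
}
```

With this, `countOccurrences("It's alive! It's alive!", "alive")` returns 2 while the second string returns 1, so the first string would rank higher.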
That sounds great @micheleriva! I think it would be a good idea to implement a sort parameter (and thus the method) within the search function, to sort the elements found (using occurrences just for the moment). If sort is not specified, the results would be returned as usual, in insertion order.
Occurrences are very important data in a search action; could they be selected as the default?
@mateonunez could you write some pseudocode to explain how the sort function would work? I wouldn't make it optional: Lyra should always take occurrences into consideration when searching.
@micheleriva I have prepared a branch with working (though incomplete) code for it.
The accepted sort options are: none (as usual) and relevance (by occurrences). relevance could become the default if you want to implement search by occurrences natively.
There are two methods: getOccurrences(document, index, term), which returns the count of the occurrences of the term in the document (using a Regex for it), and sortResults(results, sort), with a simple switch to manage all accepted sorting algorithms.
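A rough sketch of how those two methods might look, based on the description above (the signatures match the description, but the field access and result shape are assumptions):

```typescript
type Hit = { id: string; quote: string; occurrences?: number };

// Count whole-word matches of `term` in the `index` field of `document`
// using a Regex, as described above. Field lookup shape is assumed.
function getOccurrences(
  document: Record<string, string>,
  index: string,
  term: string
): number {
  const text = document[index] ?? "";
  return (text.match(new RegExp(`\\b${term}\\b`, "gi")) ?? []).length;
}

// Simple switch over the accepted sort options: "none" keeps insertion
// order, "relevance" sorts by occurrence count, descending.
function sortResults(results: Hit[], sort: "none" | "relevance"): Hit[] {
  switch (sort) {
    case "relevance":
      return [...results].sort(
        (a, b) => (b.occurrences ?? 0) - (a.occurrences ?? 0)
      );
    case "none":
    default:
      return results;
  }
}
```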
@mateonunez thanks! Any idea how that affects performance?
Also, in which scenario would I want to sort the results in "random" order?
Initially, this implementation significantly affects the performance of the search action.
Current timing:
{
elapsed: '114μs',
hits: [
{
id: 'PmLDkPyPWlRPG5qFcmS_7',
quote: "It's alive! It's alive!",
author: 'H.P. Lovecraft'
}
],
count: 1
}
With occurrences implementation:
{
elapsed: '158μs',
hits: [
{
id: 'qUMSJWBHvxJ0Islfwpfpz',
quote: "It's alive! It's alive!",
author: 'H.P. Lovecraft',
occurrences: 2
}
],
count: 1
}
The occurrences computation accounts for roughly 27.8% of the total search time.
With the sort implementation:
{
elapsed: '162μs',
hits: [
{
id: 'KGpFwdn5mbXgEpCRXamhd',
quote: "It's alive! It's alive!",
author: 'H.P. Lovecraft',
occurrences: 2
}
],
count: 1
}
The first occurrencesTime costs about 40-50μs; after that, the cost is about 0-3μs:
{
id: 'kdNXamfKo_CCFpTATPOvk',
quote: 'I saw them alive. Alive man. You got it? Alive.',
author: 'H.P. Lovecraft',
occurrences: 3,
occurrencesTime: '46μs'
},
{
id: 'DQ0mXcVNSnv4hJ8Du34k8',
quote: "It's alive! It's alive!",
author: 'H.P. Lovecraft',
occurrences: 2,
occurrencesTime: '3μs'
},
{
id: '_E5tQaXAXwuNmNFqSoCea',
quote: 'I saw them alive. Alive man. You got it? Alive.',
author: 'H.P. Lovecraft',
occurrences: 3,
occurrencesTime: '1μs'
},
{
id: '7HqOPlGMRJtXXtKdh9eXp',
quote: 'alive in the middle of the night.',
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '1μs'
},
{
id: 'JJ0Mec0ze_ivNFrZO4grh',
quote: "Oh, I, oh I'm still alive",
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '1μs'
},
{
id: 'gSfPxZwmoY2tcu05aywoS',
quote: 'I feel alive',
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '0μs'
},
{
id: 'NbYe5iNSLkJBwKNRj97ti',
quote: 'Alive is the only thing I know',
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '0μs'
},
{
id: 'fXGKFc6oFdVMFk4OBc7Og',
quote: 'Alive is the only thing I know',
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '0μs'
}
Regarding the random (or first-in-first-out) order: I was imagining a situation where we need the data the way we usually handle queues, so the first document inserted into the database will be the first document returned by the search method.
I've tried it with 1M documents. The result is amazing! @micheleriva
Without implementation:
{
elapsed: '421ms',
hits: [
{
id: 'ZpRIoxvyzosuI6yNW47cx',
quote: 'I saw them alive. Alive man. Did you get it? Alive.',
author: 'H.P. Lovecraft'
}
],
count: 1000000
}
With implementation:
{
elapsed: '426ms',
hits: [
{
id: '3PEwoThPgs0we8IBlxIoi',
quote: 'I saw them alive. Alive man. Did you get it? Alive.',
author: 'H.P. Lovecraft',
occurrences: 3
}
],
count: 1000000
}
This is pretty rad. Would you mind opening a PR?
I could work on it, but there is one question to settle first: what should Lyra do when the limit parameter is set? Should it compute all occurrences, sort them, and then truncate the output? That operation could be more expensive.
Currently, Lyra limits the results by exiting the loop that retrieves the documents; the occurrences are computed afterwards.
This means that if I insert 10 consecutive documents (each with one more occurrence than the previous) and then set the limit to 2, Lyra will only consider the first 2 documents inserted.
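The trade-off can be illustrated with two hypothetical helpers (neither is Lyra's actual code; the counting helper assumes a metacharacter-free term):

```typescript
// Count whole-word, case-insensitive matches of `term` in `text`.
function count(text: string, term: string): number {
  return (text.match(new RegExp(`\\b${term}\\b`, "gi")) ?? []).length;
}

// (a) Current behaviour: stop retrieving after `limit` documents, then
// compute occurrences. Cheap, but ranking only reflects insertion order.
function limitThenScore(docs: string[], term: string, limit: number) {
  return docs
    .slice(0, limit)
    .map(quote => ({ quote, occurrences: count(quote, term) }));
}

// (b) Alternative: compute occurrences for every match, sort by relevance,
// then truncate. Correct ranking, but touches all matching documents.
function scoreThenLimit(docs: string[], term: string, limit: number) {
  return docs
    .map(quote => ({ quote, occurrences: count(quote, term) }))
    .sort((a, b) => b.occurrences - a.occurrences)
    .slice(0, limit);
}
```

With the three "Repeat with me" quotes and a limit of 1, approach (a) returns the single-occurrence quote (first inserted), while approach (b) returns the triple-occurrence one.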
Simple Search with Relevance:
{
elapsed: '244μs',
hits: [
{
id: 'DhQVMo6RI5IMpJyr8C5xf',
quote: 'Repeat with me: alive, alive, alive',
author: 'Oscar Wilde',
occurrences: 3
},
{
id: 'w6dIlijToS9GOaKEG6fBm',
quote: 'Repeat with me: alive, alive',
author: 'Oscar Wilde',
occurrences: 2
},
{
id: 'ME5jTyPK7slIZK9p0tu6u',
quote: 'Repeat with me: alive',
author: 'Oscar Wilde',
occurrences: 1
}
],
count: 3
}
Search with Relevance and Limit set to 1:
{
elapsed: '22μs',
hits: [
{
id: 'ME5jTyPK7slIZK9p0tu6u',
quote: 'Repeat with me: alive',
author: 'Oscar Wilde',
occurrences: 1
}
],
count: 3
}
What do you think about it?
@mateonunez linear search is not a good option; I think we could count the occurrences of a given term in each document during the indexing process and use that information to determine which document contains the term the most.
This is also related to #13, as the exact order search is more important than the occurrences of a term in a document.
I think we could start with this issue and then move on with #13
Going "pseudo-logic":
private async _insert({ doc, id, language }: QueueDocParams): Promise<void> {
const index = this.index;
this.docs.set(id, doc);
function recursiveTrieInsertion(doc: object) {
for (const key in doc) {
if (typeof (doc as any)[key] === "object") {
throw ERRORS.UNSUPPORTED_NESTED_PROPERTIES();
// @todo nested properties will be supported in the next versions
// recursiveTrieInsertion((doc as any)[key]);
} else if (typeof (doc as any)[key] === "string") {
const requestedTrie = index.get(key);
const tokens = tokenize((doc as any)[key], language);
++ const tokenFrequency = getTokenFrequency(tokens); // <-------- new function to count token frequency
for (const token of tokens) {
-- requestedTrie?.insert(token, id);
++ requestedTrie?.insert(token, [id, tokenFrequency[token]]); // <------ just an idea, it doesn't have to be an array at all costs
}
}
}
}
recursiveTrieInsertion(doc);
}
I would evolve around this idea to keep the time complexity as low as possible
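A minimal sketch of the getTokenFrequency helper referenced in the pseudo-logic above (the returned shape is an assumption):

```typescript
// Count how many times each token appears in the tokenized field, so the
// trie can store the frequency alongside the document id at insert time.
function getTokenFrequency(tokens: string[]): Record<string, number> {
  const frequency: Record<string, number> = {};
  for (const token of tokens) {
    frequency[token] = (frequency[token] ?? 0) + 1;
  }
  return frequency;
}
```

With this shape, the insertion loop can pass `tokenFrequency[token]` to the trie for each token, avoiding any per-document linear scan at search time.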
Totally agree. I had thought the best solution was to work directly on the trie-composition step.
I would like to play with your pseudo-logic, and if the result is interesting I might open a PR, removing the "sort" in favour of #13.
@DanieleFedeli yes, tf-idf could be the way to go. When inserting a new document, we do so by creating a queue and indexing one doc at a time. That means that on very large documents we could spend some more time analyzing the tokens using tf-idf or a similar technique (Jaccard index, etc.)
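For reference, a minimal tf-idf sketch (an illustration of the general technique, not Lyra's implementation; assumes non-empty documents):

```typescript
// tf  = term count in the document / total tokens in the document
// idf = log(total documents / documents containing the term)
function tfIdf(term: string, docTokens: string[], allDocs: string[][]): number {
  const tf = docTokens.filter(t => t === term).length / docTokens.length;
  const docsWithTerm = allDocs.filter(d => d.includes(term)).length;
  const idf = Math.log(allDocs.length / (docsWithTerm || 1));
  return tf * idf;
}
```

A document where the term is relatively more frequent scores higher, and terms that appear in every document score zero, which naturally down-weights common words.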
FYI I'm working on a prototype with tf-idf, let's see how it goes
FYI, the work is proceeding on https://github.com/nearform/lyra/pull/63
Should this ticket be closed?