orama
Count term occurrences in document
Is your feature request related to a problem? Please describe. Right now, we're not considering how many times a term appears in a document.
For instance, given the following strings:
- "It's alive! It's alive!"
- "I saw them alive"

When searching for "alive", the first string should have priority, as the term "alive" appears twice.
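As a minimal sketch of the idea (a hypothetical helper, not Lyra's actual API), counting whole-word, case-insensitive matches of a term could look like this:

```typescript
// Hypothetical helper: count whole-word, case-insensitive matches of a
// term in a string. Assumes `term` contains no regex metacharacters.
function countOccurrences(text: string, term: string): number {
  const pattern = new RegExp(`\\b${term}\\b`, "gi");
  return (text.match(pattern) ?? []).length;
}
```

With this, `countOccurrences("It's alive! It's alive!", "alive")` returns 2 while the second string returns 1, so the first string would rank higher.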
That sounds great @micheleriva! I think it would be a good idea to implement a sort parameter (and thus the method) within the search function, to sort the elements found (using occurrences just for the moment). If sort is not specified, the results would be returned as usual, in insertion order.
Occurrences are very important data in a search action; could they be selected as the default?
@mateonunez could you write some pseudocode to explain how the sort function would work? I wouldn't make it optional: Lyra should always take occurrences into consideration when searching.
@micheleriva I have prepared a branch with working (though incomplete) code for it.
The accepted sort options are: none (as usual) and relevance (by occurrences). relevance could become the default if you want to implement search by occurrences natively.
There are two methods: getOccurrences(document, index, term), which returns the count of the occurrences of the term in the document (using a Regex for it), and sortResults(results, sort), with a simple switch to manage all accepted sorting algorithms.
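A rough sketch of how those two methods might look, based on the description above (the signatures match the description, but the field access and result shape are assumptions):

```typescript
type Hit = { id: string; quote: string; occurrences?: number };

// Count whole-word matches of `term` in the `index` field of `document`
// using a Regex, as described above. Field lookup shape is assumed.
function getOccurrences(
  document: Record<string, string>,
  index: string,
  term: string
): number {
  const text = document[index] ?? "";
  return (text.match(new RegExp(`\\b${term}\\b`, "gi")) ?? []).length;
}

// Simple switch over the accepted sort options: "none" keeps insertion
// order, "relevance" sorts by occurrence count, descending.
function sortResults(results: Hit[], sort: "none" | "relevance"): Hit[] {
  switch (sort) {
    case "relevance":
      return [...results].sort(
        (a, b) => (b.occurrences ?? 0) - (a.occurrences ?? 0)
      );
    case "none":
    default:
      return results;
  }
}
```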
@mateonunez thanks! Any idea how that affects performance?
Also, in which scenario would I want to sort the results in "random" order?
Initially, this implementation significantly affects the performance of the search action.
Current timing:
{
elapsed: '114μs',
hits: [
{
id: 'PmLDkPyPWlRPG5qFcmS_7',
quote: "It's alive! It's alive!",
author: 'H.P. Lovecraft'
}
],
count: 1
}
With occurrences implementation:
{
elapsed: '158μs',
hits: [
{
id: 'qUMSJWBHvxJ0Islfwpfpz',
quote: "It's alive! It's alive!",
author: 'H.P. Lovecraft',
occurrences: 2
}
],
count: 1
}
The occurrences computation accounts for roughly 27.8% of the total search time.
With the sort implementation:
{
elapsed: '162μs',
hits: [
{
id: 'KGpFwdn5mbXgEpCRXamhd',
quote: "It's alive! It's alive!",
author: 'H.P. Lovecraft',
occurrences: 2
}
],
count: 1
}
The first occurrencesTime costs about 40-50μs; after that, the cost is about 0-3μs:
{
id: 'kdNXamfKo_CCFpTATPOvk',
quote: 'I saw them alive. Alive man. You got it? Alive.',
author: 'H.P. Lovecraft',
occurrences: 3,
occurrencesTime: '46μs'
},
{
id: 'DQ0mXcVNSnv4hJ8Du34k8',
quote: "It's alive! It's alive!",
author: 'H.P. Lovecraft',
occurrences: 2,
occurrencesTime: '3μs'
},
{
id: '_E5tQaXAXwuNmNFqSoCea',
quote: 'I saw them alive. Alive man. You got it? Alive.',
author: 'H.P. Lovecraft',
occurrences: 3,
occurrencesTime: '1μs'
},
{
id: '7HqOPlGMRJtXXtKdh9eXp',
quote: 'alive in the middle of the night.',
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '1μs'
},
{
id: 'JJ0Mec0ze_ivNFrZO4grh',
quote: "Oh, I, oh I'm still alive",
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '1μs'
},
{
id: 'gSfPxZwmoY2tcu05aywoS',
quote: 'I feel alive',
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '0μs'
},
{
id: 'NbYe5iNSLkJBwKNRj97ti',
quote: 'Alive is the only thing I know',
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '0μs'
},
{
id: 'fXGKFc6oFdVMFk4OBc7Og',
quote: 'Alive is the only thing I know',
author: 'H.P. Lovecraft',
occurrences: 1,
occurrencesTime: '0μs'
}
Regarding the random (or first-in-first-out) order: I was imagining a situation where we need the data the way we usually handle queues, so the first document inserted into the database will be the first document returned by the search method.
I've tried it with 1M documents. The result is amazing! @micheleriva
Without implementation:
{
elapsed: '421ms',
hits: [
{
id: 'ZpRIoxvyzosuI6yNW47cx',
quote: 'I saw them alive. Alive man. Did you get it? Alive.',
author: 'H.P. Lovecraft'
}
],
count: 1000000
}
With implementation:
{
elapsed: '426ms',
hits: [
{
id: '3PEwoThPgs0we8IBlxIoi',
quote: 'I saw them alive. Alive man. Did you get it? Alive.',
author: 'H.P. Lovecraft',
occurrences: 3
}
],
count: 1000000
}
This is pretty rad. Would you mind opening a PR?
I could work on it, but there is one question to settle first: what should Lyra do when the limit parameter is set? Should it compute all occurrences, sort them, and then truncate the output? That operation could be more expensive.
Currently, Lyra limits the results by exiting the loop that retrieves the documents; the occurrences are computed afterwards.
This means that if I insert 10 consecutive documents (each with one more occurrence than the previous) and then set the limit to 2, Lyra will only consider the first 2 documents inserted.
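The trade-off can be illustrated with two hypothetical helpers (neither is Lyra's actual code; the counting helper assumes a metacharacter-free term):

```typescript
// Count whole-word, case-insensitive matches of `term` in `text`.
function count(text: string, term: string): number {
  return (text.match(new RegExp(`\\b${term}\\b`, "gi")) ?? []).length;
}

// (a) Current behaviour: stop retrieving after `limit` documents, then
// compute occurrences. Cheap, but ranking only reflects insertion order.
function limitThenScore(docs: string[], term: string, limit: number) {
  return docs
    .slice(0, limit)
    .map(quote => ({ quote, occurrences: count(quote, term) }));
}

// (b) Alternative: compute occurrences for every match, sort by relevance,
// then truncate. Correct ranking, but touches all matching documents.
function scoreThenLimit(docs: string[], term: string, limit: number) {
  return docs
    .map(quote => ({ quote, occurrences: count(quote, term) }))
    .sort((a, b) => b.occurrences - a.occurrences)
    .slice(0, limit);
}
```

With the three "Repeat with me" quotes and a limit of 1, approach (a) returns the single-occurrence quote (first inserted), while approach (b) returns the triple-occurrence one.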
Simple Search with Relevance:
{
elapsed: '244μs',
hits: [
{
id: 'DhQVMo6RI5IMpJyr8C5xf',
quote: 'Repeat with me: alive, alive, alive',
author: 'Oscar Wilde',
occurrences: 3
},
{
id: 'w6dIlijToS9GOaKEG6fBm',
quote: 'Repeat with me: alive, alive',
author: 'Oscar Wilde',
occurrences: 2
},
{
id: 'ME5jTyPK7slIZK9p0tu6u',
quote: 'Repeat with me: alive',
author: 'Oscar Wilde',
occurrences: 1
}
],
count: 3
}
Search with Relevance and Limit set to 1:
{
elapsed: '22μs',
hits: [
{
id: 'ME5jTyPK7slIZK9p0tu6u',
quote: 'Repeat with me: alive',
author: 'Oscar Wilde',
occurrences: 1
}
],
count: 3
}
What do you think about it?
@mateonunez linear search is not a good option; I think we could count the occurrences of a given term in each document during the indexing process and use that information to determine which document contains the term the most.
This is also related to #13, as the exact order search is more important than the occurrences of a term in a document.
I think we could start with this issue and then move on with #13
Going "pseudo-logic":
private async _insert({ doc, id, language }: QueueDocParams): Promise<void> {
const index = this.index;
this.docs.set(id, doc);
function recursiveTrieInsertion(doc: object) {
for (const key in doc) {
if (typeof (doc as any)[key] === "object") {
throw ERRORS.UNSUPPORTED_NESTED_PROPERTIES();
// @todo nested properties will be supported in the next versions
// recursiveTrieInsertion((doc as any)[key]);
} else if (typeof (doc as any)[key] === "string") {
const requestedTrie = index.get(key);
const tokens = tokenize((doc as any)[key], language);
++ const tokenFrequency = getTokenFrequency(tokens); // <-------- new function to count token frequency
for (const token of tokens) {
-- requestedTrie?.insert(token, id);
++ requestedTrie?.insert(token, [id, tokenFrequency[token]]); // <------ just an idea, it doesn't have to be an array at all costs
}
}
}
}
recursiveTrieInsertion(doc);
}
I would evolve around this idea to keep the time complexity as low as possible
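A minimal sketch of the getTokenFrequency helper referenced in the pseudo-logic above (the returned shape is an assumption):

```typescript
// Count how many times each token appears in the tokenized field, so the
// trie can store the frequency alongside the document id at insert time.
function getTokenFrequency(tokens: string[]): Record<string, number> {
  const frequency: Record<string, number> = {};
  for (const token of tokens) {
    frequency[token] = (frequency[token] ?? 0) + 1;
  }
  return frequency;
}
```

With this shape, the insertion loop can pass `tokenFrequency[token]` to the trie for each token, avoiding any per-document linear scan at search time.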
Totally agree. I had thought the best solution was to work directly on the trie-composition step.
I would like to play with your pseudo-logic, and if the result is interesting I might open a PR, removing the "sort" in favour of #13.
@DanieleFedeli yes, tf-idf could be the way to go. When inserting a new document, we do so by creating a queue and indexing one doc at a time. That means that on very large documents we could spend some more time analyzing the tokens using tf-idf or a similar technique (Jaccard index, etc.)
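For reference, a minimal tf-idf sketch (an illustration of the general technique, not Lyra's implementation; assumes non-empty documents):

```typescript
// tf  = term count in the document / total tokens in the document
// idf = log(total documents / documents containing the term)
function tfIdf(term: string, docTokens: string[], allDocs: string[][]): number {
  const tf = docTokens.filter(t => t === term).length / docTokens.length;
  const docsWithTerm = allDocs.filter(d => d.includes(term)).length;
  const idf = Math.log(allDocs.length / (docsWithTerm || 1));
  return tf * idf;
}
```

A document where the term is relatively more frequent scores higher, and terms that appear in every document score zero, which naturally down-weights common words.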
FYI I'm working on a prototype with tf-idf, let's see how it goes
FYI, the work is proceeding on https://github.com/nearform/lyra/pull/63
Should this ticket be closed?