
How to Use Lucene Analyzers

Open · dave2wave opened this issue 1 year ago • 3 comments

Lucene provides a robust set of tools for building search indexes and then querying them for documents. In fact, Jonathan used Lucene's vector similarity as the basis for VectorSearch.

For our use cases, we can take advantage of Lucene's rich set of Analyzers and Filters: https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/package-summary.html
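As a quick illustration of what this looks like in code, here is a minimal sketch of running one of the stock analyzers over a string and consuming the resulting token stream (the field name and sample text are made up for the example):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            // tokenStream() runs the analyzer's full chain (tokenizer + filters)
            TokenStream ts = analyzer.tokenStream("body", "Lucene's analyzers normalise text.");
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                        // required before the first incrementToken()
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
        }
    }
}
```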

Solr makes use of Lucene, and here is Solr's description of how it uses these features: https://solr.apache.org/guide/solr/latest/indexing-guide/document-analysis.html

This page ties Solr configuration back to Lucene and presents a pattern similar to what we are doing in SGA: https://solr.apache.org/guide/solr/latest/indexing-guide/analyzers.html
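For reference, the tokenizer-plus-filters chain that Solr configures in XML can also be built programmatically with Lucene's CustomAnalyzer, using the factories' SPI names. A minimal sketch (the particular chain here is just an example, not a proposed SGA configuration):

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class ChainDemo {
    static Analyzer buildChain() throws IOException {
        // Mirrors a Solr <analyzer> definition: one tokenizer, then token filters in order.
        return CustomAnalyzer.builder()
                .withTokenizer("standard")    // StandardTokenizerFactory
                .addTokenFilter("lowercase")  // LowerCaseFilterFactory
                .addTokenFilter("stop")       // StopFilterFactory (default English stop words)
                .build();
    }
}
```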

dave2wave · Jul 26 '23 22:07

There is a .NET version, if that excites David Dieruf: https://lucenenet.apache.org/docs/4.8.0-beta00016/

dave2wave · Jul 26 '23 23:07

Also, this class may be helpful: https://solr.apache.org/docs/9_3_0/core/org/apache/solr/analysis/TokenizerChain.html
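Going through Solr, a TokenizerChain is assembled from the same Lucene analysis factories. A rough sketch, assuming the two-argument constructor and an empty (mutable) args map for each factory:

```java
import java.util.HashMap;
import org.apache.lucene.analysis.TokenFilterFactory;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.apache.solr.analysis.TokenizerChain;

public class TokenizerChainDemo {
    static TokenizerChain buildChain() {
        // The factories consume (and validate) their args map, so it must be mutable.
        return new TokenizerChain(
                new StandardTokenizerFactory(new HashMap<>()),
                new TokenFilterFactory[] { new LowerCaseFilterFactory(new HashMap<>()) });
    }
}
```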

dave2wave · Jul 26 '23 23:07

@dave2wave

These pointers are helpful. I see that Lucene utilities may help with removing stop words and normalising the text (for example, expanding contractions such as "can't" -> "cannot").
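A small sketch of that idea: run an analyzer over the raw text and join the surviving tokens back into a cleaned string. Here EnglishAnalyzer lowercases, removes English stop words, and stems; note that expanding contractions like "can't" -> "cannot" is not built in and would need something like a custom filter or a synonym mapping:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TextCleaner {
    /** Runs the analyzer's chain over the text and re-joins the surviving tokens. */
    static String normalise(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("text", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end();
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) throws IOException {
        // Stop words are dropped; remaining tokens are lowercased and stemmed.
        System.out.println(normalise(new EnglishAnalyzer(), "The foxes are jumping"));
    }
}
```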

Initially we are targeting LLMs, so the main problem is to prepare the data to pass as "context" to the LLM. The main issue here is that you have to limit the number of "tokens", so you must split big texts into smaller chunks. It is important that we use the same tokenization algorithms that the LLMs use (like https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
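Until a tiktoken-compatible tokenizer is wired in on the Java side, a rough chunking sketch along those lines (the ~4-characters-per-token ratio is only a heuristic for English text; exact budgets need the model's real encoding):

```java
import java.util.ArrayList;
import java.util.List;

public class Chunker {
    /**
     * Splits text into chunks that stay under a token budget, using the rough
     * "~4 characters per token" heuristic for English. A real implementation
     * should count tokens with the model's own encoding (e.g. tiktoken's cl100k_base).
     */
    static List<String> chunkByTokenBudget(String text, int maxTokens) {
        int maxChars = maxTokens * 4;                            // crude chars-per-token estimate
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String sentence : text.split("(?<=[.!?])\\s+")) {   // naive sentence boundaries
            if (current.length() > 0 && current.length() + sentence.length() > maxChars) {
                chunks.add(current.toString().strip());
                current.setLength(0);
            }
            current.append(sentence).append(' ');
        }
        if (current.length() > 0) {
            chunks.add(current.toString().strip());
        }
        return chunks;
    }
}
```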

In the future, we can add more Lucene-based text processing tools to refine and clean documents before sending them to the LLM.

eolivelli · Jul 27 '23 13:07