Accept Callables as Tokenizers for InMemoryDocumentStore

Open masci opened this issue 1 year ago • 3 comments

Discussed in https://github.com/deepset-ai/haystack/discussions/4695

Originally posted by farhanhubble April 18, 2023

InMemoryDocumentStore currently only accepts a tokenizing pattern through the argument bm25_tokenization_regex: str = r"(?u)\b\w\w+\b". The underlying BM25 supports a callable though. Removing this restriction will enable correct tokenization of a larger variety of corpora. I ran into this limitation trying to index JSON documents that contain key-value pairs, like:

```json
"casNumber": "96-80-0",
"concentration": "<= 100 %",
"detailedConcentration": {
  "approximate": false,
  "fromConcentration": 0,
  "fromOperand": "",
  "remainder": false,
  "toConcentration": 100.0,
  "toOperand": "<=",
  "unavailable": false
},
"ecNumber": "202-536-2",
"molecularFormula": "C8H19NO",
"molecularWeight": "145.24 g/mol"
```
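To illustrate the limitation, here is a minimal sketch that applies the default pattern quoted above to two of the fields from the sample. The `whitespace_tokenizer` function is a hypothetical example of the kind of callable one might want to pass instead; it is not part of Haystack.

```python
import re
from typing import List

# Default pattern quoted in the issue text.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

sample = '"casNumber": "96-80-0", "concentration": "<= 100 %"'

# The regex keeps only runs of two or more word characters, so the CAS number
# is split at the hyphens and its single-character last segment is dropped.
regex_tokens = re.findall(DEFAULT_PATTERN, sample)


def whitespace_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer: split on whitespace, strip JSON punctuation."""
    stripped = (tok.strip('",:{}') for tok in text.split())
    return [tok for tok in stripped if tok]


callable_tokens = whitespace_tokenizer(sample)

print(regex_tokens)     # ['casNumber', '96', '80', 'concentration', '100']
print(callable_tokens)  # ['casNumber', '96-80-0', 'concentration', '<=', '100', '%']
```

With the regex, a query for the exact identifier 96-80-0 can only match on the fragments 96 and 80, whereas the callable keeps the identifier as a single token.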

masci avatar Apr 21 '23 08:04 masci

How do we plan to accept callables? By defining a __call__ method in the InMemoryDocumentStore class, or by making a separate tokenization class that defines __call__? Or is another approach suggested?

manulpatel avatar Apr 24 '23 18:04 manulpatel

If we rename bm25_tokenization_regex to something like bm25_tokenize_with to accept either a regex or a callable, it breaks backwards compatibility.

If we introduce a new param it'll muddle up things and we'd need to ensure that both of them don't get used simultaneously.
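For the second option, a rough sketch of what the guard could look like follows. The parameter name `bm25_tokenizer` and the helper `resolve_tokenizer` are hypothetical and only illustrate the idea; they are not Haystack's actual API. The sketch assumes the existing `bm25_tokenization_regex` argument keeps its current default, so existing callers are unaffected.

```python
import re
from typing import Callable, List, Optional

DEFAULT_REGEX = r"(?u)\b\w\w+\b"  # current default, per the issue text

Tokenizer = Callable[[str], List[str]]


def resolve_tokenizer(
    bm25_tokenization_regex: str = DEFAULT_REGEX,
    bm25_tokenizer: Optional[Tokenizer] = None,  # hypothetical new parameter
) -> Tokenizer:
    """Return the tokenizer to use, rejecting ambiguous combinations."""
    if bm25_tokenizer is None:
        # Old behaviour: tokenize with the (possibly customised) regex.
        return re.compile(bm25_tokenization_regex).findall
    if bm25_tokenization_regex != DEFAULT_REGEX:
        # Both a custom regex and a callable were supplied.
        raise ValueError(
            "Pass either bm25_tokenization_regex or bm25_tokenizer, not both."
        )
    if not callable(bm25_tokenizer):
        raise TypeError("bm25_tokenizer must be callable.")
    return bm25_tokenizer


# Existing callers keep working unchanged.
default_tokens = resolve_tokenizer()("ab cd e")
# New callers can pass any callable, e.g. a plain whitespace split.
custom_tokens = resolve_tokenizer(bm25_tokenizer=str.split)("96-80-0 <= 100")
```

One limitation of this sketch: comparing against the default string cannot tell "regex left at default" apart from "default regex passed explicitly", so a sentinel default would be needed if that distinction matters.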

farhanhubble avatar May 12 '23 05:05 farhanhubble

Will try to help with this one :)

CarlosFerLo avatar May 15 '24 20:05 CarlosFerLo