Accept Callables as Tokenizers for InMemoryDocumentStore
Discussed in https://github.com/deepset-ai/haystack/discussions/4695
Originally posted by farhanhubble April 18, 2023
InMemoryDocumentStore currently accepts a tokenizing pattern only through the argument `bm25_tokenization_regex: str = r"(?u)\b\w\w+\b"`. The underlying BM25 implementation supports a callable, though. Removing this restriction would enable correct tokenization of a wider variety of corpora. I ran into this limitation while trying to index JSON documents that contain key-value pairs, like:
```json
"casNumber": "96-80-0",
"concentration": "<= 100 %",
"detailedConcentration": {
"approximate": false,
"fromConcentration": 0,
"fromOperand": "",
"remainder": false,
"toConcentration": 100.0,
"toOperand": "<=",
"unavailable": false
},
"ecNumber": "202-536-2",
"molecularFormula": "C8H19NO",
"molecularWeight": "145.24 g/mol"
```
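To illustrate why a callable helps here: the default regex splits composite values like CAS numbers and drops operator tokens entirely, while a user-supplied callable can keep them intact. A minimal sketch (the `kv_tokenize` function is a hypothetical example, not part of Haystack):

```python
import re

# Current hard-coded default in InMemoryDocumentStore
DEFAULT_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokenize(text):
    """Tokenize the way the current regex default does."""
    return DEFAULT_PATTERN.findall(text)

def kv_tokenize(text):
    """Hypothetical custom tokenizer: strip JSON punctuation but keep
    composite tokens like CAS numbers and operators intact."""
    return re.findall(r'[^\s":,{}\[\]]+', text)

line = '"casNumber": "96-80-0", "toOperand": "<="'
default_tokenize(line)  # splits the CAS number and loses the "<=" operator
kv_tokenize(line)       # preserves "96-80-0" and "<=" as single tokens
```

With a callable-typed argument, `kv_tokenize` could be passed directly to the document store instead of trying to express this behavior as a single regex.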
How do we plan to accept callables? By defining a `__call__` method in the `InMemoryDocumentStore` class, by making a separate tokenizer class that defines `__call__`, or is some other approach suggested?
If we rename `bm25_tokenization_regex` to something like `bm25_tokenize_with` so that it accepts either a regex or a callable, it breaks backwards compatibility.
If we introduce a new parameter instead, it muddles things up, and we'd need to ensure the two aren't used simultaneously.
Will try to help with this one :)