Accept Callables as Tokenizers for InMemoryDocumentStore
Discussed in https://github.com/deepset-ai/haystack/discussions/4695
Originally posted by farhanhubble April 18, 2023
`InMemoryDocumentStore` currently only accepts a tokenizing pattern through the argument `bm25_tokenization_regex: str = r"(?u)\b\w\w+\b"`. The underlying BM25 supports a callable, though. Removing this restriction will enable correct tokenization of a larger variety of corpora. I ran into this limitation trying to index JSON documents that contain key-value pairs, like:

```
"casNumber": "96-80-0",
"concentration": "<= 100 %",
"detailedConcentration": {
    "approximate": false,
    "fromConcentration": 0,
    "fromOperand": "",
    "remainder": false,
    "toConcentration": 100.0,
    "toOperand": "<=",
    "unavailable": false
},
"ecNumber": "202-536-2",
"molecularFormula": "C8H19NO",
"molecularWeight": "145.24 g/mol"
```
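To make the limitation concrete, here is a minimal sketch comparing the default pattern with a custom callable on two lines of the snippet above. The function names and the key-value pattern are hypothetical and only for illustration; they are not part of Haystack.

```python
import re

# Default pattern quoted in the issue: runs of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"


def regex_tokenize(text):
    """Tokenize with the default regex pattern."""
    return re.findall(DEFAULT_PATTERN, text)


def key_value_tokenize(text):
    """Hypothetical callable: split on whitespace and JSON punctuation only,
    so values such as CAS numbers and operators stay intact."""
    return re.findall(r'[^\s":,{}]+', text)


cas_line = '"casNumber": "96-80-0",'
print(regex_tokenize(cas_line))      # ['casNumber', '96', '80']
print(key_value_tokenize(cas_line))  # ['casNumber', '96-80-0']
```

The default pattern splits the CAS number at the hyphens and drops the single-character `0`, so the identifier cannot be matched as one term.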
How do we plan to accept callables? By defining a `__call__` method on the `InMemoryDocumentStore` class, or by making a separate tokenization class that defines `__call__`? Or is some other approach suggested?
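For what it's worth, a rough sketch of the "separate class" option, assuming the store only needs something it can call as `tokenizer(text) -> list of tokens`. `RegexTokenizer`, `whitespace_tokenizer` and `tokenize_corpus` are hypothetical names, not existing Haystack code. Note that plain functions are callables too, so no class is strictly required.

```python
import re
from typing import Callable, List

# Anything that maps a string to a list of tokens.
Tokenizer = Callable[[str], List[str]]


class RegexTokenizer:
    """Hypothetical wrapper that turns a regex pattern into a callable."""

    def __init__(self, pattern: str = r"(?u)\b\w\w+\b") -> None:
        self._findall = re.compile(pattern).findall

    def __call__(self, text: str) -> List[str]:
        return self._findall(text)


def whitespace_tokenizer(text: str) -> List[str]:
    """A plain function works just as well as a class with __call__."""
    return text.split()


def tokenize_corpus(texts: List[str], tokenizer: Tokenizer) -> List[List[str]]:
    """Stand-in for the step where the store tokenizes documents for BM25."""
    return [tokenizer(text) for text in texts]


docs = ["casNumber 96-80-0", "concentration <= 100 %"]
print(tokenize_corpus(docs, RegexTokenizer()))
print(tokenize_corpus(docs, whitespace_tokenizer))
```

With this shape the store itself would not need a `__call__` method; it would just hold whichever callable it was given.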
If we rename `bm25_tokenization_regex` to something like `bm25_tokenize_with` to accept either a regex or a callable, it breaks backwards compatibility.
If we introduce a new parameter, it'll muddle things up, and we'd need to ensure that the two aren't used simultaneously.
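One way the "new parameter" route could look, as a sketch only: keep `bm25_tokenization_regex` and add a second argument, here given the hypothetical name `bm25_tokenizer`, then reject calls that set both. This assumes both arguments default to `None` so that "explicitly set" can be detected, which differs from the current string default. `resolve_tokenizer` is also a made-up helper.

```python
import re
from typing import Callable, List, Optional

DEFAULT_BM25_REGEX = r"(?u)\b\w\w+\b"


def resolve_tokenizer(
    bm25_tokenization_regex: Optional[str] = None,
    bm25_tokenizer: Optional[Callable[[str], List[str]]] = None,  # hypothetical parameter
) -> Callable[[str], List[str]]:
    """Return the callable to use for BM25 tokenization."""
    if bm25_tokenization_regex is not None and bm25_tokenizer is not None:
        raise ValueError(
            "Pass either bm25_tokenization_regex or bm25_tokenizer, not both."
        )
    if bm25_tokenizer is not None:
        if not callable(bm25_tokenizer):
            raise TypeError("bm25_tokenizer must be callable.")
        return bm25_tokenizer
    # Fall back to the regex path, so existing callers keep working unchanged.
    pattern = DEFAULT_BM25_REGEX if bm25_tokenization_regex is None else bm25_tokenization_regex
    return re.compile(pattern).findall


print(resolve_tokenizer()("ab cd e"))                       # ['ab', 'cd']
print(resolve_tokenizer(bm25_tokenizer=str.split)("a b"))   # ['a', 'b']
```

Existing code that passes only a regex string would behave as before, and the ambiguity is caught at construction time rather than silently resolved.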
Will try to help with this one :)