feat: PreProcessor split by token (tiktoken & Hugging Face)
Related Issues
Implements #4983. This allows a user to split documents by token count using the PreProcessor class.
Proposed Changes:
Interface:
The user can choose any of ["token", "word", "sentence", "passage"] as the split_by parameter. To choose the tokenizer, the user passes the string "tiktoken" or any Hugging Face tokenizer (either a model name or a PreTrainedTokenizerBase object) as the new tokenizer parameter (the default is tiktoken).
Just like with word splitting, split_respect_sentence_boundary can be set with split_by="token" and will prevent a sentence from being split into different documents.
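For illustration, here is a minimal usage sketch of the proposed interface (parameter names as described above; the split length and sample text are made up):

```python
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
    split_by="token",
    split_length=128,
    split_respect_sentence_boundary=True,
    tokenizer="tiktoken",  # or e.g. "bert-base-uncased", or a PreTrainedTokenizerBase object
)
docs = preprocessor.process([Document(content="A long text that should be split ...")])
```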
Implementation:
I kept the changes minimal and reused existing code wherever possible. The main addition is the _split_tokens method, which returns a list of tokens (as strings) for a given text and tokenizer. It is called in _split_into_units when sentence boundaries are not respected. I also generalized _split_into_words_respecting_sent_boundary to _split_into_units_respecting_sent_boundary (which is used when sentence boundaries are respected) and added a split_function parameter.
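Conceptually, _split_tokens does something like the following (a rough sketch as a free function, not the exact code from the diff; the tiktoken encoding name is an assumption):

```python
from typing import List, Union

from transformers import AutoTokenizer, PreTrainedTokenizerBase

def _split_tokens(text: str, tokenizer: Union[str, PreTrainedTokenizerBase]) -> List[str]:
    if tokenizer == "tiktoken":
        import tiktoken

        enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
        # Decode each token id individually to get its string representation
        return [enc.decode([token_id]) for token_id in enc.encode(text)]
    if isinstance(tokenizer, str):
        tokenizer = AutoTokenizer.from_pretrained(tokenizer)
    return tokenizer.tokenize(text)
```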
How did you test it?
Added two unit tests in test_preprocessor.py: test_preprocess_tiktoken_token_split and test_preprocess_huggingface_token_split. Both of them test:
(1) that in non-sentence-keeping mode no document has more than split_length tokens, verified by re-splitting each document with the appropriate tokenizer and asserting len(d) <= split_length, and
(2) that in sentence-keeping mode there are no more documents than sentences (see the sketch below).
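The first check boils down to something like this (a sketch; the split length and sample text are illustrative):

```python
import tiktoken

from haystack import Document
from haystack.nodes import PreProcessor

def test_preprocess_tiktoken_token_split():
    enc = tiktoken.get_encoding("cl100k_base")  # assumed to match the PreProcessor's encoding
    preprocessor = PreProcessor(
        split_by="token", split_length=5, split_overlap=0, split_respect_sentence_boundary=False
    )
    docs = preprocessor.process([Document(content="A fairly long text with many words. " * 10)])
    for d in docs:
        # Re-split each chunk with the same tokenizer: no chunk may exceed split_length tokens
        assert len(enc.encode(d.content)) <= 5
```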
Notes for the reviewer
Lossy tokenizers will lead to changes in the text.
Example: splitting a document with the text "This is a document with a long sentence (longer than my split length), it has seventeen words." using bert-base-uncased removes whitespace information. Joining the tokens back together with " ".join(tokens) yields "this is a document with a long sentence ( longer than my split length ) , it has seventeen words .". There seems to be no perfect solution for this, although tokenizers do provide tokenizer.convert_tokens_to_string, which attempts it. To keep the changes minimal for now (and reuse existing methods), I am doing split_at.join(text) with split_at = "" if any(" " in e for e in elements) else " ". (The upside is that when the same tokenizer is used for token splitting and downstream NLP, the tokens will probably be the same both times.)
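To make the lossiness and the join heuristic concrete (a sketch; the tiktoken token boundaries shown are illustrative):

```python
import tiktoken
from transformers import AutoTokenizer

text = "This is a document with a long sentence."

hf_tokens = AutoTokenizer.from_pretrained("bert-base-uncased").tokenize(text)
# -> ['this', 'is', 'a', 'document', 'with', 'a', 'long', 'sentence', '.']

enc = tiktoken.get_encoding("cl100k_base")
tt_tokens = [enc.decode([t]) for t in enc.encode(text)]
# -> ['This', ' is', ' a', ' document', ' with', ' a', ' long', ' sentence', '.']

def rejoin(elements):
    # tiktoken tokens carry their leading whitespace, so join with "";
    # WordPiece tokens do not, so fall back to joining with " "
    split_at = "" if any(" " in e for e in elements) else " "
    return split_at.join(elements)

rejoin(tt_tokens)  # lossless: "This is a document with a long sentence."
rejoin(hf_tokens)  # lossy:    "this is a document with a long sentence ."
```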
The PreProcessor docs should be extended if this gets merged; I can suggest an edit once it is.
Checklist
✓ I have read the contributors guidelines and the code of conduct
✓ I have updated the related issue with new insights and changes
✓ I added unit tests and updated the docstrings
✓ I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
✓ I documented my code
✓ I ran pre-commit hooks and fixed any issue
Hey @benheckmann, the tests are failing because we don't allow HTTP requests in our unit tests. You will need to mock the tokenizer in the unit tests and/or add an integration test that runs against a real tokenizer.
Oh right, sorry @ZanSara! My test loads the tokenizer from Hugging Face's model hub. I will mock it.
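For reference, one way to mock it (a sketch; the patch target depends on where the PreProcessor actually imports AutoTokenizer, and the word-per-token fake is made up):

```python
from unittest.mock import MagicMock, patch

from haystack.nodes import PreProcessor

def test_preprocess_huggingface_token_split_mocked():
    mock_tokenizer = MagicMock()
    # Pretend every whitespace-separated word is one token
    mock_tokenizer.tokenize.side_effect = lambda text: text.split()
    with patch("transformers.AutoTokenizer.from_pretrained", return_value=mock_tokenizer):
        preprocessor = PreProcessor(split_by="token", split_length=5, tokenizer="bert-base-uncased")
        # ... run preprocessor.process(...) and assert on the chunks as above
```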
Pull Request Test Coverage Report for Build 6968825217
Warning: This coverage report may be inaccurate.
We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report. To ensure accuracy in future PRs, please see these guidelines. A quick fix for this PR: rebase it, and your next report should be accurate.
- 0 of 0 changed or added relevant lines in 0 files are covered.
- 3114 unchanged lines in 81 files lost coverage.
- Overall coverage decreased (-7.0%) to 40.167%
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| nodes/file_converter/base.py | 1 | 72.73% |
| environment.py | 2 | 90.38% |
| nodes/prompt/invocation_layer/hugging_face.py | 3 | 94.19% |
| preview/components/audio/__init__.py | 3 | 0.0% |
| preview/components/retrievers/__init__.py | 3 | 0.0% |
| nodes/other/route_documents.py | 4 | 86.84% |
| nodes/preprocessor/base.py | 4 | 81.82% |
| nodes/reader/farm.py | 4 | 40.0% |
| nodes/summarizer/base.py | 4 | 60.0% |
| agents/base.py | 5 | 95.93% |
| Total: | 3114 | |
| Totals | |
|---|---|
| Change from base Build 5811510771: | -7.0% |
| Covered Lines: | 10656 |
| Relevant Lines: | 26529 |
💛 - Coveralls
@ZanSara Fixed the checks.
@ZanSara Would be great if you could take a look :)
@ZanSara Fixed the lossy tokenizer behavior by using the return_offsets_mapping kwarg and splitting the text at the offsets. I didn't know about this kwarg when I first created this PR, but with it, HF tokenizers now also split as expected 👍🏼
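For reference, this is roughly how return_offsets_mapping makes the split lossless (a sketch; it requires a fast tokenizer, and the exact slicing in the PR may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "This is a Document."

encoded = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
starts = [start for start, _ in encoded["offset_mapping"]]
# Slice from each token's start to the next token's start, so whitespace
# and casing are preserved and "".join(chunks) reconstructs the text exactly
chunks = [text[a:b] for a, b in zip(starts, starts[1:] + [len(text)])]
assert "".join(chunks) == text
```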
@benheckmann You will also need to take one extra small step and use Reno to document your changes. No worries, it's really simple: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes I can also help if you have trouble with it.