fastembed
feat: add public tokenize method for TextEmbedding class
Expose tokenize method instead of using embedding_model.model.tokenize.
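A minimal sketch of the access-pattern change this PR proposes. The class and attribute names mirror fastembed's `TextEmbedding` wrapping an internal model object, but both classes here are illustrative stand-ins, not the real implementation:

```python
class _OnnxModel:
    """Stand-in for the internal model object that owns the tokenizer."""

    def tokenize(self, documents: list[str]) -> list[list[str]]:
        # Trivial whitespace tokenizer, used only for illustration.
        return [doc.split() for doc in documents]


class TextEmbedding:
    """Stand-in for fastembed's TextEmbedding wrapper."""

    def __init__(self) -> None:
        self.model = _OnnxModel()

    def tokenize(self, documents: list[str]) -> list[list[str]]:
        # New public method: delegate to the underlying model's tokenizer
        # instead of forcing callers to reach into `self.model`.
        return self.model.tokenize(documents)


embedding_model = TextEmbedding()

# Before: callers had to reach through the internal attribute.
old_style = embedding_model.model.tokenize(["hello world"])

# After: the wrapper exposes tokenize directly.
new_style = embedding_model.tokenize(["hello world"])

assert old_style == new_style == [["hello", "world"]]
```

The public method keeps the internal model object an implementation detail, so its attribute layout can change without breaking callers.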
All Submissions:
- [x] Have you followed the guidelines in our Contributing document?
- [x] Have you checked to ensure there aren't other open Pull Requests for the same update/change?
New Feature Submissions:
- [x] Does your submission pass the existing tests?
- [ ] Have you added tests for your feature?
- [x] Have you installed `pre-commit` with `pip3 install pre-commit` and set up hooks with `pre-commit install`?
New models submission:
- [x] Have you added an explanation of why it's important to include this model?
- [ ] Have you added tests for the new model? Were canonical values for tests computed via the original model?
- [ ] Have you added the code snippet for how canonical values were computed?
- [x] Have you successfully run tests with your changes locally?
Walkthrough
This PR adds a public tokenize API across multiple embedding classes and their bases. Base classes (`TextEmbeddingBase`, `LateInteractionTextEmbeddingBase`, `LateInteractionMultimodalEmbeddingBase`, `SparseTextEmbeddingBase`) receive a `tokenize` method (raising `NotImplementedError`). Dense, late-interaction, and ONNX implementations delegate to the underlying model tokenizers and return `list[Encoding]`. Sparse implementations add `tokenize` stubs typed as returning `dict[str, Any]` that currently raise `NotImplementedError`. Several model-level methods were renamed/redispatched (e.g., public `tokenize` → private `_tokenize` in some ONNX/multimodal modules). Tests for `tokenize` were added for multiple model types.
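The hierarchy described above can be sketched as follows. Class and method names follow the walkthrough, but the bodies are illustrative stand-ins (including the fake tokenizer), not fastembed's real code:

```python
from typing import Any


class TextEmbeddingBase:
    """Base class: tokenize is part of the interface but not implemented."""

    def tokenize(self, documents: list[str]) -> list[Any]:
        raise NotImplementedError("Subclasses must implement tokenize")


class _FakeTokenizer:
    """Stand-in for a HuggingFace-style tokenizer with encode_batch."""

    def encode_batch(self, documents: list[str]) -> list[list[str]]:
        return [doc.split() for doc in documents]


class OnnxTextEmbedding(TextEmbeddingBase):
    """Dense implementation: delegates tokenize to the underlying tokenizer."""

    def __init__(self) -> None:
        self.tokenizer = _FakeTokenizer()

    def tokenize(self, documents: list[str]) -> list[Any]:
        # In fastembed this would return list[Encoding]; here a list of
        # token lists stands in for the real Encoding objects.
        return self.tokenizer.encode_batch(documents)


class SparseTextEmbeddingBase(TextEmbeddingBase):
    """Sparse stub: typed to return dict[str, Any] but not yet implemented."""

    def tokenize(self, documents: list[str]) -> dict[str, Any]:
        raise NotImplementedError("tokenize is not implemented for sparse models")


dense = OnnxTextEmbedding()
assert dense.tokenize(["hello world"]) == [["hello", "world"]]
```

Defining the method on the base class keeps the interface uniform across model families even while some implementations remain stubs.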
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
- Areas needing extra attention:
  - ColBERT refactor: removal of `is_doc` from the public signature and addition of a `_tokenize` delegator.
  - Consistency of return types between dense (`list[Encoding]`) and sparse (`dict[str, Any]`) tokenizers and their callers.
  - Visibility changes: public `tokenize` → private `_tokenize` in ONNX/multimodal modules and updated call sites.
  - Tests added for tokenization behavior (validate CI cleanup and model-specific assumptions).
  - New `NotImplementedError` stubs in sparse classes (ensure callers expect/handle this).
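The point about the sparse stubs can be made concrete: since sparse `tokenize` currently raises `NotImplementedError`, callers should guard for it. A minimal hedged sketch, where the class is a stand-in for any sparse model wrapper:

```python
class SparseTextEmbedding:
    """Stand-in for a sparse model whose tokenize is still a stub."""

    def tokenize(self, documents: list[str]) -> dict:
        raise NotImplementedError("tokenize is not implemented for sparse models")


model = SparseTextEmbedding()
try:
    model.tokenize(["hello"])
    supported = True
except NotImplementedError:
    # Callers that work across model families must handle the stub case.
    supported = False

assert supported is False
```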
Possibly related PRs
- qdrant/fastembed#513: Modifies the MiniCOIL implementation; relates to this PR, which added a `tokenize` stub to `minicoil.py`.
Suggested reviewers
- generall
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.81%, which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main feature being added: a public `tokenize` method exposed on the `TextEmbedding` class instead of requiring access to the underlying model. |
| Description check | ✅ Passed | The description clearly states the purpose of the PR: exposing a public `tokenize` method on the `TextEmbedding` class rather than using `embedding_model.model.tokenize`, and includes relevant submission checklist items. |
✨ Finishing touches
- [ ] 📝 Generate docstrings
🧪 Generate unit tests (beta)
- [ ] Create PR with unit tests
- [ ] Post copyable unit tests in a comment
- [ ] Commit unit tests in branch `feat/expose-tokenize-method`
We've decided not to expose `tokenize`, but just provide `token_count`.
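A hedged sketch of what a `token_count` helper built on a private tokenizer might look like. The class, the `_tokenize` helper, and the whitespace tokenizer are all illustrative assumptions, not fastembed's actual API:

```python
class TextEmbedding:
    """Stand-in wrapper with a private tokenizer and a public token_count."""

    def _tokenize(self, documents: list[str]) -> list[list[str]]:
        # Illustrative whitespace tokenizer; real models use subword tokenizers,
        # so real counts would differ from simple word counts.
        return [doc.split() for doc in documents]

    def token_count(self, documents: list[str]) -> list[int]:
        # The public API returns only the counts, keeping the tokenizer
        # and its Encoding objects an internal implementation detail.
        return [len(tokens) for tokens in self._tokenize(documents)]


model = TextEmbedding()
counts = model.token_count(["hello world", "one two three"])
assert counts == [2, 3]
```

Returning counts instead of encodings gives callers the common use case (budgeting input length) without committing the library to a public tokenizer contract.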