
feat: add public tokenize method for TextEmbedding class

Open dancixx opened this issue 1 month ago • 2 comments

Expose tokenize method instead of using embedding_model.model.tokenize.
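
A minimal sketch of the usage change being requested. The model name is just fastembed's default dense model, whether model.model.tokenize works on a given version is an assumption based on the description above, and the public method was ultimately not merged (see the last comment in this thread):

```python
from fastembed import TextEmbedding

model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Today: callers have to reach into the private underlying model object.
encodings = model.model.tokenize(["hello world"])

# Proposed in this PR: a public method on TextEmbedding that delegates
# to the same underlying tokenizer and returns list[Encoding].
encodings = model.tokenize(["hello world"])
```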

All Submissions:

  • [x] Have you followed the guidelines in our Contributing document?
  • [x] Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  • [x] Does your submission pass the existing tests?
  • [ ] Have you added tests for your feature?
  • [x] Have you installed pre-commit with pip3 install pre-commit and set up hooks with pre-commit install?

New models submission:

  • [x] Have you added an explanation of why it's important to include this model?
  • [ ] Have you added tests for the new model? Were canonical values for tests computed via the original model?
  • [ ] Have you added the code snippet for how canonical values were computed?
  • [x] Have you successfully run tests with your changes locally?

dancixx · Oct 21 '25 15:10

๐Ÿ“ Walkthrough

Walkthrough

This PR adds a public tokenize API across multiple embedding classes and their bases. Base classes (TextEmbeddingBase, LateInteractionTextEmbeddingBase, LateInteractionMultimodalEmbeddingBase, SparseTextEmbeddingBase) receive a tokenize method that raises NotImplementedError. Dense, late-interaction, and ONNX implementations delegate to the underlying model tokenizers and return list[Encoding]. Sparse implementations add tokenize stubs typed as returning dict[str, Any] that currently raise NotImplementedError. Several model-level methods were renamed or redispatched (e.g., public tokenize → private _tokenize in some ONNX/multimodal modules). Tests for tokenize were added for multiple model types.
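
An illustrative sketch of the delegation pattern described above. The class names come from the walkthrough, but the tokenizer attribute and the encode_batch call are assumptions based on the HuggingFace tokenizers API rather than fastembed's actual internals:

```python
from typing import Any

from tokenizers import Encoding


class TextEmbeddingBase:
    """Illustrative base class: per the walkthrough, bases raise NotImplementedError."""

    def tokenize(self, documents: list[str]) -> list[Encoding]:
        raise NotImplementedError("Subclasses must implement tokenize")


class OnnxTextEmbedding(TextEmbeddingBase):
    """Illustrative dense implementation that delegates to its tokenizer."""

    def __init__(self, tokenizer: Any) -> None:
        # Assumed to be a HuggingFace tokenizers.Tokenizer-like object.
        self.tokenizer = tokenizer

    def tokenize(self, documents: list[str]) -> list[Encoding]:
        # encode_batch returns one Encoding per input document.
        return self.tokenizer.encode_batch(documents)
```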

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas needing extra attention:
    • ColBERT refactor: removal of is_doc from the public signature and addition of a _tokenize delegator.
    • Consistency of return types between dense (list[Encoding]) and sparse (dict[str, Any]) tokenizers and their callers (see the sketch after this list).
    • Visibility changes: public tokenize → private _tokenize in ONNX / multimodal modules, with call sites updated.
    • Tests added for tokenization behavior (validate CI cleanup and model-specific assumptions).
    • New NotImplementedError stubs in sparse classes (ensure callers expect/handle this).
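
A sketch of how a caller might handle the diverging return types and the sparse NotImplementedError stubs flagged above. These are hypothetical caller-side helpers, not code from this PR:

```python
from typing import Any, Optional

from tokenizers import Encoding


def tokenize_dense(model: Any, documents: list[str]) -> list[Encoding]:
    # Dense / late-interaction models: concrete Encoding objects come back.
    return model.tokenize(documents)


def tokenize_sparse_safe(model: Any, documents: list[str]) -> Optional[dict[str, Any]]:
    # Sparse models declare dict[str, Any], but the current stubs raise
    # NotImplementedError, so a caller handling both kinds should guard for it.
    try:
        return model.tokenize(documents)
    except NotImplementedError:
        return None
```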

Possibly related PRs

  • qdrant/fastembed#513: modifies the MiniCOIL implementation; relates to this PR, which adds a tokenize stub to minicoil.py.

Suggested reviewers

  • generall

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 22.81%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title accurately describes the main feature being added: a public tokenize method exposed on the TextEmbedding class instead of requiring access to the underlying model.
  • Description check (✅ Passed): The description clearly states the purpose of the PR: exposing a public tokenize method on the TextEmbedding class rather than using embedding_model.model.tokenize, and includes the relevant submission checklist items.

✨ Finishing touches
  • [ ] 📝 Generate docstrings

🧪 Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment
  • [ ] Commit unit tests in branch feat/expose-tokenize-method

coderabbitai[bot] · Oct 21 '25 15:10

We've decided not to expose tokenize; instead, we'll just provide token_count.

joein · Nov 10 '25 17:11
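
The token_count API mentioned in the final comment is not shown in this thread, so the helper below is only a hypothetical sketch built on the workaround from the PR description; its name, signature, and reliance on model.model.tokenize are assumptions, not fastembed's actual API:

```python
from fastembed import TextEmbedding


def token_count(model: TextEmbedding, documents: list[str]) -> list[int]:
    # Hypothetical helper: the real token_count API may look different.
    # Assumes model.model.tokenize returns tokenizers Encoding objects,
    # as in the embedding_model.model.tokenize workaround above.
    return [len(encoding.ids) for encoding in model.model.tokenize(documents)]


if __name__ == "__main__":
    model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
    print(token_count(model, ["hello world"]))
```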