
feat: add public tokenize method for TextEmbedding class

Open dancixx opened this issue 1 month ago • 2 comments

Expose tokenize method instead of using embedding_model.model.tokenize.
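
A minimal sketch of the usage change being requested. The model name is just fastembed's default dense model, whether model.model.tokenize works on a given version is an assumption based on the description above, and the public method was ultimately not merged (see the last comment in this thread):

```python
from fastembed import TextEmbedding

model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Today: callers have to reach into the private underlying model object.
encodings = model.model.tokenize(["hello world"])

# Proposed in this PR: a public method on TextEmbedding that delegates
# to the same underlying tokenizer and returns list[Encoding].
encodings = model.tokenize(["hello world"])
```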

All Submissions:

  • [x] Have you followed the guidelines in our Contributing document?
  • [x] Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  • [x] Does your submission pass the existing tests?
  • [ ] Have you added tests for your feature?
  • [x] Have you installed pre-commit with pip3 install pre-commit and set up hooks with pre-commit install?

New models submission:

  • [x] Have you added an explanation of why it's important to include this model?
  • [ ] Have you added tests for the new model? Were canonical values for tests computed via the original model?
  • [ ] Have you added the code snippet for how canonical values were computed?
  • [x] Have you successfully run tests with your changes locally?

dancixx · Oct 21 '25 15:10

๐Ÿ“ Walkthrough

Walkthrough

This PR adds a public tokenize API across multiple embedding classes and their bases. Base classes (TextEmbeddingBase, LateInteractionTextEmbeddingBase, LateInteractionMultimodalEmbeddingBase, SparseTextEmbeddingBase) receive a tokenize method that raises NotImplementedError. Dense, late-interaction, and ONNX implementations delegate to the underlying model tokenizers and return list[Encoding]. Sparse implementations add tokenize stubs typed as returning dict[str, Any] that currently raise NotImplementedError. Several model-level methods were renamed or redispatched (e.g., public tokenize → private _tokenize in some ONNX/multimodal modules). Tests for tokenize were added for multiple model types.
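
An illustrative sketch of the delegation pattern described above. The class names come from the walkthrough, but the tokenizer attribute and the encode_batch call are assumptions based on the HuggingFace tokenizers API rather than fastembed's actual internals:

```python
from typing import Any

from tokenizers import Encoding


class TextEmbeddingBase:
    """Illustrative base class: per the walkthrough, bases raise NotImplementedError."""

    def tokenize(self, documents: list[str]) -> list[Encoding]:
        raise NotImplementedError("Subclasses must implement tokenize")


class OnnxTextEmbedding(TextEmbeddingBase):
    """Illustrative dense implementation that delegates to its tokenizer."""

    def __init__(self, tokenizer: Any) -> None:
        # Assumed to be a HuggingFace tokenizers.Tokenizer-like object.
        self.tokenizer = tokenizer

    def tokenize(self, documents: list[str]) -> list[Encoding]:
        # encode_batch returns one Encoding per input document.
        return self.tokenizer.encode_batch(documents)
```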

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas needing extra attention:
    • ColBERT refactor: removal of is_doc from the public signature and addition of a _tokenize delegator.
    • Consistency of return types between dense (list[Encoding]) and sparse (dict[str, Any]) tokenizers and their callers (see the sketch after this list).
    • Visibility changes: public tokenize → private _tokenize in ONNX / multimodal modules, with call sites updated.
    • Tests added for tokenization behavior (validate CI cleanup and model-specific assumptions).
    • New NotImplementedError stubs in sparse classes (ensure callers expect/handle this).
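
A sketch of how a caller might handle the diverging return types and the sparse NotImplementedError stubs flagged above. These are hypothetical caller-side helpers, not code from this PR:

```python
from typing import Any, Optional

from tokenizers import Encoding


def tokenize_dense(model: Any, documents: list[str]) -> list[Encoding]:
    # Dense / late-interaction models: concrete Encoding objects come back.
    return model.tokenize(documents)


def tokenize_sparse_safe(model: Any, documents: list[str]) -> Optional[dict[str, Any]]:
    # Sparse models declare dict[str, Any], but the current stubs raise
    # NotImplementedError, so a caller handling both kinds should guard for it.
    try:
        return model.tokenize(documents)
    except NotImplementedError:
        return None
```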

Possibly related PRs

  • qdrant/fastembed#513: modifies the MiniCOIL implementation; relates to this PR, which adds a tokenize stub to minicoil.py.

Suggested reviewers

  • generall

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 22.81%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title accurately describes the main feature being added: a public tokenize method exposed on the TextEmbedding class instead of requiring access to the underlying model.
  • Description check (✅ Passed): The description clearly states the purpose of the PR: exposing a public tokenize method on the TextEmbedding class rather than using embedding_model.model.tokenize, and includes the relevant submission checklist items.

✨ Finishing touches
  • [ ] 📝 Generate docstrings

🧪 Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment
  • [ ] Commit unit tests in branch feat/expose-tokenize-method

coderabbitai[bot] · Oct 21 '25 15:10

We've decided not to expose tokenize; instead, we'll just provide token_count.

joein · Nov 10 '25 17:11
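
The token_count API mentioned in the final comment is not shown in this thread, so the helper below is only a hypothetical sketch built on the workaround from the PR description; its name, signature, and reliance on model.model.tokenize are assumptions, not fastembed's actual API:

```python
from fastembed import TextEmbedding


def token_count(model: TextEmbedding, documents: list[str]) -> list[int]:
    # Hypothetical helper: the real token_count API may look different.
    # Assumes model.model.tokenize returns tokenizers Encoding objects,
    # as in the embedding_model.model.tokenize workaround above.
    return [len(encoding.ids) for encoding in model.model.tokenize(documents)]


if __name__ == "__main__":
    model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
    print(token_count(model, ["hello world"]))
```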