improvement: remove redundant lowercase and punctuation checks
The _stem method performs unnecessary checks that are already handled upstream:
- Punctuation check: remove_non_alphanumeric() already removes all punctuation
- Lowercase conversion: SimpleTokenizer.tokenize() already converts to lowercase
These redundant operations can be safely removed without affecting functionality.
All Submissions:
- [x] Have you followed the guidelines in our Contributing document?
- [x] Have you checked to ensure there aren't other open Pull Requests for the same update/change?
New Feature Submissions:
- [x] Does your submission pass the existing tests?
- [x] Have you added tests for your feature?
- [x] Have you installed pre-commit with pip3 install pre-commit and set up hooks with pre-commit install?
New models submission:
- [x] Have you added an explanation of why it's important to include this model?
- [x] Have you added tests for the new model? Were canonical values for tests computed via the original model?
- [x] Have you added the code snippet for how canonical values were computed?
- [x] Have you successfully run tests with your changes locally?
📝 Walkthrough
The _stem method within the Bm25 class was updated to streamline its token processing logic. The revised implementation eliminates the explicit conversion of tokens to lowercase and the check for punctuation, assuming that tokens are already in lowercase and do not require punctuation filtering. The method now filters tokens solely based on their presence in a stopwords set, while retaining the existing logic for token length checks and stemming. No changes were made to the signatures of any exported or public entities.
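To make the described behavior concrete, here is a minimal sketch of what the simplified filtering loop might look like. It is not the actual Bm25._stem code: the function name stem_tokens, the token_max_length parameter, the direction of the length check, and the toy stemmer are all assumptions made for illustration.

```python
from typing import Callable, Iterable


def stem_tokens(
    tokens: Iterable[str],
    stopwords: set[str],
    stemmer: Callable[[str], str],
    token_max_length: int = 40,  # hypothetical default, not taken from the PR
) -> list[str]:
    """Hypothetical stand-in for the simplified _stem logic."""
    stemmed: list[str] = []
    for token in tokens:
        # No .lower() and no punctuation check: both are assumed to have
        # been handled upstream, as the PR description states.
        if token in stopwords:
            continue
        # Assumed form of the length check: skip overly long tokens.
        if len(token) > token_max_length:
            continue
        stemmed_token = stemmer(token)
        if stemmed_token:
            stemmed.append(stemmed_token)
    return stemmed


def toy_stemmer(word: str) -> str:
    # Toy stemmer for the example only: strips a trailing "s".
    return word[:-1] if word.endswith("s") else word


result = stem_tokens(["the", "cats", "sleep"], {"the"}, toy_stemmer)
print(result)
```

With the punctuation check gone, any punctuation-only token that survives upstream cleaning passes straight through this loop, so the change is only safe if the upstream step really removes every such token.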
@coderabbitai Wait, I might've found a small bug. At least in TypeScript, my regex does not filter out _ (underscore). Could you double-check whether that is handled correctly here?
🧩 Analysis chain
@dejoma You're absolutely right to be concerned about this! Let me investigate the regex pattern being used.
Based on my analysis of the code, I can see that the remove_non_alphanumeric() function is being used in the tokenization process. The most common regex patterns for "alphanumeric" filtering are:
- [^a-zA-Z0-9]: removes everything except letters and numbers (underscore would be removed)
- [^a-zA-Z0-9_]: keeps letters, numbers, and underscores (underscore would be kept)
- [^\w]: keeps word characters including underscore (underscore would be kept)
From the test I ran, if the regex uses [^a-zA-Z0-9], then underscores would be filtered out. However, if it uses [^\w] or explicitly includes _, then underscores would be kept.
You mentioned that in TypeScript your regex doesn't filter out underscores, which suggests your TypeScript version might be using a pattern that includes underscores (like [^\w]), while the Python version might be using the stricter [^a-zA-Z0-9] pattern.
This could indeed be a bug if the expectation is that underscores should be filtered out consistently across implementations. Could you share what the expected behavior should be for underscores in this context? Should they be removed as punctuation or kept as valid characters?
🏁 Script executed:
#!/bin/bash
# First, let's look at the current implementation of the bm25.py file
cat fastembed/sparse/bm25.py
Length of output: 12112
🏁 Script executed:
#!/bin/bash
# Let's search for the remove_non_alphanumeric function definition
rg -A 10 "def remove_non_alphanumeric" fastembed/
Length of output: 209
🏁 Script executed:
#!/bin/bash
# Let's search for any regex patterns that might be used for filtering
rg -A 5 -B 5 "re\.|regex|pattern" fastembed/sparse/bm25.py
Length of output: 917
🏁 Script executed:
#!/bin/bash
# Let's also search for remove_non_alphanumeric usage
rg -A 5 -B 5 "remove_non_alphanumeric" fastembed/sparse/bm25.py
Length of output: 1301
🏁 Script executed:
#!/bin/bash
# Let's search more broadly for the function definition
rg -A 10 "remove_non_alphanumeric"
Length of output: 4223
🏁 Script executed:
#!/bin/bash
# Let's test the regex pattern commonly used for alphanumeric filtering in Python
python3 -c "
import re
# Test common alphanumeric regex patterns
test_text = 'hello_world123!@#$%^&*()'
print('Original text:', test_text)
# Common pattern 1: Only keep alphanumeric
pattern1 = re.compile(r'[^a-zA-Z0-9]')
result1 = pattern1.sub('', test_text)
print('Pattern [^a-zA-Z0-9] result:', result1)
# Common pattern 2: Keep alphanumeric and underscore
pattern2 = re.compile(r'[^a-zA-Z0-9_]')
result2 = pattern2.sub('', test_text)
print('Pattern [^a-zA-Z0-9_] result:', result2)
# Common pattern 3: \w (word characters - includes underscore)
pattern3 = re.compile(r'[^\w]')
result3 = pattern3.sub('', test_text)
print('Pattern [^\w] result:', result3)
# Test specifically with underscore
underscore_test = 'test_underscore_123'
print('\nTesting with underscore text:', underscore_test)
print('Pattern [^a-zA-Z0-9] result:', re.sub(r'[^a-zA-Z0-9]', '', underscore_test))
print('Pattern [^a-zA-Z0-9_] result:', re.sub(r'[^a-zA-Z0-9_]', '', underscore_test))
print('Pattern [^\w] result:', re.sub(r'[^\w]', '', underscore_test))
"
Length of output: 1526
@dejoma Great catch! You're absolutely right to be concerned about this inconsistency.
I found the issue! The remove_non_alphanumeric function in fastembed/common/utils.py uses this regex pattern:
def remove_non_alphanumeric(text: str) -> str:
    return re.sub(r"[^\w\s]", " ", text, flags=re.UNICODE)
The pattern [^\w\s] means "replace everything that is NOT a word character or whitespace with a space". In Python's regex, \w includes:
- Letters (a-z, A-Z, plus non-ASCII letters when matching Unicode strings)
- Digits (0-9, plus other Unicode decimal digits)
- Underscores (_)
So underscores are NOT being filtered out in the current Python implementation. My test confirms this:
- hello_world123!@#$%^&*() becomes hello_world123 (underscore preserved)
- test_underscore_123 stays test_underscore_123 (underscores preserved)
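What matters for this PR is what happens to a standalone underscore. The sketch below reuses the pattern quoted above and uses a plain whitespace split as a stand-in for the tokenizer, which is an assumption and not necessarily how SimpleTokenizer.tokenize() behaves.

```python
import re


def remove_non_alphanumeric(text: str) -> str:
    # Same pattern as quoted above: replace anything that is not a word
    # character or whitespace with a space.
    return re.sub(r"[^\w\s]", " ", text, flags=re.UNICODE)


cleaned = remove_non_alphanumeric("snake_case, foo _ bar!")

# Whitespace split stands in for the real tokenizer (assumption).
tokens = cleaned.split()
print(tokens)
```

The comma and exclamation mark are replaced, but both the underscore inside snake_case and the bare _ token remain, so a downstream punctuation check would still have something to catch.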
This matches what you observed with your TypeScript regex: the underscore survives the filter. It also means the PR's premise that remove_non_alphanumeric() removes all punctuation does not hold for underscores. If the expected behavior is to treat underscores as punctuation that should be removed, then the regex could be changed to:
def remove_non_alphanumeric(text: str) -> str:
    return re.sub(r"[^a-zA-Z0-9\s]", " ", text, flags=re.UNICODE)
This would filter out underscores along with other punctuation.
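One caveat with the ASCII-only character class: it also replaces non-ASCII letters, which the current \w-based pattern keeps. A possible alternative, shown here only as a sketch and not as the project's fix, is to keep \w and add the underscore as an extra alternative. The function name remove_non_alphanumeric_strict is hypothetical.

```python
import re


def remove_non_alphanumeric_strict(text: str) -> str:
    # Hypothetical variant: replace non-word, non-space characters
    # and, additionally, the underscore itself.
    return re.sub(r"[^\w\s]|_", " ", text)


sample = "caf\u00e9_au lait"  # "café_au lait"

# ASCII-only class: the underscore goes, but so does the accented letter.
ascii_only = re.sub(r"[^a-zA-Z0-9\s]", " ", sample)

# Unicode-aware variant: the underscore goes, the accented letter stays.
unicode_aware = remove_non_alphanumeric_strict(sample)

print(repr(ascii_only))
print(repr(unicode_aware))
```

Which variant is appropriate depends on whether the model is expected to handle non-English text, which is a decision for the maintainers.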
Should I help you create an issue to track this bug, or would you like to address it in this current PR?