semantic-router
semantic-router copied to clipboard
feat: Add SpaCy as a pre-splitter for Rolling Window - Fix #193
- add SpaCy as an option for pre-splitting;
- add check for "en_core_web_sm" pipeline: if not present, install
hey @klein-t — thanks for the PR this is a great idea, I have just some small feedback points on making
spacy
an optional dependency before we merge — lmk what you think, thanks!
Hey @jamescalam, all made sense. I implemented the changes. Let me know if all is good, I was really short on time so I did things in the quickest way I could think of. All is tested, but still, let me know how you feel about it.
Cheers,
Klein
Hey @klein-t please fix the poetry lock issue and add the tests that you mentioned to the test_splitters.py file.
Additionally, some tests that can be added:
from semantic_router.splitters.rolling_window import RollingWindowSplitter
def test_split_documents_rolling_window_splitter():
# Mock the BaseEncoder
mock_encoder = Mock()
# Simulate encoding by returning an array of vectors
mock_encoder.return_value = np.array(
[[0.2, 0.8], [0.2, 0.8], [1, 0], [0, 1], [0, 0], [0.2, 0.8]]
)
cohere_encoder = CohereEncoder(
name="",
cohere_api_key="a",
input_type="",
)
test_doc = [
"""
The ancient oak tree. The tree is a silent witness to centuries, stood majestically at the crest of the rolling hill, its leaves whispering the secrets of a bygone era.
The quick brown fox jumps over the lazy dog.
Innovative technologies in renewable energy are revolutionizing the way we power our cities, significantly reducing the carbon footprint and fostering a sustainable future.
The art of sushi-making demands precision and patience, with each roll a delicate balance of flavors, textures, and aesthetics, reflecting the culinary heritage of Japan.
"""
]
splitter = RollingWindowSplitter(
encoder=cohere_encoder, window_size=5, min_split_tokens=1
)
splitter.encoder = mock_encoder
splits = splitter(test_doc)
print(splits)
assert len(splits) == 3
def test_split_documents_rolling_window_splitter_with_spacy():
# Mock the BaseEncoder
mock_encoder = Mock()
# Simulate encoding by returning an array of vectors
mock_encoder.return_value = np.array(
[[0.2, 0.8], [0.2, 0.8], [1, 0], [0, 1], [0, 0], [0.2, 0.8]]
)
cohere_encoder = CohereEncoder(
name="",
cohere_api_key="a",
input_type="",
)
test_doc = [
"""
The ancient oak tree. The tree is a silent witness to centuries, stood majestically at the crest of the rolling hill, its leaves whispering the secrets of a bygone era.
The quick brown fox jumps over the lazy dog.
Innovative technologies in renewable energy are revolutionizing the way we power our cities, significantly reducing the carbon footprint and fostering a sustainable future.
The art of sushi-making demands precision and patience, with each roll a delicate balance of flavors, textures, and aesthetics, reflecting the culinary heritage of Japan.
"""
]
splitter = RollingWindowSplitter(
encoder=cohere_encoder, window_size=5, min_split_tokens=1, pre_splitter="spacy"
)
splitter.encoder = mock_encoder
splits = splitter(test_doc)
print(splits)
assert len(splits) == 3
Besides that, everything works correctly.
Hey @klein-t please fix the poetry lock issue and add the tests that you mentioned to the test_splitters.py file.
Additionally, some tests that can be added:
from semantic_router.splitters.rolling_window import RollingWindowSplitter def test_split_documents_rolling_window_splitter(): # Mock the BaseEncoder mock_encoder = Mock() # Simulate encoding by returning an array of vectors mock_encoder.return_value = np.array( [[0.2, 0.8], [0.2, 0.8], [1, 0], [0, 1], [0, 0], [0.2, 0.8]] ) cohere_encoder = CohereEncoder( name="", cohere_api_key="a", input_type="", ) test_doc = [ """ The ancient oak tree. The tree is a silent witness to centuries, stood majestically at the crest of the rolling hill, its leaves whispering the secrets of a bygone era. The quick brown fox jumps over the lazy dog. Innovative technologies in renewable energy are revolutionizing the way we power our cities, significantly reducing the carbon footprint and fostering a sustainable future. The art of sushi-making demands precision and patience, with each roll a delicate balance of flavors, textures, and aesthetics, reflecting the culinary heritage of Japan. """ ] splitter = RollingWindowSplitter( encoder=cohere_encoder, window_size=5, min_split_tokens=1 ) splitter.encoder = mock_encoder splits = splitter(test_doc) print(splits) assert len(splits) == 3 def test_split_documents_rolling_window_splitter_with_spacy(): # Mock the BaseEncoder mock_encoder = Mock() # Simulate encoding by returning an array of vectors mock_encoder.return_value = np.array( [[0.2, 0.8], [0.2, 0.8], [1, 0], [0, 1], [0, 0], [0.2, 0.8]] ) cohere_encoder = CohereEncoder( name="", cohere_api_key="a", input_type="", ) test_doc = [ """ The ancient oak tree. The tree is a silent witness to centuries, stood majestically at the crest of the rolling hill, its leaves whispering the secrets of a bygone era. The quick brown fox jumps over the lazy dog. Innovative technologies in renewable energy are revolutionizing the way we power our cities, significantly reducing the carbon footprint and fostering a sustainable future. The art of sushi-making demands precision and patience, with each roll a delicate balance of flavors, textures, and aesthetics, reflecting the culinary heritage of Japan. """ ] splitter = RollingWindowSplitter( encoder=cohere_encoder, window_size=5, min_split_tokens=1, pre_splitter="spacy" ) splitter.encoder = mock_encoder splits = splitter(test_doc) print(splits) assert len(splits) == 3
Besides that, everything works correctly.
Hey @mesax1, thanks. I will do it later today/tomorrow!