feat: Add SpaCy as a pre-splitter for Rolling Window - Fix #193

Open klein-t opened this issue 11 months ago • 3 comments

  • add spaCy as an option for pre-splitting;
  • add a check for the "en_core_web_sm" pipeline: if it is not present, install it.
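
A rough sketch of what the two bullets above describe, with spaCy treated as an optional dependency. This is not the code from the PR: the helper names (`spacy_model_available`, `ensure_spacy_model`, `regex_split`, `pre_split`) and the `"regex"` fallback value are assumptions made for illustration. Only `spacy.load` and `spacy.cli.download` are real spaCy calls, and they are imported lazily so the module still works when spaCy is not installed.

```python
import importlib.util
import re
from typing import List

MODEL_NAME = "en_core_web_sm"  # pipeline name taken from the PR description


def spacy_model_available(model_name: str = MODEL_NAME) -> bool:
    """Return True only if spaCy and the pipeline package can both be found."""
    if importlib.util.find_spec("spacy") is None:
        return False
    return importlib.util.find_spec(model_name) is not None


def ensure_spacy_model(model_name: str = MODEL_NAME) -> None:
    """Install the pipeline if it is missing (hypothetical helper, needs network)."""
    if importlib.util.find_spec("spacy") is None:
        raise ImportError("spaCy is not installed; it is needed for pre_splitter='spacy'")
    if not spacy_model_available(model_name):
        # Imported lazily so spaCy stays optional for everyone else.
        from spacy.cli import download

        # Note: a freshly downloaded package may need an interpreter
        # restart before it can be loaded.
        download(model_name)


def regex_split(text: str) -> List[str]:
    """Fallback splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def pre_split(text: str, pre_splitter: str = "regex") -> List[str]:
    """Split text into sentences with spaCy if requested, else with the regex."""
    if pre_splitter == "spacy":
        ensure_spacy_model()
        import spacy

        nlp = spacy.load(MODEL_NAME)
        return [s.text.strip() for s in nlp(text).sents if s.text.strip()]
    return regex_split(text)
```

Calling `pre_split(doc)` never touches spaCy, while `pre_split(doc, pre_splitter="spacy")` checks for the pipeline first and fails with a clear `ImportError` if spaCy itself is absent.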

klein-t avatar Mar 15 '24 13:03 klein-t

hey @klein-t, thanks for the PR, this is a great idea. I just have some small feedback points on making spacy an optional dependency before we merge. lmk what you think, thanks!

Hey @jamescalam, all made sense. I implemented the changes. Let me know if all is good. I was really short on time, so I did things in the quickest way I could think of. Everything is tested, but still, let me know how you feel about it.

Cheers,

Klein

klein-t avatar Mar 15 '24 16:03 klein-t

Hey @klein-t please fix the poetry lock issue and add the tests that you mentioned to the test_splitters.py file.

Additionally, some tests that can be added:

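from unittest.mock import Mock

import numpy as np

from semantic_router.encoders import CohereEncoder
from semantic_router.splitters.rolling_window import RollingWindowSplitter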

def test_split_documents_rolling_window_splitter():
    # Mock the BaseEncoder
    mock_encoder = Mock()

    # Simulate encoding by returning an array of vectors
    mock_encoder.return_value = np.array(
        [[0.2, 0.8], [0.2, 0.8], [1, 0], [0, 1], [0, 0], [0.2, 0.8]]
    )

    cohere_encoder = CohereEncoder(
        name="",
        cohere_api_key="a",
        input_type="",
    )
    test_doc = [
        """
        The ancient oak tree. The tree is a silent witness to centuries, stood majestically at the crest of the rolling hill, its leaves whispering the secrets of a bygone era.
        The quick brown fox jumps over the lazy dog.
        Innovative technologies in renewable energy are revolutionizing the way we power our cities, significantly reducing the carbon footprint and fostering a sustainable future.
        The art of sushi-making demands precision and patience, with each roll a delicate balance of flavors, textures, and aesthetics, reflecting the culinary heritage of Japan.
        """
    ]

    splitter = RollingWindowSplitter(
        encoder=cohere_encoder, window_size=5, min_split_tokens=1
    )
    splitter.encoder = mock_encoder
    splits = splitter(test_doc)
    print(splits)
    assert len(splits) == 3


def test_split_documents_rolling_window_splitter_with_spacy():
    # Mock the BaseEncoder
    mock_encoder = Mock()

    # Simulate encoding by returning an array of vectors
    mock_encoder.return_value = np.array(
        [[0.2, 0.8], [0.2, 0.8], [1, 0], [0, 1], [0, 0], [0.2, 0.8]]
    )

    cohere_encoder = CohereEncoder(
        name="",
        cohere_api_key="a",
        input_type="",
    )
    test_doc = [
        """
        The ancient oak tree. The tree is a silent witness to centuries, stood majestically at the crest of the rolling hill, its leaves whispering the secrets of a bygone era.
        The quick brown fox jumps over the lazy dog.
        Innovative technologies in renewable energy are revolutionizing the way we power our cities, significantly reducing the carbon footprint and fostering a sustainable future.
        The art of sushi-making demands precision and patience, with each roll a delicate balance of flavors, textures, and aesthetics, reflecting the culinary heritage of Japan.
        """
    ]

    splitter = RollingWindowSplitter(
        encoder=cohere_encoder, window_size=5, min_split_tokens=1, pre_splitter="spacy"
    )
    splitter.encoder = mock_encoder
    splits = splitter(test_doc)
    print(splits)
    assert len(splits) == 3

Besides that, everything works correctly.

mesax1 avatar Mar 21 '24 06:03 mesax1

Hey @mesax1, thanks. I will do it later today/tomorrow!

klein-t avatar Mar 21 '24 10:03 klein-t