langchain4j icon indicating copy to clipboard operation
langchain4j copied to clipboard

feat(langchain4j-milvus): MilvusEmbeddingStore supports configure required index parameters

Open Glarme opened this issue 10 months ago • 5 comments

Context

Added support for configure common custom index parameters

Fix: #860

Change

The constructor of MilvusEmbeddingStore now require a parameter of type IndexParam and provides parameter models required for common type indexes.

Checklist

Before submitting this PR, please check the following points:

  • [x] I have added unit and integration tests for my change
  • [x] All unit and integration tests in the module I have added/changed are green
  • [ ] All unit and integration tests in the core and main modules are green
  • [ ] I have added/updated the documentation
  • [ ] I have added an example in the examples repo (only for "big" features)
  • [ ] I have added my new module in the BOM (only when a new module is added)

Checklist for adding new embedding store integration

  • [ ] I have added a {NameOfIntegration}EmbeddingStoreIT that extends from either EmbeddingStoreIT or EmbeddingStoreWithFilteringIT

Glarme avatar Apr 12 '24 09:04 Glarme

@langchain4j Is there a configured milvus instance available for CI unit testing? Attempting to connect using MILVUS_URI and MILVUS_API_KEY retrieved from environment variables seems to fail.

Glarme avatar Apr 12 '24 10:04 Glarme

@langchain4j Is there a configured milvus instance available for CI unit testing? Attempting to connect using MILVUS_URI and MILVUS_API_KEY retrieved from environment variables seems to fail.

You can use the @EnabledIfEnvironmentVariable annotation to conditionally enable a test class based on an environment variable. For example:

@EnabledIfEnvironmentVariable(named = "MILVUS_API_KEY", matches = ".+")
class MilvusEmbeddingStoreCloudIT extends EmbeddingStoreWithFilteringIT {
    // Test class content
}

Additionally, I suggest that MilvusEmbeddingStore should employ the 'upsert' operation instead of 'insert', similar to the approach taken by PgVector, Pinecone, and AstraDb. This change would help to alleviate issues arising from id conflicts. Furthermore, @langchain4j the EmbeddingStore interface contains a method public void add(String id, Embedding embedding). Would it be possible to include a public void add(String id, Embedding embedding, TextSegment textSegment), that enables the upserting of a record along with a text field?

jiangsier-xyz avatar Apr 13 '24 09:04 jiangsier-xyz

@langchain4j Is there a configured milvus instance available for CI unit testing? Attempting to connect using MILVUS_URI and MILVUS_API_KEY retrieved from environment variables seems to fail.

You can use the @EnabledIfEnvironmentVariable annotation to conditionally enable a test class based on an environment variable. For example:

@EnabledIfEnvironmentVariable(named = "MILVUS_API_KEY", matches = ".+")
class MilvusEmbeddingStoreCloudIT extends EmbeddingStoreWithFilteringIT {
    // Test class content
}

Additionally, I suggest that MilvusEmbeddingStore should employ the 'upsert' operation instead of 'insert', similar to the approach taken by PgVector, Pinecone, and AstraDb. This change would help to alleviate issues arising from id conflicts. Furthermore, @langchain4j the EmbeddingStore interface contains a method public void add(String id, Embedding embedding). Would it be possible to include a public void add(String id, Embedding embedding, TextSegment textSegment), that enables the upserting of a record along with a text field?

@langchain4j Is there a configured milvus instance available for CI unit testing? Attempting to connect using MILVUS_URI and MILVUS_API_KEY retrieved from environment variables seems to fail.

You can use the @EnabledIfEnvironmentVariable annotation to conditionally enable a test class based on an environment variable. For example:

@EnabledIfEnvironmentVariable(named = "MILVUS_API_KEY", matches = ".+")
class MilvusEmbeddingStoreCloudIT extends EmbeddingStoreWithFilteringIT {
    // Test class content
}

Additionally, I suggest that MilvusEmbeddingStore should employ the 'upsert' operation instead of 'insert', similar to the approach taken by PgVector, Pinecone, and AstraDb. This change would help to alleviate issues arising from id conflicts. Furthermore, @langchain4j the EmbeddingStore interface contains a method public void add(String id, Embedding embedding). Would it be possible to include a public void add(String id, Embedding embedding, TextSegment textSegment), that enables the upserting of a record along with a text field?

This PR mainly aims to add parameters for Index creation. Collection and index creation are performed within the constructor of MilvusEmbeddingStore, contingent upon the collection not existing. Without a Milvus instance, unit tests cannot reach the location of this modification. If skipping the test based on conditions using the @EnabledIfEnvironmentVariable method without a Milvus instance, then this unit test would serve no purpose.

Additionally, regarding modifications to other methods in MilvusEmbeddingStore, it might be worth opening a separate issue for discussion. In addition to your suggestions, I also propose adding a vector deletion interface based on ID or filters to EmbeddingStore, enabling CRUD operations on vectors to be performed within EmbeddingStore. For updates, this can be achieved through a combination of deletion and creation, eliminating the need to create a separate client for maintaining this data functionality.

Glarme avatar Apr 13 '24 13:04 Glarme

The Milvus instance in unit testing has been adjusted to be provided by @Testcontainers, and all unit tests have passed. It's worth noting that configuring Milvus to use @Testcontainers requires additional settings, as shown below.

MilvusContainer milvus = new MilvusContainer("milvusdb/milvus:v2.3.1")
            .withCreateContainerCmdModifier(cmd -> {
                cmd.withHostConfig(HostConfig.newHostConfig()
                        // This security opts needs to be configured;
                        // otherwise, the container will fail to start.
                        // Refer to the parameters from https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh.
                        .withSecurityOpts(Collections.singletonList("seccomp=unconfined"))
                        // By default, the Milvus Docker image doesn't expose the target port.
                        // Manual binding is required to expose it;
                        // otherwise, it will result in a failed detection during the test container inspection,
                        // leading to an inability to connect to the Milvus instance.
                        .withPortBindings(PortBinding.parse("19530:19530"), PortBinding.parse("9091:9091")
                        )
                );
            });

Glarme avatar Apr 19 '24 10:04 Glarme

@langchain4j Is there a configured milvus instance available for CI unit testing? Attempting to connect using MILVUS_URI and MILVUS_API_KEY retrieved from environment variables seems to fail.

No, not for the main CI yet. Only for releases: https://github.com/langchain4j/langchain4j/blob/main/.github/workflows/release.yaml#L53

langchain4j avatar Apr 22 '24 16:04 langchain4j