
bug/VOYAGE embedding models supported but not available in PIPELINE

Open jeremydiba opened this issue 7 months ago • 0 comments

Describe the bug

In Pipeline -> EmbedderConfig, every embedding model documented at https://docs.unstructured.io/open-source/core-functionality/embedding#voyageaiembeddingencoder is supported except Voyage, which raises an error saying the encoder is not recognized.

To Reproduce

from unstructured.ingest.v2.pipeline.pipeline import Pipeline
from unstructured.ingest.v2.interfaces import ProcessorConfig
from unstructured.ingest.v2.processes.connectors.fsspec.s3 import (
    S3IndexerConfig,
    S3DownloaderConfig,
    S3ConnectionConfig,
    S3AccessConfig,
    S3UploaderConfig
)
from unstructured.ingest.v2.processes.partitioner import PartitionerConfig
from unstructured.ingest.v2.processes.chunker import ChunkerConfig
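from unstructured.ingest.v2.processes.embedder import EmbedderConfig

# INPUT_S3_FILE and OUTPUT_S3_FILEPATH are not defined in this snippet;
# the quoted credential and API-key strings are placeholders.
pipeline = Pipeline.from_configs(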
    context=ProcessorConfig(),
    indexer_config=S3IndexerConfig(remote_url=INPUT_S3_FILE),
    downloader_config=S3DownloaderConfig(download_dir="s3-ingest-download"),
    source_connection_config=S3ConnectionConfig(
        access_config=S3AccessConfig(
            key="AWS_ACCESS_KEY_ID",
            secret="AWS_SECRET_ACCESS_KEY",
            token="AWS_SESSION_TOKEN"
        )
    ),
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key="UNSTRUCTURED_API_KEY_AUTH",
        partition_endpoint="UNSTRUCTURED_SERVER_URL",
        strategy="auto"
    ),
    chunker_config=ChunkerConfig(
        chunking_strategy="by_title",
        chunk_combine_text_under_n_chars=100,
        chunk_include_orig_elements=False,
        chunk_max_characters=4000
    ),
    embedder_config=EmbedderConfig(
        embedding_provider="Voyage",
        embedding_api_key="VOYAGE_API_KEY",
        embedding_model_name="voyage-law-2"
    ),
    destination_connection_config=S3ConnectionConfig(
        access_config=S3AccessConfig(
            key="AWS_ACCESS_KEY_ID",
            secret="AWS_SECRET_ACCESS_KEY",
            token="AWS_SESSION_TOKEN"
        )
    ),
    uploader_config=S3UploaderConfig(remote_url=OUTPUT_S3_FILEPATH)
)

Expected behavior

VoyageAIEmbeddingEncoder / Voyage should be a valid parameter in the pipeline. If support is not intended, the documentation should indicate that this functionality is available only when run outside the pipeline.

Environment Info

Python 3.11

ValueError: Voyage not a recognized encoder
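
The wording of the error suggests that the provider name is looked up in a fixed set of recognized names and that "Voyage" is not among them. The sketch below only illustrates that general pattern. SketchEmbedderConfig, RECOGNIZED_PROVIDERS, and the provider names in it are hypothetical and are not taken from the unstructured source code.

```python
from dataclasses import dataclass

# Hypothetical placeholder names; the real library's accepted values may differ.
RECOGNIZED_PROVIDERS = {"provider-a", "provider-b"}


@dataclass
class SketchEmbedderConfig:
    """Illustrative stand-in for a config that validates its provider name."""

    embedding_provider: str

    def get_encoder_name(self) -> str:
        # Reject any provider name that is missing from the lookup set.
        if self.embedding_provider not in RECOGNIZED_PROVIDERS:
            raise ValueError(f"{self.embedding_provider} not a recognized encoder")
        return self.embedding_provider


if __name__ == "__main__":
    try:
        SketchEmbedderConfig(embedding_provider="Voyage").get_encoder_name()
    except ValueError as exc:
        print(exc)
```

Under this assumption, a provider that is documented for standalone use would still fail inside the pipeline until its name is added to the lookup.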

jeremydiba, Jul 19 '24 17:07