azure-search-vector-samples

Skillset triggered via Indexer is not able to create vector embeddings

Open · aayushrajj opened this issue 11 months ago · 0 comments

I have connected Blob Storage to Azure AI Search via an indexer, creating the required data source, skillset, index, and indexer. I have used two skills: SplitSkill and AzureOpenAIEmbeddingSkill. SplitSkill is working properly, as I can see documents in the index being split into chunks, but no vector embeddings are being generated and the vector embedding field remains empty.

What could be the reason? I have checked and verified the embedding model, the skillset, and the index. I have used the code from the Azure GitHub samples.
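For reference, here is a minimal sketch of how the indexer's execution history can be inspected for per-skill errors and warnings (it assumes the same endpoint, credential, and index_name variables used in the skillset code below; the indexer name is an assumption based on the naming convention):

from azure.search.documents.indexes import SearchIndexerClient

# Hypothetical indexer name, following the same naming convention as the skillset.
indexer_name = f"{index_name}-indexer"

client = SearchIndexerClient(endpoint, credential)
status = client.get_indexer_status(indexer_name)

# The last execution result carries the errors and warnings raised during the run,
# including failures from individual skills such as the embedding skill.
print(f"Last run status: {status.last_result.status}")
for error in status.last_result.errors:
    print(f"Error: {error.error_message}")
for warning in status.last_result.warnings:
    print(f"Warning: {warning.message}")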

Skillset Code:

from azure.search.documents.indexes import SearchIndexerClient  # needed for create_or_update_skillset below
from azure.search.documents.indexes.models import (
    SplitSkill,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    AzureOpenAIEmbeddingSkill,
    SearchIndexerIndexProjections,
    SearchIndexerIndexProjectionSelector,
    SearchIndexerIndexProjectionsParameters,
    IndexProjectionMode,
    SearchIndexerSkillset
)

# Create a skillset  
skillset_name = f"{index_name}-skillset"


# Use merged content when OCR is enabled; otherwise, use the normal document content.
split_skill_text_source = "/document/content" if not use_ocr else "/document/merged_content"
split_skill = SplitSkill(  
    description="Split skill to chunk documents",  
    text_split_mode="pages",  
    context="/document",  
    maximum_page_length=2000,  
    page_overlap_length=500,  
    inputs=[  
        InputFieldMappingEntry(name="text", source=split_skill_text_source),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="textItems", target_name="pages")  
    ],  
)  
  
embedding_skill = AzureOpenAIEmbeddingSkill(  
    description="Skill to generate embeddings via Azure OpenAI",  
    context="/document/pages/*",  
    resource_uri=azure_openai_endpoint,  
    deployment_id=azure_openai_embedding_deployment,  
    model_name=azure_openai_model_name,
    dimensions=dimensions,
    api_key=model_key,  
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/pages/*"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="embedding", target_name="content_vector")  
    ],  
)  
  
index_projections = SearchIndexerIndexProjections(  
    selectors=[  
        SearchIndexerIndexProjectionSelector(  
            target_index_name=index_name,  
            parent_key_field_name="parent_id",  
            source_context="/document/pages/*",  
            mappings=[  
                InputFieldMappingEntry(name="content", source="/document/pages/*"),  
                InputFieldMappingEntry(name="content_vector", source="/document/pages/*/vector"),  
                InputFieldMappingEntry(name="metadata", source="/document/metadata_storage_name"),  
            ],  
        ),  
    ],  
    parameters=SearchIndexerIndexProjectionsParameters(  
        projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS  
    ),  
) 


skills = [split_skill, embedding_skill]

skillset = SearchIndexerSkillset(  
    name=skillset_name,  
    description="Skillset to chunk documents and generating embeddings",  
    skills=skills,  
    index_projections=index_projections
)
  
client = SearchIndexerClient(endpoint, credential)  
client.create_or_update_skillset(skillset)  
print(f"{skillset.name} created")  

aayushrajj · Nov 14 '24 07:11