azure-search-openai-demo

Embeddings vector dimensions mismatch indexer error

Open DuboisABB opened this issue 7 months ago • 7 comments

This issue is for a: (mark with an x)

- [X] bug report -> please search issues before submitting

Minimal steps to reproduce

Set the .env variables as follows:

    AZURE_OPENAI_EMB_DEPLOYMENT="text-embedding-3-large"
    AZURE_OPENAI_EMB_DEPLOYMENT_CAPACITY=350
    AZURE_OPENAI_EMB_DEPLOYMENT_VERSION=1
    AZURE_OPENAI_EMB_DIMENSIONS=1536
    USE_FEATURE_INT_VECTORIZATION="true"

Then run azd up.

Any log messages given by the failure

When the indexer tries to run, it fails with this:

There's a mismatch in vector dimensions. The vector field 'embedding', with dimension of '1536', expects a length of '1536'. However, the provided vector has a length of '3072'. Please ensure that the vector length matches the expected length of the vector field. Read the following documentation for more details: https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-configure-compression-storage.
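
For context, 3072 is the default output size of text-embedding-3-large when no dimensions value is passed to the model, so the message is consistent with the skill running without that setting. A minimal sketch of the same length check, using a hypothetical helper (`check_vector_length` is illustrative and is not part of the repo or of Azure AI Search):

```python
from typing import Optional, Sequence


def check_vector_length(field_name: str, field_dimensions: int, vector: Sequence[float]) -> None:
    """Raise ValueError if the vector does not fit the index field (illustrative only)."""
    if len(vector) != field_dimensions:
        raise ValueError(
            f"Field '{field_name}' expects {field_dimensions} dimensions, "
            f"got a vector of length {len(vector)}"
        )


# Assumption: the skill ran without a 'dimensions' value, so the model
# returned its default size (3072 for text-embedding-3-large).
default_size_vector = [0.0] * 3072

mismatch: Optional[str] = None
try:
    check_vector_length("embedding", 1536, default_size_vector)
except ValueError as exc:
    mismatch = str(exc)

print(mismatch)
```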

When inspecting the code for gptkbindex-skillset in the portal, I notice this bit of code:

{
  "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
  "name": "#2",
  "description": "Skill to generate embeddings via Azure OpenAI",
  "context": "/document/pages/*",
  "resourceUri": "https://cog-trnz2cbjn4ofs.openai.azure.com",
  "apiKey": null,
  "deploymentId": "text-embedding-3-large",
  "dimensions": null,
  "modelName": null

So dimensions and modelName are null. Additionally, there is this warning in a banner above the code:

This skillset contains an AzureOpenAIEmbedding Skill created by previous API versions that doesn't include the 'modelName' field. We recommend you to migrate by adding 'experimental' value automatically to the field to restore full portal functionality.

If I manually change the skillset code in the portal to this, it works:

      "dimensions": 1536,
      "modelName": "text-embedding-3-large",

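The manual portal edit can be sketched as a small transformation on the skillset JSON. This is only an illustration on a plain dictionary: `patch_embedding_skills` is a hypothetical helper, the sample skillset is a trimmed-down stand-in for what the portal shows, and sending the result back to the search service is left out.

```python
import copy
import json

EMBEDDING_SKILL_TYPE = "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill"


def patch_embedding_skills(skillset: dict, dimensions: int, model_name: str) -> dict:
    """Return a copy of the skillset with 'dimensions' and 'modelName'
    set on every Azure OpenAI embedding skill."""
    patched = copy.deepcopy(skillset)
    for skill in patched.get("skills", []):
        if skill.get("@odata.type") == EMBEDDING_SKILL_TYPE:
            skill["dimensions"] = dimensions
            skill["modelName"] = model_name
    return patched


# Trimmed-down stand-in for the skillset JSON shown in the portal.
skillset = {
    "name": "gptkbindex-skillset",
    "skills": [
        {
            "@odata.type": EMBEDDING_SKILL_TYPE,
            "deploymentId": "text-embedding-3-large",
            "dimensions": None,
            "modelName": None,
        }
    ],
}

patched = patch_embedding_skills(skillset, 1536, "text-embedding-3-large")
print(json.dumps(patched["skills"][0], indent=2))
```
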
I tried to change the code in integratedvectorizerstrategy.py to this:

        import os
        embeddingDimensions = int(os.getenv('AZURE_OPENAI_EMB_DIMENSIONS'))
        embeddingModelName = os.getenv('AZURE_OPENAI_EMB_MODEL_NAME')

        embedding_skill = AzureOpenAIEmbeddingSkill(
            description="Skill to generate embeddings via Azure OpenAI",
            context="/document/pages/*",
            resource_uri=f"https://{self.embeddings.open_ai_service}.openai.azure.com",
            deployment_id=self.embeddings.open_ai_deployment,
            dimensions=embeddingDimensions,
            modelName=embeddingModelName,
            inputs=[
                InputFieldMappingEntry(name="text", source="/document/pages/*"),
            ],
            outputs=[OutputFieldMappingEntry(name="embedding", target_name="vector")],
        )

However, for some reason, this doesn't change the code for the skillset that I see in the portal, even if I delete the skillset completely to make sure that it gets regenerated.
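
One detail in the snippet above may matter: AZURE_OPENAI_EMB_MODEL_NAME is not among the variables listed in the repro steps, so `os.getenv` may be returning `None` for it, and `int(os.getenv(...))` raises a `TypeError` if the dimensions variable is not visible to the process. A minimal sketch that surfaces this, assuming those two variable names (`read_embedding_settings` is a hypothetical helper, not code from the repo):

```python
import os
from typing import Mapping, Optional, Tuple


def read_embedding_settings(env: Mapping[str, str] = os.environ) -> Tuple[int, str]:
    """Read the embedding settings, failing loudly instead of passing None along."""
    raw_dimensions = env.get("AZURE_OPENAI_EMB_DIMENSIONS")
    model_name = env.get("AZURE_OPENAI_EMB_MODEL_NAME")
    candidates = (
        ("AZURE_OPENAI_EMB_DIMENSIONS", raw_dimensions),
        ("AZURE_OPENAI_EMB_MODEL_NAME", model_name),
    )
    missing = [name for name, value in candidates if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return int(raw_dimensions), model_name


# Stand-in for the environment from the repro steps (no model name variable set).
repro_env = {
    "AZURE_OPENAI_EMB_DEPLOYMENT": "text-embedding-3-large",
    "AZURE_OPENAI_EMB_DIMENSIONS": "1536",
}

settings: Optional[Tuple[int, str]] = None
problem: Optional[str] = None
try:
    settings = read_embedding_settings(repro_env)
except RuntimeError as exc:
    problem = str(exc)

print(problem)
```

It may also be worth checking whether the installed SDK version expects the keyword spelled `model_name` rather than `modelName`. That is an unverified assumption here, not a confirmed cause.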

Expected/desired behavior

No indexer error.

OS and Version?

Windows 11

azd version?

azd version 1.9.5 (commit cd2b7af9995d358aab33c782614f801ac1997dde)

Versions

I merged the latest commit from 2024-07-16 (main #1789) into my local fork. I do have some local code modifications, but AFAIK none that would affect this.

DuboisABB, Jul 16 '24 20:07