azure-search-openai-demo
Embeddings vector dimensions mismatch indexer error
This issue is for a: (mark with an x)
- [X] bug report -> please search issues before submitting
Minimal steps to reproduce
Set .env variables as follows:

```shell
AZURE_OPENAI_EMB_DEPLOYMENT="text-embedding-3-large"
AZURE_OPENAI_EMB_DEPLOYMENT_CAPACITY=350
AZURE_OPENAI_EMB_DEPLOYMENT_VERSION=1
AZURE_OPENAI_EMB_DIMENSIONS=1536
USE_FEATURE_INT_VECTORIZATION="true"
```
Then do azd up
Any log messages given by the failure
When the indexer tries to run, it fails with this:

```
There's a mismatch in vector dimensions. The vector field 'embedding', with dimension of '1536', expects a length of '1536'. However, the provided vector has a length of '3072'. Please ensure that the vector length matches the expected length of the vector field. Read the following documentation for more details: https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-configure-compression-storage.
```
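For context on what the service is rejecting: the index field was created with 1536 dimensions (from `AZURE_OPENAI_EMB_DIMENSIONS`), but the skillset carried no `dimensions` value, so text-embedding-3-large returned its native 3072-length vectors. A minimal sketch of that kind of length check (function and message are illustrative, not the service's actual code):

```python
def validate_vector(field_dimensions: int, vector: list[float]) -> None:
    # The index field's declared dimensionality must equal the length
    # of every vector the indexer tries to store in it.
    if len(vector) != field_dimensions:
        raise ValueError(
            f"The vector field 'embedding', with dimension of "
            f"'{field_dimensions}', expects a length of '{field_dimensions}'. "
            f"However, the provided vector has a length of '{len(vector)}'."
        )

# Index field created with AZURE_OPENAI_EMB_DIMENSIONS=1536 ...
field_dims = 1536
# ... but with 'dimensions': null in the skill, the model emits its native size.
native_vector = [0.0] * 3072
try:
    validate_vector(field_dims, native_vector)
except ValueError as e:
    print(e)
```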
When inspecting the code for gptkbindex-skillset in the portal, I notice this bit of code:
```json
{
  "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
  "name": "#2",
  "description": "Skill to generate embeddings via Azure OpenAI",
  "context": "/document/pages/*",
  "resourceUri": "https://cog-trnz2cbjn4ofs.openai.azure.com",
  "apiKey": null,
  "deploymentId": "text-embedding-3-large",
  "dimensions": null,
  "modelName": null
}
```
So dimensions and modelName are null. Additionally, there is this warning in a banner above the code:
This skillset contains an AzureOpenAIEmbedding Skill created by previous API versions that doesn't include the 'modelName' field. We recommend you to migrate by adding 'experimental' value automatically to the field to restore full portal functionality.
If I manually change the skillset code in the portal to this, it works:

```json
"dimensions": 1536,
"modelName": "text-embedding-3-large",
```
I tried to change the code in integratedvectorizerstrategy.py to this:

```python
import os

embeddingDimensions = int(os.getenv('AZURE_OPENAI_EMB_DIMENSIONS'))
embeddingModelName = os.getenv('AZURE_OPENAI_EMB_MODEL_NAME')

embedding_skill = AzureOpenAIEmbeddingSkill(
    description="Skill to generate embeddings via Azure OpenAI",
    context="/document/pages/*",
    resource_uri=f"https://{self.embeddings.open_ai_service}.openai.azure.com",
    deployment_id=self.embeddings.open_ai_deployment,
    dimensions=embeddingDimensions,
    modelName=embeddingModelName,
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/pages/*"),
    ],
    outputs=[OutputFieldMappingEntry(name="embedding", target_name="vector")],
)
```
However, for some reason, this doesn't change the code for the skillset that I see in the portal, even if I delete the skillset completely to make sure that it gets regenerated.
Expected/desired behavior
No indexer error.
OS and Version?
Windows 11
azd version?
azd version 1.9.5 (commit cd2b7af9995d358aab33c782614f801ac1997dde)
Versions
I merged the last commit from 2024-07-16 (main #1789) into my local fork. So I do have some local code modifications but AFAIK, none that would affect this.
Ok so I just read in the doc that integrated vectorization is incompatible with the newer embedding models: https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/docs/deploy_features.md#enabling-authentication
However, MS docs seem to indicate that it's indeed compatible: https://learn.microsoft.com/en-us/azure/search/search-get-started-portal-import-vectors?tabs=sample-data-storage%2Cmodel-aoai
So I guess that this bug report is turning into a feature request.
Further investigation: it looks like the class AzureOpenAIEmbeddingSkill doesn't support dimensions or model_name, in `.venv\Lib\site-packages\azure\search\documents\indexes\_generated\models\_models_py3.py`
However, the documentation for that class mentions that it should support that: https://learn.microsoft.com/en-us/python/api/azure-search-documents/azure.search.documents.indexes.models.azureopenaiembeddingskill?view=azure-python-preview
So for some reason we are using an old SDK. I'm pushing my knowledge at this point, I have no idea how to use the latest SDK.
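A quick way to tell which SDK surface is installed, without digging through the generated files, is to check the model's constructor signature with `inspect`. The sketch below uses a stand-in class mimicking the pre-11.6.0b4 model (the real check would import `AzureOpenAIEmbeddingSkill` from `azure.search.documents.indexes.models` instead):

```python
import inspect

# Stand-in for the older generated model, which lacked the new fields.
# With the real SDK you would instead do:
#   from azure.search.documents.indexes.models import AzureOpenAIEmbeddingSkill
class AzureOpenAIEmbeddingSkillOld:
    def __init__(self, *, resource_uri=None, deployment_id=None, **kwargs):
        self.resource_uri = resource_uri
        self.deployment_id = deployment_id

def supports(cls, param: str) -> bool:
    """True if `param` is an explicit keyword argument of cls.__init__."""
    return param in inspect.signature(cls.__init__).parameters

print(supports(AzureOpenAIEmbeddingSkillOld, "model_name"))  # False -> SDK too old
print(supports(AzureOpenAIEmbeddingSkillOld, "dimensions"))  # False
```

Because these models accept `**kwargs`, passing an unknown parameter is silently ignored rather than raising an error, which is why the earlier attempt produced no visible change in the skillset.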
One more update... I figured out how to get the latest SDK. I changed this line in requirements.txt: `azure-search-documents==11.6.0b4`
Then azd up correctly updates _models_py3.py with the updated AzureOpenAIEmbeddingSkill class. Then, this modified code seems to work (just added dimensions and model_name parameters):

```python
import os

embeddingDimensions = int(os.getenv('AZURE_OPENAI_EMB_DIMENSIONS'))
embeddingModelName = os.getenv('AZURE_OPENAI_EMB_MODEL_NAME')

embedding_skill = AzureOpenAIEmbeddingSkill(
    description="Skill to generate embeddings via Azure OpenAI",
    context="/document/pages/*",
    resource_uri=f"https://{self.embeddings.open_ai_service}.openai.azure.com",
    deployment_id=self.embeddings.open_ai_deployment,
    dimensions=embeddingDimensions,
    model_name=embeddingModelName,
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/pages/*"),
    ],
    outputs=[OutputFieldMappingEntry(name="embedding", target_name="vector")],
)
```
However, I still get this error at the prepdocs.py step:

```
File "C:\Programming\bochat9\.venv\Lib\site-packages\azure\search\documents\indexes\_generated\aio\operations\_indexes_operations.py", line 192, in create
    raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: () The request is invalid. Details: definition : Error in Vectorizer 'gptkbindex-vectorizer' : 'modelName' parameter is required in API version '2024-05-01-preview'.
Code:
Message: The request is invalid. Details: definition : Error in Vectorizer 'gptkbindex-vectorizer' : 'modelName' parameter is required in API version '2024-05-01-preview'.
```
One more required change: we need to pass model_name to the AzureOpenAIParameters used by AzureOpenAIVectorizer:

```python
await search_manager.create_index(
    vectorizers=[
        AzureOpenAIVectorizer(
            name=f"{self.search_info.index_name}-vectorizer",
            kind="azureOpenAI",
            azure_open_ai_parameters=AzureOpenAIParameters(
                resource_uri=f"https://{self.embeddings.open_ai_service}.openai.azure.com",
                deployment_id=self.embeddings.open_ai_deployment,
                model_name=embeddingModelName,  # added for API version '2024-05-01-preview'
            ),
        ),
    ]
)
```
I had also forgotten that I changed this bit in strategy.py:

```python
def create_search_indexer_client(self) -> SearchIndexerClient:
    return SearchIndexerClient(endpoint=self.endpoint, credential=self.credential, api_version="2024-05-01-preview")
```
Now prepdocs.py runs without errors.
I deployed the app with the new "text-embedding-3-large" supporting 3072 dimensions and had no problems with it.
It is not only important that the indexer's skillset is set up for the correct dimensions, but also that the index's "embedding" field is set up for these 3072 dimensions.
It should work if you set the env variable "AZURE_OPENAI_EMB_DIMENSIONS=3072" before running azd up
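To keep the skill and the index field from drifting apart, both can be driven from one resolved value, capped at the model's native output size. A hedged sketch (`resolve_dimensions` is a hypothetical helper, not repo code; the native sizes are from the published OpenAI model specs):

```python
import os

# Native output sizes of the common OpenAI embedding models.
# Note: only the text-embedding-3 models accept a reduced `dimensions` value.
NATIVE_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def resolve_dimensions(model_name: str) -> int:
    """Single source of truth for both the skillset and the index field."""
    native = NATIVE_DIMENSIONS[model_name]
    requested = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS", native))
    if requested > native:
        raise ValueError(
            f"{model_name} produces at most {native} dimensions, "
            f"got AZURE_OPENAI_EMB_DIMENSIONS={requested}"
        )
    return requested

os.environ["AZURE_OPENAI_EMB_DIMENSIONS"] = "3072"
dims = resolve_dimensions("text-embedding-3-large")
# Pass `dims` to both AzureOpenAIEmbeddingSkill(dimensions=dims, ...)
# and the index's vector field (vector_search_dimensions=dims).
print(dims)  # 3072
```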
I have deployed with the defaults and still get this warning in the portal when viewing the skillset "gptkbindex-skillset": "This skillset contains an AzureOpenAIEmbedding Skill created by previous API versions that doesn't include the 'modelName' field. We recommend you to migrate by adding 'experimental' value automatically to the field to restore full portal functionality." It seems like the code base does not set "modelName" properly when creating the skillset, does it?
https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/prepdocslib/integratedvectorizerstrategy.py#L84
https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/prepdocslib/integratedvectorizerstrategy.py#L95
I've replicated that warning, I think we need to update our azure-search-documents package to be able to specify that. Working on it.
PR here: https://github.com/Azure-Samples/azure-search-openai-demo/pull/2045