azure-search-openai-demo
Embeddings vector dimensions mismatch indexer error
This issue is for a: (mark with an x)
- [X] bug report -> please search issues before submitting
Minimal steps to reproduce
Set .env variables as follows:

```shell
AZURE_OPENAI_EMB_DEPLOYMENT="text-embedding-3-large"
AZURE_OPENAI_EMB_DEPLOYMENT_CAPACITY=350
AZURE_OPENAI_EMB_DEPLOYMENT_VERSION=1
AZURE_OPENAI_EMB_DIMENSIONS=1536
USE_FEATURE_INT_VECTORIZATION="true"
```
Then do azd up
Any log messages given by the failure
When the indexer tries to run, it fails with this:

```
There's a mismatch in vector dimensions. The vector field 'embedding', with dimension of '1536', expects a length of '1536'. However, the provided vector has a length of '3072'. Please ensure that the vector length matches the expected length of the vector field. Read the following documentation for more details: https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-configure-compression-storage.
```
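For context on what the service is rejecting: the index field was created with 1536 dimensions (from `AZURE_OPENAI_EMB_DIMENSIONS`), but the skillset carried no `dimensions` value, so text-embedding-3-large returned its native 3072-length vectors. A minimal sketch of that kind of length check (function and message are illustrative, not the service's actual code):

```python
def validate_vector(field_dimensions: int, vector: list[float]) -> None:
    # The index field's declared dimensionality must equal the length
    # of every vector the indexer tries to store in it.
    if len(vector) != field_dimensions:
        raise ValueError(
            f"The vector field 'embedding', with dimension of "
            f"'{field_dimensions}', expects a length of '{field_dimensions}'. "
            f"However, the provided vector has a length of '{len(vector)}'."
        )

# Index field created with AZURE_OPENAI_EMB_DIMENSIONS=1536 ...
field_dims = 1536
# ... but with 'dimensions': null in the skill, the model emits its native size.
native_vector = [0.0] * 3072
try:
    validate_vector(field_dims, native_vector)
except ValueError as e:
    print(e)
```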
When inspecting the code for gptkbindex-skillset in the portal, I notice this bit of code:
```json
{
  "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
  "name": "#2",
  "description": "Skill to generate embeddings via Azure OpenAI",
  "context": "/document/pages/*",
  "resourceUri": "https://cog-trnz2cbjn4ofs.openai.azure.com",
  "apiKey": null,
  "deploymentId": "text-embedding-3-large",
  "dimensions": null,
  "modelName": null
}
```
So dimensions and modelName are null. Additionally, there is this warning in a banner above the code:
This skillset contains an AzureOpenAIEmbedding Skill created by previous API versions that doesn't include the 'modelName' field. We recommend you to migrate by adding 'experimental' value automatically to the field to restore full portal functionality.
If I manually change the skillset code in the portal to this, it works:

```json
"dimensions": 1536,
"modelName": "text-embedding-3-large",
```
I tried to change the code in integratedvectorizerstrategy.py to this:

```python
import os

embeddingDimensions = int(os.getenv('AZURE_OPENAI_EMB_DIMENSIONS'))
embeddingModelName = os.getenv('AZURE_OPENAI_EMB_MODEL_NAME')

embedding_skill = AzureOpenAIEmbeddingSkill(
    description="Skill to generate embeddings via Azure OpenAI",
    context="/document/pages/*",
    resource_uri=f"https://{self.embeddings.open_ai_service}.openai.azure.com",
    deployment_id=self.embeddings.open_ai_deployment,
    dimensions=embeddingDimensions,
    modelName=embeddingModelName,
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/pages/*"),
    ],
    outputs=[OutputFieldMappingEntry(name="embedding", target_name="vector")],
)
```
However, for some reason, this doesn't change the code for the skillset that I see in the portal, even if I delete the skillset completely to make sure that it gets regenerated.
Expected/desired behavior
No indexer error.
OS and Version?
Windows 11
azd version?
azd version 1.9.5 (commit cd2b7af9995d358aab33c782614f801ac1997dde)
Versions
I merged the last commit from 2024-07-16 (main #1789) into my local fork. So I do have some local code modifications but AFAIK, none that would affect this.
Ok so I just read in the doc that integrated vectorization is incompatible with the newer embedding models: https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/docs/deploy_features.md#enabling-authentication
However, MS docs seem to indicate that it's indeed compatible: https://learn.microsoft.com/en-us/azure/search/search-get-started-portal-import-vectors?tabs=sample-data-storage%2Cmodel-aoai
So I guess that this bug report is turning into a feature request.
Further investigation: it looks like the class AzureOpenAIEmbeddingSkill doesn't support dimensions or model_name, in `.venv\Lib\site-packages\azure\search\documents\indexes\_generated\models\_models_py3.py`
However, the documentation for that class mentions that it should support that: https://learn.microsoft.com/en-us/python/api/azure-search-documents/azure.search.documents.indexes.models.azureopenaiembeddingskill?view=azure-python-preview
So for some reason we are using an old SDK. I'm pushing my knowledge at this point, I have no idea how to use the latest SDK.
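A quick way to tell which SDK surface is installed, without digging through the generated files, is to check the model's constructor signature with `inspect`. The sketch below uses a stand-in class mimicking the pre-11.6.0b4 model (the real check would import `AzureOpenAIEmbeddingSkill` from `azure.search.documents.indexes.models` instead):

```python
import inspect

# Stand-in for the older generated model, which lacked the new fields.
# With the real SDK you would instead do:
#   from azure.search.documents.indexes.models import AzureOpenAIEmbeddingSkill
class AzureOpenAIEmbeddingSkillOld:
    def __init__(self, *, resource_uri=None, deployment_id=None, **kwargs):
        self.resource_uri = resource_uri
        self.deployment_id = deployment_id

def supports(cls, param: str) -> bool:
    """True if `param` is an explicit keyword argument of cls.__init__."""
    return param in inspect.signature(cls.__init__).parameters

print(supports(AzureOpenAIEmbeddingSkillOld, "model_name"))  # False -> SDK too old
print(supports(AzureOpenAIEmbeddingSkillOld, "dimensions"))  # False
```

Because these models accept `**kwargs`, passing an unknown parameter is silently ignored rather than raising an error, which is why the earlier attempt produced no visible change in the skillset.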
One more update... I figured out how to get the latest SDK. I changed this line in requirements.txt: `azure-search-documents==11.6.0b4`
Then azd up correctly updates _models_py3.py with the updated AzureOpenAIEmbeddingSkill class. Then, this modified code seems to work (just added dimensions and model_name parameters):

```python
import os

embeddingDimensions = int(os.getenv('AZURE_OPENAI_EMB_DIMENSIONS'))
embeddingModelName = os.getenv('AZURE_OPENAI_EMB_MODEL_NAME')

embedding_skill = AzureOpenAIEmbeddingSkill(
    description="Skill to generate embeddings via Azure OpenAI",
    context="/document/pages/*",
    resource_uri=f"https://{self.embeddings.open_ai_service}.openai.azure.com",
    deployment_id=self.embeddings.open_ai_deployment,
    dimensions=embeddingDimensions,
    model_name=embeddingModelName,
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/pages/*"),
    ],
    outputs=[OutputFieldMappingEntry(name="embedding", target_name="vector")],
)
```
However, I still get this error at the prepdocs.py step:

```
File "C:\Programming\bochat9\.venv\Lib\site-packages\azure\search\documents\indexes\_generated\aio\operations\_indexes_operations.py", line 192, in create
    raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: () The request is invalid. Details: definition : Error in Vectorizer 'gptkbindex-vectorizer' : 'modelName' parameter is required in API version '2024-05-01-preview'.
Code:
Message: The request is invalid. Details: definition : Error in Vectorizer 'gptkbindex-vectorizer' : 'modelName' parameter is required in API version '2024-05-01-preview'.
```
One more required change: we need to pass model_name to the AzureOpenAIParameters used by AzureOpenAIVectorizer:

```python
await search_manager.create_index(
    vectorizers=[
        AzureOpenAIVectorizer(
            name=f"{self.search_info.index_name}-vectorizer",
            kind="azureOpenAI",
            azure_open_ai_parameters=AzureOpenAIParameters(
                resource_uri=f"https://{self.embeddings.open_ai_service}.openai.azure.com",
                deployment_id=self.embeddings.open_ai_deployment,
                model_name=embeddingModelName,  # added for API version '2024-05-01-preview'
            ),
        ),
    ]
)
```
I had also forgotten that I changed this bit in strategy.py:

```python
def create_search_indexer_client(self) -> SearchIndexerClient:
    return SearchIndexerClient(endpoint=self.endpoint, credential=self.credential, api_version="2024-05-01-preview")
```
Now prepdocs.py runs without errors.
I deployed the app with the new "text-embedding-3-large" supporting 3072 dimensions and had no problems with it.
It is not only important that the indexer's skillset is set up for the correct dimensions, but also that the index's "embedding" field is set up for these 3072 dimensions.
It should work if you set the env variable "AZURE_OPENAI_EMB_DIMENSIONS=3072" before running azd up
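To keep the skill and the index field from drifting apart, both can be driven from one resolved value, capped at the model's native output size. A hedged sketch (`resolve_dimensions` is a hypothetical helper, not repo code; the native sizes are from the published OpenAI model specs):

```python
import os

# Native output sizes of the common OpenAI embedding models.
# Note: only the text-embedding-3 models accept a reduced `dimensions` value.
NATIVE_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def resolve_dimensions(model_name: str) -> int:
    """Single source of truth for both the skillset and the index field."""
    native = NATIVE_DIMENSIONS[model_name]
    requested = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS", native))
    if requested > native:
        raise ValueError(
            f"{model_name} produces at most {native} dimensions, "
            f"got AZURE_OPENAI_EMB_DIMENSIONS={requested}"
        )
    return requested

os.environ["AZURE_OPENAI_EMB_DIMENSIONS"] = "3072"
dims = resolve_dimensions("text-embedding-3-large")
# Pass `dims` to both AzureOpenAIEmbeddingSkill(dimensions=dims, ...)
# and the index's vector field (vector_search_dimensions=dims).
print(dims)  # 3072
```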
I have deployed with the defaults and still get this warning in the portal when viewing the skillset "gptkbindex-skillset": "This skillset contains an AzureOpenAIEmbedding Skill created by previous API versions that doesn't include the 'modelName' field. We recommend you to migrate by adding 'experimental' value automatically to the field to restore full portal functionality." It seems like the code base does not set "modelName" properly when creating the skillset, does it?
https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/prepdocslib/integratedvectorizerstrategy.py#L84
https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/prepdocslib/integratedvectorizerstrategy.py#L95
I've replicated that warning, I think we need to update our azure-search-documents package to be able to specify that. Working on it.
PR here: https://github.com/Azure-Samples/azure-search-openai-demo/pull/2045