Understand file structure in blob storage
Please provide us with the following information:
This issue is for a: (mark with an x)
- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [x] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Minimal steps to reproduce
N/A
Any log messages given by the failure
N/A
Expected/desired behavior
Maintain hierarchical namespace in blob storage
OS and Version?
N/A
azd version?
N/A
Versions
N/A
Mention any other details that might be useful
Stellar job, @pamelafox, thanks so much. I am trying to understand why the file structure of the ingested data is flattened in blob storage. Is this necessary? It also seems to be enforced when resources are created: I get an error if I point to an existing storage account with hierarchical namespace enabled. However, blob storage for user-uploaded content must have hierarchical namespace enabled (as expected).
Any insights would be much appreciated. Thanks again.
Have you tried the option to use an existing ADLS2 container? https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/docs/login_and_acl.md#azure-data-lake-storage-gen2-setup That option was added specifically when we added login/access control. Or are you trying to use ADLS2 without access control?
ADLS2 accounts (hierarchical namespace) and Blob storage accounts are accessed through two separate Python packages, so we have to implement support for them differently. We've only implemented ADLS2 support in the way the docs above describe.
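For reference, a minimal sketch of the two SDKs in question (these are the published azure-storage-blob and azure-storage-filedatalake packages; the connection string is a placeholder):

```python
# Flat-namespace Blob storage accounts use the azure-storage-blob package:
from azure.storage.blob import BlobServiceClient

# Hierarchical-namespace (ADLS Gen2) accounts use azure-storage-filedatalake:
from azure.storage.filedatalake import DataLakeServiceClient

conn = "<connection-string>"  # placeholder
blob_service = BlobServiceClient.from_connection_string(conn)
datalake_service = DataLakeServiceClient.from_connection_string(conn)
```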
Without access control, for now. I had seen the referenced example, but it seems to apply only to the sample data, with the directory structure (and groups) defined in sampleacls.json. While I could generate a similar JSON file for my own data, that seems unnecessary. Again, this is ignoring access control.
Looking deeper into this, I wonder whether adjusting the file URL used for the upload to include the local file's relative path, thereby capturing the directory it lives in, would preserve the file structure in the cloud. Something like the sketch below.
More important is capturing this path, and not just the file name, in the AI Search index.
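A minimal sketch of that idea, assuming the azure-storage-blob package (the account URL, container name, and local directory are placeholders):

```python
from pathlib import Path

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

LOCAL_ROOT = Path("data")  # placeholder: local directory to ingest
service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)
container = service.get_container_client("content")  # placeholder container

for path in LOCAL_ROOT.rglob("*.pdf"):
    # Use the path relative to the local root as the blob name, so
    # "Department A/file1.pdf" keeps its directory prefix in the cloud.
    blob_name = path.relative_to(LOCAL_ROOT).as_posix()
    with path.open("rb") as data:
        container.upload_blob(name=blob_name, data=data, overwrite=True)
```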
For context, I am experimenting with a RAG workflow where file hierarchy is important for queries. For example, given this structure:
```
.
├── Department A
│   ├── file1.pdf
│   ├── file2.pdf
│   └── file3.pdf
└── Department B
    ├── file4.pdf
    ├── file5.pdf
    └── file6.pdf
```
I am exploring ways to have AI Search retrieve files/content from only one department or the other, so that the GPT model uses only those files for its answers.
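For illustration, roughly the kind of filtered query I have in mind, assuming a hypothetical filterable "directory" field in the index (the field name and endpoint are placeholders; "gptkbindex" is the sample's default index name):

```python
from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",  # placeholder
    index_name="gptkbindex",
    credential=DefaultAzureCredential(),
)

# "directory" is a hypothetical filterable field holding each file's folder.
results = client.search(
    search_text="quarterly budget",
    filter="directory eq 'Department A'",
)
for doc in results:
    print(doc["sourcefile"])
```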
There are a few files related to ADLS2 in the project: adlsgen2setup.py in scripts and listfilestrategy.py in prepdocs. I think you don't need adlsgen2setup, since you don't care about ACLs and you already have an existing ADLS2 container. Instead, you need to modify ADLSGen2ListFileStrategy so it doesn't care about ACLs/OIDs (rough sketch of the listing part below). As of last week, we now store the storageUrl as a field, so you could use that in your search queries, but you could also add a new field to reflect just the directory. You could also use the "Category" field for storing the path, since we don't assume what that's used for.
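A rough sketch of the ACL-free listing, assuming the azure-storage-filedatalake package (the account and filesystem names are placeholders; the real ADLSGen2ListFileStrategy in listfilestrategy.py does more than this):

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)
filesystem = service.get_file_system_client("content")  # placeholder filesystem

# Walk the hierarchy recursively, skip directories, and ignore ACLs/OIDs
# entirely. Each path.name keeps its directory prefix, e.g.
# "Department A/file1.pdf", which could feed a directory field in the index.
for path in filesystem.get_paths(recursive=True):
    if not path.is_directory:
        print(path.name)
```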
Understood. I can take it from here. Many thanks again, @pamelafox!