azure-search-openai-demo

Multiple Data Source

Open nhtkid opened this issue 2 years ago • 37 comments

Hey Fam,

Not an issue, but I'd like to know how to connect to multiple data sources such as SharePoint or Azure Storage.

Thanks!

nhtkid avatar Jun 02 '23 04:06 nhtkid

Hi @nhtkid I agree with you 100%, please look at https://github.com/Azure-Samples/azure-search-openai-demo/issues/225

itmilos avatar Jun 02 '23 11:06 itmilos

This amazing demo was a great tutorial on how to respond using enterprise data, but the approach I used was the Microsoft Graph API to retrieve the data from our SharePoint document libraries. I had more control this way than with a SharePoint Indexer. I was able to insert the data into Azure Storage, chunk it, and insert it into the index. I will be creating several indexes for different departments in our organization. Our document libraries contain pdf, doc and excel files, so I am using different libraries to chunk the data prior to inserting it into the index (similar to the demo). Using the Microsoft Graph API to retrieve documents that are marked for indexing will also allow us to upload changes to Azure Blob and the index. Would love to hear what others are doing.

aymiee avatar Jun 07 '23 14:06 aymiee

Microsoft is preparing to launch plugins in the coming days. These plugins will enable users to connect to a wide range of data stores, including databases and other relevant sources of information.

vrajroutu avatar Jun 07 '23 15:06 vrajroutu

@aymiee is there any way I could contact you with a couple of questions regarding your approach to multiple types of documents and their integrability with OpenAI service?

kristofrabay avatar Jun 10 '23 16:06 kristofrabay

Sure, my linkedIn: https://www.linkedin.com/in/aymiee-lee-1ab2a486/

aymiee avatar Jun 11 '23 05:06 aymiee

@aymiee would you like to share your implementation?

vrajroutu avatar Jun 11 '23 05:06 vrajroutu

Hello @aymiee , I sent you a request on LinkedIn too (I am Tessa). I am extending the same things as you are. Right now I support pdf, docx, ppt(x) and xlsx. However I still need to implement the Microsoft Graph API. I would love to connect and exchange thoughts and issues with each other. Thank you.

Tesax123 avatar Jun 12 '23 17:06 Tesax123

Hi @Tesax123, can you please share the implementation with us, and how you handled the other file formats as well? My LinkedIn: https://www.linkedin.com/in/dilipkumar5/

yadavdilip183 avatar Jun 14 '23 12:06 yadavdilip183

> Hi @Tesax123, can you please share the implementation with us, and how you handled the other file formats as well?

Writing from my other account. I can tell you that I personally used the python-docx, python-pptx and openpyxl libraries to support the other file types. The implementation is very similar to the pypdf code that is already there. Good luck
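For anyone following along, here is a minimal sketch of that idea: one small extractor per format, picked by file extension. The dispatch table and function names are hypothetical and not part of the demo's prepdocs.py, and the three libraries are third-party installs (python-docx, python-pptx, openpyxl).

```python
from pathlib import Path


def extract_docx(path):
    """Return the paragraph text of a .docx file (needs python-docx)."""
    import docx  # third-party: pip install python-docx
    return "\n".join(p.text for p in docx.Document(path).paragraphs)


def extract_pptx(path):
    """Return the text of every text-bearing shape in a .pptx file (needs python-pptx)."""
    import pptx  # third-party: pip install python-pptx
    parts = []
    for slide in pptx.Presentation(path).slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                parts.append(shape.text_frame.text)
    return "\n".join(parts)


def extract_xlsx(path):
    """Return one tab-separated line per row (needs openpyxl)."""
    import openpyxl  # third-party: pip install openpyxl
    # data_only=True returns the values Excel cached on the last save, not formulas.
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
    lines = []
    for ws in wb.worksheets:
        for row in ws.iter_rows(values_only=True):
            lines.append("\t".join("" if v is None else str(v) for v in row))
    return "\n".join(lines)


# Hypothetical dispatch table; the demo's prepdocs.py is not organised this way.
EXTRACTORS = {".docx": extract_docx, ".pptx": extract_pptx, ".xlsx": extract_xlsx}


def pick_extractor(filename):
    """Look up the extractor for a file by its (case-insensitive) extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"unsupported file type: {suffix or filename}")
    return EXTRACTORS[suffix]
```

Each extractor returns plain text, so the result can be handed to whatever chunking step is already used for the PDF text.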

tickx-cegeka avatar Jun 14 '23 12:06 tickx-cegeka

I just completed integrating SharePoint. All you need to do is follow the instructions here: https://learn.microsoft.com/en-us/azure/search/search-howto-index-sharepoint-online Then modify app.py to add another search client, and modify the approaches to search the SharePoint index as well.
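A rough sketch of what "search both indexes" could look like, assuming each index is wrapped in a plain function that takes the query and returns a ranked list of snippets. The function name `search_all` and the index name "sharepoint-index" are made up for illustration, and the lambdas stand in for real search client calls. Scores from two different indexes are not directly comparable, so this sketch interleaves results by rank rather than sorting by score.

```python
from itertools import zip_longest


def search_all(query, sources, top=3):
    """Run `query` against every source and interleave the ranked results.

    `sources` maps a label (for example an index name) to a callable that
    takes the query and returns a ranked list of text snippets.
    """
    ranked = [
        [(label, hit) for hit in search(query)[:top]]
        for label, search in sources.items()
    ]
    merged = []
    # Take the best hit from each source, then the second best, and so on.
    for tier in zip_longest(*ranked):
        merged.extend(item for item in tier if item is not None)
    return merged


# Stand-in search functions; real ones would query one index each.
sources = {
    "gptkbindex": lambda q: ["blob doc 1", "blob doc 2"],
    "sharepoint-index": lambda q: ["sp doc 1"],
}
results = search_all("vacation policy", sources)
```

Keeping the label next to each hit makes it possible to show in a citation which index the snippet came from.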

vrajroutu avatar Jun 15 '23 20:06 vrajroutu

> Hello @aymiee , I sent you a request on LinkedIn too (I am Tessa).
>
> I am extending the same things as you are. Right now I support pdf, docx, ppt(x) and xlsx.
>
> However I still need to implement the Microsoft Graph API. I would love to connect and exchange thoughts and issues with each other. Thank you.

Hello @Tesax123, how are you ingesting your xlsx files?

vrajroutu avatar Jun 15 '23 20:06 vrajroutu

I am sharing my code changes for retrieving data from SharePoint document libraries in the hope that others can assist with my challenges. This is a POC and not production ready.

You will need to create an Azure app per https://learn.microsoft.com/en-us/azure/search/search-howto-index-sharepoint-online. You can stop there and just use the SharePoint Indexer. However, it didn't give me the control I needed. Basically, instead of pointing to a folder, I retrieved my documents from the SharePoint document library via the Microsoft Graph API.
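To make the Graph API part concrete, here is a minimal sketch (not the code referred to above) of listing the files in the root folder of a document library and following result pages. It assumes you already have the library's drive id and an access token for the registered app; `list_library_files` and `http_get_json` are hypothetical helper names, and subfolders are not traversed.

```python
import json
import urllib.request

GRAPH = "https://graph.microsoft.com/v1.0"


def http_get_json(url, token):
    """GET a Graph URL with a bearer token and decode the JSON body."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def list_library_files(drive_id, token, get_json=http_get_json):
    """Yield every file in the root folder of a document library (a Graph 'drive')."""
    url = f"{GRAPH}/drives/{drive_id}/root/children"
    while url:
        page = get_json(url, token)
        for item in page.get("value", []):
            if "file" in item:  # folders carry a "folder" facet instead
                yield item
        url = page.get("@odata.nextLink")  # absent on the last page
```

The `get_json` parameter is only there so the paging logic can be tried out with canned responses before pointing it at a real tenant.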

However, I am faced with challenges. My challenge is chunking PDF, docx and xlsx files and saving the chunks in the index to be served to ChatGPT. PDF and docx files work fine; however, when chunking a table in Excel:

  1. I had to make sure that only values, not formulas, were in the cells. I'm using openpyxl, which does not evaluate formulas. I need to look into pyexcelerate or xlwings.
  2. Even though the values were saved and inserted into the index, there was no frame of reference, i.e. column names. When prompting ChatGPT, results were inconsistent.
  3. So my current implementation for Excel indexes each row of data as a separate document together with the column names. But this leads to situations where a row is split across multiple documents, or where a single document contains parts of multiple rows. When the data is chunked this way, the reference to the original rows (e.g. the identifiers in column 'A') is lost.

..oh! Make sure your column headings are on one line.
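One way to keep the column names and the row identifier attached to every row is sketched below. This is an illustration, not the prepdocs.py mentioned in this comment: it assumes the first row is a single-line header, that `rows` is a sequence of tuples (for example from openpyxl's `iter_rows(values_only=True)`), and that each row is kept whole rather than passed through a character-based chunker. The `id` format and the function name are hypothetical.

```python
def rows_to_documents(rows, sheet_name, key_column=0):
    """Turn a header row plus data rows into one self-describing text per row."""
    header, *data = rows
    docs = []
    for row_number, row in enumerate(data, start=2):  # row 1 is the header
        # Pair every value with its column name so the text stands on its own.
        pairs = [
            f"{name}: {value}"
            for name, value in zip(header, row)
            if value is not None
        ]
        docs.append({
            "id": f"{sheet_name}-row{row_number}",
            "key": row[key_column],  # e.g. the identifier in column 'A'
            "content": "; ".join(pairs),
        })
    return docs
```

Because every document carries its own header labels and key, a retrieved row can still be traced back to the sheet and row it came from.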

To run (back up your prepdocs.py and use mine), just run prepdocs.ps1. While testing, I toggle args.removeall between True and False to delete the blobs and index.

aymiee avatar Jun 21 '23 19:06 aymiee

Hi @aymiee,

ACS just had an update and now you can connect to your own data in the Playground.

https://techcommunity.microsoft.com/t5/ai-cognitive-services-blog/introducing-azure-openai-service-on-your-data-in-public-preview/ba-p/3847000

Not only that, you could generate the code and deploy the app directly within the Studio.

As an MS employee notes in a comment on the link above, a direct connection to SPO is not yet supported. However, it now supports Azure Blob and ACS Index.

Could you just turn the SPO content into an ACS index and connect to that as a workaround?

nhtkid avatar Jun 23 '23 04:06 nhtkid

> I just completed integrating SharePoint. All you need to do is follow the instructions here: https://learn.microsoft.com/en-us/azure/search/search-howto-index-sharepoint-online Then modify app.py to add another search client, and modify the approaches to search the SharePoint index as well.

Did you have to apply for the preview of the SharePoint indexer in ACS?

nhtkid avatar Jun 23 '23 04:06 nhtkid

Thanks @nhtkid, when I tried the SharePoint Indexer a few weeks ago, it worked great, but I didn't have the control I needed. At the time, I couldn't:

  1. Point to a specific folder in a document library.
  2. Specify which files I wanted indexed. I don't want all files indexed. There could be 1000's. Right now, we created a column in SharePoint called IndexYN. If it's Y, then the Graph API picks up the file for insertion into the blob and index.
  3. Index only changed files. The SharePoint Indexer is all or nothing.... I only want files that have been modified, and I can do this with the Graph API. I couldn't do this with the indexer. Perhaps I need to look at the property called dataChangeDetectionPolicy in the datasource of the Indexer?
  4. Control chunking. When using the SharePoint Indexer, how are you handling the chunking before you insert it into the index? I got an error regarding length when prompting with ChatGPT (Azure OpenAI).
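The IndexYN and modified-since checks from points 2 and 3 can be sketched as a small filter. This is only an illustration: it assumes each item is a dict shaped like a Graph list item with its custom columns under "fields" and a "lastModifiedDateTime" timestamp, and the function name `needs_indexing` is made up.

```python
from datetime import datetime, timezone


def needs_indexing(item, last_run):
    """True if the item is flagged IndexYN == "Y" and changed after last_run.

    `last_run` must be a timezone-aware datetime.
    """
    flagged = (item.get("fields", {}).get("IndexYN") or "").upper() == "Y"
    # Timestamps look like 2023-06-20T14:03:00Z; older Pythons need "+00:00".
    modified = datetime.fromisoformat(
        item["lastModifiedDateTime"].replace("Z", "+00:00")
    )
    return flagged and modified > last_run


# Example: only files flagged Y and touched since 1 June 2023 pass the filter.
last_run = datetime(2023, 6, 1, tzinfo=timezone.utc)
```

Storing the time of the last successful run and passing it in as `last_run` gives an incremental update without re-uploading every file.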

I wrote this as a POC with the Microsoft Graph API in an Azure Function and also modified it in the Python demo for ease of use and to get feedback. But like I said, I tried the SharePoint Indexer weeks ago; if it has changed to address the issues above, I definitely need to re-examine it. Please let me know. I loved the ease of the SharePoint Indexer.

My challenge remains with Excel. Ours are massive workbooks. I can specify which sheets need to be indexed if I code it manually. Does the SharePoint Indexer handle Excel better?

EDIT: I reconnected the SharePoint Indexer so I could add it as a data source in the Chat Playground. It works well for docx and pdf, but Excel files are a mess.

Please let me know your experiences. Thanks again.

aymiee avatar Jun 23 '23 16:06 aymiee

Hi @aymiee. I'm Debanshu Ganguly, I sent you a connection request on LinkedIn. I have the SharePoint indexer set up in the Azure Cognitive Search service and the index has data. I just need a bit of help figuring out how to modify the existing code to reference the SharePoint index instead of the predefined one, so that I can query my PDFs in SharePoint without having to download them or convert them to blobs and store them in a storage service. Thanks in advance.

dGanguly1 avatar Jul 04 '23 06:07 dGanguly1

Hi @dGanguly1 So you created a Sharepoint Indexer and you have that indexer point to an index that you created. You want to use this new index instead of the default "gptkbindex"? I believe these are the areas where gptkbindex is referenced (and you might check to make sure - depending on the version you have of this demo):

  1. .env file: set the environment variable AZURE_SEARCH_INDEX="gptkbindex".
  2. \app\backend\app.py: there's a reference to the environment variables. If you already set it in step 1, you do not need to set it here, but you might as well.
  3. \infra\main.bicep: this creates the index for gptkbindex, so you may leave this alone since you have created your own SharePoint indexer.
  4. There are two other locations that reference gptkbindex, but they are in the Jupyter notebooks. They point to the environment variables, so if you change it in step 1, you'd be okay here.
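As a rough illustration of steps 1 and 2 (the exact code differs between versions of the demo, and the helper name here is made up), the backend can read the index name from the environment and only fall back to the default when nothing is set:

```python
import os


def search_index_name(environ=os.environ):
    """Return the index name from AZURE_SEARCH_INDEX, defaulting to gptkbindex."""
    # An unset or empty variable falls back to the demo's default index.
    return environ.get("AZURE_SEARCH_INDEX") or "gptkbindex"
```

With this pattern, pointing the app at a different index only requires changing the .env file.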

Hope this helps and hopefully I didn't miss any other location.

Note: In my example, I linked my SharePoint Indexer to the existing index "gptkbindex" so that I don't have to make all those code changes. It's up to you. Linking the indexer to the index will not automatically create the blob in the storage service. The indexer only crawls a data source (like SharePoint) and pulls that data into an index.

By the way, in the Chat Playground you could link to an ACS data source, search service and index without having to go through all of the above. I haven't looked into deploying from the Chat Playground. Looks interesting....

I didn't end up using the SharePoint Indexer because of the complexity of our Excel files. Instead, I used the Microsoft Graph API to pull data from the SharePoint document library, and when doing so I also save each file in Blob Storage because of the supporting documentation/citation features of this demo. I could easily have pointed the index's source to the URL of the SharePoint document library and file instead, but then I might have to deal with appropriate permissions, etc. It might be an option worth looking at.

aymiee avatar Jul 04 '23 12:07 aymiee

On a different note... Because you can sync a SharePoint document library to OneDrive for Business, I am wondering if you can use this demo as is and point the code to the OneDrive folder instead of the /data folder. I wonder if anyone has done that??

aymiee avatar Jul 04 '23 12:07 aymiee

Hi @aymiee , I am not sure whether this security trimming thing can help you refine the control you are after. https://learn.microsoft.com/en-us/azure/search/search-security-trimming-for-azure-search

nhtkid avatar Jul 18 '23 12:07 nhtkid

> I just completed integrating SharePoint. All you need to do is follow the instructions here: https://learn.microsoft.com/en-us/azure/search/search-howto-index-sharepoint-online Then modify app.py to add another search client, and modify the approaches to search the SharePoint index as well.

I created the SP Indexer following the same guide, then added it as a data source in the Chat Playground and deployed it as a web app.

However what I found is that it only worked with DOCX files in the document libraries, not PDF or other formats.

nhtkid avatar Jul 21 '23 12:07 nhtkid

I just created this video showing how I set up the Azure Web App with the SharePoint library. https://www.linkedin.com/posts/leojwang_generativeai-chatgpt-azureai-activity-7088836179508281344-1lCV?utm_source=share&utm_medium=member_desktop

nhtkid avatar Jul 23 '23 11:07 nhtkid

I'm trying to follow... is there a way to add multiple data sources into the AI Studio chat playground? If there is an index for Sharepoint, blob storage, table storage, etc - can multiple data sources be selected? It appears we can select one data source and only one index when adding data to the assistant.

andrewzamer avatar Jul 27 '23 15:07 andrewzamer

@andrewzamer If your question is about the AI Studio Chat with Your Data app, then https://github.com/Microsoft/sample-app-aoai-chatGPT/ is the repository for that codebase. The discussion here is about this codebase, which doesn't have a corresponding UI in the AI Studio. This codebase has prepdocs.py which can only handle a few different formats.

pamelafox avatar Jul 27 '23 16:07 pamelafox

Great question @andrewzamer! You are right about that.

It seems that currently you can only add one data source at a time; multiple sources are not supported in the UI. But you could take the example code for the different data sources and combine them, the way you would use multiple data loaders with LangChain or LlamaIndex.

It is still early days and Azure AI is evolving fast. I really need to look into the Vector preview because the current data sources are not good enough. I reckon embeddings are the way to go.

nhtkid avatar Jul 31 '23 10:07 nhtkid

@aymiee, I was reading your comments and it looks like you have done a lot with Azure indexes and storage. My question is not related to SharePoint, but it is related to indexers. Do you have any idea whether an Azure indexer can be used for pushing data from Blob Storage to a Cognitive Search index?

I tried creating an indexer and mapped it to Azure Blob Storage, and I was able to transfer data (content) to the index using the indexer. But I don't know how to split a blob (say, a single page of a PDF file) into multiple sections and attach this chunking process to the indexer, so that the indexer does all the heavy lifting [whenever a blob changes, it automatically creates chunks of that blob and updates them in the index].

I can create multiple sections (having ~1000 chars each) of a page and store those sections into index via rest api but I want to do it using indexer, so please let me know if you have any idea on this.
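For reference, the client-side splitting described here (sections of roughly 1000 characters, pushed to the index afterwards) can be sketched as follows. This is a simplified fixed-size splitter with an overlap, not the sentence-aware logic in the demo's prepdocs.py; the function name and the overlap default are assumptions.

```python
def split_into_sections(text, max_chars=1000, overlap=100):
    """Yield (offset, section) pairs of at most max_chars, overlapping by `overlap`."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        yield start, text[start:end]
        if end == len(text):
            break
        # Step back a little so text cut at a boundary appears in both sections.
        start = end - overlap
```

The offset returned with each section can be stored alongside it, which helps when mapping a search hit back to its position in the source page.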

sandeeppatidar30 avatar Aug 11 '23 17:08 sandeeppatidar30

Team, is there any API call to add multiple indexes in Azure AI Studio? We have one index from SharePoint and another one from a storage account, and we are planning to add Confluence as well.

https://www.linkedin.com/in/roshith-rajan-92871a14/

roshithrajan avatar Sep 08 '23 16:09 roshithrajan

@roshithrajan Are you referring to modifying this sample or are you using the "Chat on your data" from Azure OpenAI studio? For this sample, many people modify it to have multiple tabs, one for each index. If you're using the Azure OpenAI studio app, please see https://github.com/microsoft/sample-app-aoai-chatGPT/issues/158

pamelafox avatar Sep 08 '23 17:09 pamelafox

In the Azure OpenAI console, is there any way to select the data sources? If we have multiple indexes in Azure Cognitive Search, how do we select among them?

roshithrajan avatar Sep 09 '23 09:09 roshithrajan

@roshithrajan For questions on the studio "Add your data" feature, please post in their issue tracker: https://github.com/microsoft/sample-app-aoai-chatGPT This codebase is not connected to that.

pamelafox avatar Sep 09 '23 13:09 pamelafox

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.

github-actions[bot] avatar Dec 22 '23 01:12 github-actions[bot]