Support for remote info stores like websites, Confluence, SharePoint, etc.

Open emanueol opened this issue 2 years ago • 15 comments

Or must all files exist locally?

In the real world of large enterprises, there's a Confluence server, a Jira server, and a SharePoint server that typically reside in a data center or in a SaaS cloud, plus some on-prem custom HTML, Excel files, etc.

It would be great to have ChatGPT-style ingestion and search across remote systems. How feasible is this? Thanks

emanueol avatar Feb 06 '23 12:02 emanueol

That's a good feature we can start working on. As long as we can prep the data in a neat and readable format, we can ingest it all. But I do think we have to vectorise all this data first, or at least summarise it.

dartpain avatar Feb 07 '23 00:02 dartpain

I was able to pull Confluence data by exporting the whole space as HTML (in a zip), extracting it to the docsgpt folder, changing the glob pattern in ingest_rst.py to point to the extracted HTML files, and using BeautifulSoup to pull out the text content inside the main <div> tags (soup.find_all('div', attrs={"class": "wiki-content group"})). I couldn't get the API to download a whole space at once, but the manual method worked decently, as long as there isn't a huge number of spaces to deal with.
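A minimal sketch of the extraction step described above, for readers who want to try it. It swaps BeautifulSoup for the standard library's html.parser so it runs without extra dependencies; the class name WikiContentExtractor and the helpers extract_wiki_text and extract_space are hypothetical, and only the "wiki-content group" class comes from the comment. It assumes reasonably well-formed export HTML where non-void tags are closed.

```python
from html.parser import HTMLParser
from pathlib import Path

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class WikiContentExtractor(HTMLParser):
    """Collect text found inside <div class="wiki-content group"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while the parser is inside a matching div
        self.chunks = []  # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested element inside the content div
        elif tag == "div" and dict(attrs).get("class") == "wiki-content group":
            self.depth = 1   # entering the main content div

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_wiki_text(html):
    """Return the text of the main content div(s), one fragment per line."""
    parser = WikiContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract_space(export_dir):
    """Map every exported .html file under export_dir to its extracted text."""
    return {
        str(path): extract_wiki_text(path.read_text(encoding="utf-8", errors="ignore"))
        for path in sorted(Path(export_dir).rglob("*.html"))
    }
```

Calling extract_space on the folder where the zip was unpacked gives one text blob per page, which could then be written out in any format the ingester already reads.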

terrafying avatar Feb 07 '23 02:02 terrafying

I suppose there are 2 things:

  • Initial export (Confluence allows exporting a space in a couple of different formats).
  • Incremental updates (Jira supports webhooks, sending JSON via HTTP POST to some listener).
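The incremental-update idea could be sketched as a tiny listener that accepts the POSTed JSON and queues the changed item for re-ingestion. This is an illustration only: the function names, the queue, and the port are hypothetical, and the "webhookEvent" and "issue.key" fields are assumed to follow the usual Jira webhook payload shape, which should be checked against the payload your server actually sends.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from queue import Queue

reingest_queue = Queue()  # (event name, issue key) pairs waiting to be re-ingested


def enqueue_from_payload(body, queue):
    """Parse a webhook body and queue the changed issue key.

    Returns True if something was queued, False if the payload was unusable.
    """
    try:
        event = json.loads(body)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return False
    if not isinstance(event, dict):
        return False
    issue = event.get("issue")
    key = issue.get("key") if isinstance(issue, dict) else None
    if not key:
        return False
    queue.put((event.get("webhookEvent", "unknown"), key))
    return True


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        ok = enqueue_from_payload(self.rfile.read(length), reingest_queue)
        self.send_response(204 if ok else 400)
        self.end_headers()


def serve(port=8080):
    """Run the listener until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

A separate worker would drain reingest_queue, fetch the changed item, and re-embed only that document, so the initial export does not have to be repeated.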

Or maybe it's easier to prep a clone so your GPT stuff doesn't interfere with users, but it all boils down to file formats.

I'm no specialist in what's under the hood of Confluence, Jira, etc., but I'm part of the 0.001% who care about centralizing knowledge for both tech folks (the majority) and business stakeholders. How to decide the organization of information is a very interesting point of discussion, but I suppose some basic high-level metadata (tags, etc.) could be inserted on the source pages to help with categorization, or at least to avoid showing code to a business manager who is just looking for the 5 business rules agreed with Development on some project.

emanueol avatar Feb 07 '23 19:02 emanueol

https://platform.openai.com/docs/tutorials/web-qa-embeddings

Could this be used?

JohnRSim avatar Mar 22 '23 08:03 JohnRSim

I think what needs to be built here is just a module for our parser: basically something that loads data and converts it into one of the supported formats (.rst, .md, .pdf, .docx, .csv, .epub, .html), as DocsGPT already loads these files with ease. It's just the method of scraping files that we need to implement here.
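One way to picture such a module, as a rough sketch and not DocsGPT's actual parser API: whatever fetches the remote pages hands over plain text, and a small writer turns each page into a .md file that the existing loader can pick up. RemotePage, slugify, and write_pages_as_markdown are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RemotePage:
    """A page pulled from a remote system (hypothetical container)."""
    title: str
    text: str
    source_url: str


def slugify(title):
    """Turn a page title into a safe file name stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def write_pages_as_markdown(pages, out_dir):
    """Write each page as a Markdown file and return the created paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, page in enumerate(pages):
        # The index prefix keeps file names unique when titles collide.
        path = out / f"{index:04d}-{slugify(page.title)}.md"
        path.write_text(
            f"# {page.title}\n\nSource: {page.source_url}\n\n{page.text}\n",
            encoding="utf-8",
        )
        written.append(path)
    return written
```

Keeping the source URL in each file means answers can later point back to the original page.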

dartpain avatar Mar 22 '23 09:03 dartpain

This feature would be a killer. TBH I gave it a try and it's a more complex problem than I thought. The OpenAI example tries to fetch every URL on the website and puts all of the text into embeddings. The problem is that there is a lot of gibberish, non-relevant text on these webpages.
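One cheap heuristic for the noise problem, offered as a sketch and not as what the OpenAI tutorial does: menus, footers, and cookie banners tend to repeat across pages, so lines that show up on a large share of crawled pages can be dropped, along with very short fragments. The function name and both thresholds are assumptions to tune per site.

```python
from collections import Counter


def drop_repeated_lines(pages, max_share=0.5, min_words=4):
    """Remove likely boilerplate from crawled pages.

    pages maps URL -> extracted text. A line is dropped when it appears on
    more than max_share of all pages or has fewer than min_words words.
    """
    # Count on how many pages each distinct line occurs (once per page).
    line_pages = Counter()
    for text in pages.values():
        for line in {l.strip() for l in text.splitlines() if l.strip()}:
            line_pages[line] += 1

    limit = max_share * len(pages)
    cleaned = {}
    for url, text in pages.items():
        kept = []
        for raw in text.splitlines():
            line = raw.strip()
            if line and line_pages[line] <= limit and len(line.split()) >= min_words:
                kept.append(line)
        cleaned[url] = "\n".join(kept)
    return cleaned
```

This needs a handful of pages from the same site to work, since a line can only be recognised as repeated once it has been seen more than once.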

We could start with something simple, like Wikipedia. There should be some good Python projects that already do web scraping on Wikipedia. But then again, I think it's a big development effort. I think other projects like AutoGPT do something similar already.

tardigrde avatar Jun 16 '23 17:06 tardigrde

FYI: This nice repo implemented loading text from YT videos as well as websites https://github.com/embedchain/embedchain

tardigrde avatar Jun 28 '23 16:06 tardigrde

Could Deeplake be a good fit? https://github.com/activeloopai/deeplake

KennyDizi avatar Oct 12 '23 06:10 KennyDizi

I think this would be a better fit for an auto fine-tune situation. We need a more general solution so that we can ingest data into different vector stores, whether users want to use faiss, elasticsearch, or pinecone...
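To make the "general solution" idea concrete, here is a hedged sketch of a store-agnostic ingest step. The VectorStore protocol, InMemoryStore, and ingest are hypothetical names, not DocsGPT code. A real adapter for faiss, elasticsearch, or pinecone would implement the same two methods, and embed stands in for whatever embedding function is configured.

```python
import math
from typing import Protocol


class VectorStore(Protocol):
    """Minimal interface each backend adapter would implement (assumed)."""

    def add(self, doc_id, vector, text): ...

    def search(self, vector, k=3): ...


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Toy backend used here only to show the interface in action."""

    def __init__(self):
        self._rows = []

    def add(self, doc_id, vector, text):
        self._rows.append((doc_id, vector, text))

    def search(self, vector, k=3):
        scored = [(cosine(vector, v), doc_id, text) for doc_id, v, text in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return [(doc_id, text) for _, doc_id, text in scored[:k]]


def ingest(chunks, embed, store):
    """Embed (doc_id, text) pairs and add them to any store with the interface."""
    for doc_id, text in chunks:
        store.add(doc_id, embed(text), text)
```

Because ingest only depends on add, swapping the backend would not touch the loaders or the chunking code.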

dartpain avatar Oct 12 '23 17:10 dartpain

Also, @pabik is already working on it in the feature/remote-loads branch. We will also need to do the UI in parallel.

dartpain avatar Oct 12 '23 17:10 dartpain

LlamaIndex provides support for ingestion from these sources. We can look into either integrating or porting these.

  • Confluence: https://llamahub.ai/l/readers/llama-index-readers-confluence?from=
  • Notion: https://llamahub.ai/l/readers/llama-index-readers-notion?from=all
  • Google Docs & Sheets: https://llamahub.ai/l/readers/llama-index-readers-google?from=all

All loaders - https://llamahub.ai/?tab=readers

thefoodiecoder avatar Mar 20 '24 13:03 thefoodiecoder

Currently we have some remote loaders present, but not for Confluence and SharePoint. Check them out here: https://github.com/arc53/DocsGPT/tree/main/application/parser/remote If you want to contribute, we would be very happy!

dartpain avatar Mar 20 '24 13:03 dartpain

Python isn't my forte but let me see if someone from my Data Science team is willing to contribute.

thefoodiecoder avatar Mar 20 '24 13:03 thefoodiecoder