Atlassian Confluence support

Open tonyphoang opened this issue 1 year ago • 20 comments

Does langchain have Atlassian Confluence support like Llama Hub?

tonyphoang avatar Apr 06 '23 05:04 tonyphoang

There are examples on using the Llama Hub code and converting the documents to the LangChain document format:

https://github.com/emptycrown/llama-hub/tree/main

from llama_index import GPTSimpleVectorIndex, download_loader
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

# load documents
GoogleDocsReader = download_loader('GoogleDocsReader')
gdoc_ids = ['1wf-y2pd9C878Oh-FmLH7Q_BQkljdm6TQal-c1pUfrec']
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=gdoc_ids)
langchain_documents = [d.to_langchain_format() for d in documents]

# initialize sample QA chain
llm = OpenAI(temperature=0)
qa_chain = load_qa_chain(llm)
question = "<query here>"
answer = qa_chain.run(input_documents=langchain_documents, question=question)

Substitute the Confluence reader for GoogleDocsReader(), and then you can hook langchain_documents up to a vector index and use them with the different chains that accept documents.

Jflick58 avatar Apr 10 '23 15:04 Jflick58

While the solution suggested by @Jflick58 works, I think adding official Atlassian Confluence support still remains a good idea.

amicus-veritatis avatar Apr 14 '23 16:04 amicus-veritatis

I actually have been working on modifying the Llamahub code into a Confluence LangChain document loader. It's just been on my backlog.

For my own use case, I've been trying to implement an on-the-fly Confluence retrieval tool, so that one could search the Confluence API for keywords from the prompt, then parse and vectorize the relevant pages on-the-fly and use them in a document retrieval chain or as a tool for a LangChain agent. Otherwise you'd have to constantly re-index the Confluence space you are connecting to.

Jflick58 avatar Apr 14 '23 17:04 Jflick58

@hwchase17 happy to formally take this if you want to assign this issue to me.

Jflick58 avatar Apr 14 '23 17:04 Jflick58

hey @Jflick58 just came across this issue, i added the confluence loader to llamahub, and my friend and I just started working on adding jira and confluence tools to langchain. there might be some overlap with what you're working on

do you have a branch for the use case you described? and is the 'search for keywords in the prompt against the Confluence API' step using CQL?

zywilliamli avatar Apr 15 '23 05:04 zywilliamli

@zywilliamli yeah I opened a PR for the document loader: #2965

The retrieval tool that I am working on (no branch currently published, have just been fooling around with it in notebooks) is based on CQL. I've found that searching the prompts directly with CQL is not returning the best results. Hence, I've been playing with generating a list of keywords from the prompt using NLTK or SpaCy, and then using CQL to search for those.

So far, the on-the-fly download, parsing, and vectorizing is actually not terribly slow. I'd like to get the document loader merged in first, then open a new PR with the code using the document loader methods directly.

Jflick58 avatar Apr 16 '23 05:04 Jflick58

This is quite awesome, thanks! Unfortunately, the Confluence instance I'm interested in is protected by a rate limit. It would be awesome if the loader could handle retries after some time (with some backoff algorithm).

icereed avatar Apr 19 '23 08:04 icereed

@icereed I'll work on that. Should be able to be handled pretty easily with a decorator.
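For illustration, a minimal sketch of what such a decorator could look like, using only the standard library. The name `retry_with_backoff` and all of its parameters are hypothetical and are not taken from the loader's actual code; the injectable `sleep` argument is just there so the behaviour can be checked without waiting.

```python
import functools
import random
import time


def retry_with_backoff(max_retries=5, base_delay=1.0, max_delay=60.0,
                       exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped call, doubling the wait after each failure.

    Hypothetical helper: names and defaults are assumptions for this sketch.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries:
                        raise  # out of retries: surface the last error
                    # Exponential backoff capped at max_delay, plus up to 10% jitter.
                    delay = min(max_delay, base_delay * 2 ** attempt)
                    sleep(delay + random.uniform(0, delay * 0.1))
        return wrapper
    return decorator
```

In practice one would probably pass only the exception type raised for rate-limit responses in `exceptions`, so that genuine errors such as bad credentials fail immediately instead of being retried.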

Jflick58 avatar Apr 19 '23 14:04 Jflick58

Amazing 🤩

icereed avatar Apr 19 '23 14:04 icereed

Hello @Jflick58,

Awesome document loader that you created. I had been working on one myself as well before LangChain got created :)

Issue: html2text is GPL licensed which is banned in many companies. Solution: Would you be open to replacing it with BeautifulSoup instead?

text_maker = html2text.HTML2Text()

Something like: BeautifulSoup(text, 'lxml').get_text() ?

Another idea I have implemented is using aiohttp + asyncio to speed up loading of Confluence pages. I will see if I can submit an MR this week to make it faster. If you are also open to it, of course :)

Sincerely, Theau

theauheral avatar Apr 23 '23 22:04 theauheral

Another issue: from experience, get_all_pages_from_space is limited to around 50 or 100 pages when using expand="body.storage.value". In its current state, it looks like the loader won't load more than that.

The workaround I have found is to first find the number of pages in the space using exponential search followed by binary search, and then to process batches of up to the limit number of pages in parallel using aiohttp + asyncio.
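A rough sketch of the counting step, under stated assumptions: `has_page_at` is a hypothetical probe (for example, a request for one page at a given start offset that reports whether anything came back), and it is assumed to be True for every offset below the total and False from there on. The actual HTTP calls and the aiohttp fetching are left out.

```python
def count_pages(has_page_at):
    """Return the number of pages, given a probe has_page_at(i) -> bool.

    Assumes the probe is monotonic: True for every i < total, False afterwards.
    """
    if not has_page_at(0):
        return 0
    # Exponential phase: double the offset until we step past the last page.
    low, high = 0, 1
    while has_page_at(high):
        low, high = high, high * 2
    # Binary phase: has_page_at(low) is True and has_page_at(high) is False.
    while high - low > 1:
        mid = (low + high) // 2
        if has_page_at(mid):
            low = mid
        else:
            high = mid
    return low + 1


def batch_offsets(total, limit):
    """Split [0, total) into (start, size) windows that can be fetched in parallel."""
    return [(start, min(limit, total - start)) for start in range(0, total, limit)]
```

The number of probes grows with the logarithm of the page count, so even a large space needs only a few dozen cheap requests before the parallel batches can be scheduled.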

As mentioned, I will see if I can contribute my code later this week :)

theauheral avatar Apr 23 '23 22:04 theauheral

Apologies for the delay. Fully in support of these enhancements. Thanks!

Jflick58 avatar Apr 25 '23 20:04 Jflick58

Also apologies, I've been a bit busy so I haven't gotten to make much progress on the Confluence tool. @zywilliamli did you experiment with it at all? I saw your Jira tool, that looks really good.

Jflick58 avatar Apr 25 '23 20:04 Jflick58

@Jflick58 submitted this PR today. Tested locally and it was working well. I saw another PR fixing the max pages problem.

https://github.com/hwchase17/langchain/pull/3526

I will create another enhancement to use tokens too.

theauheral avatar Apr 25 '23 20:04 theauheral

Alright, so I've begun working on the tool... my original thought was to use the JQL prompt from the Jira toolkit as a template, then include logic to leverage the Confluence document loader, take in embeddings and vectorstore objects, and use those to embed the docs we retrieve on-the-fly. Then, the Confluence doc search action would first look in the documents we've already indexed before calling out to the Confluence API.

I'm not sure if this makes sense - the other toolkit examples I see tend to be dealing with a smaller amount of data. @zywilliamli do you have any insight?

Branch here if needed, very much WIP: https://github.com/Jflick58/langchain/tree/confluence-tool

Jflick58 avatar Apr 28 '23 23:04 Jflick58

@Jflick58 I'll need to think about this a bit more. What's the workflow or usage you had in mind? In my mind, publishing a page from a LangChain agent prompt sounds a little too coarse, unless the agent can allow for some kind of on-the-fly editing? If it's just for on-the-fly retrieval and Q&A, what you described sounds pretty good (if CQL results prove to be relevant enough).

zywilliamli avatar May 01 '23 06:05 zywilliamli

I'm struggling with getting the CQL to actually return relevant results. It's really the bottleneck right now. The flow I had in mind was that the tool could call CQL with some permutation of the human prompt, load the results, and then ideally vectorize them with embeddings on-the-fly to reduce the total size per doc and enable vector similarity searches.

The alternative is accepting that the CQL search experience is not great, and that folks will just need to periodically re-index their confluence spaces with the document loader + a vector store and use a retrieval chain from the vectorstore.

Jflick58 avatar May 05 '23 19:05 Jflick58

Are you observing worse relevance when using the CQL LangChain tool compared to just doing a CQL search (or normal search) in the Confluence web UI (for the same query)? As far as I'm aware they should go through the same search pipeline, so it'll be interesting if you're seeing a difference.

CQL/Confluence search currently doesn't support semantic search, so if you're passing in a fully natural language query, that might perform badly. An idea to make it more relevant is to have the tool rewrite the query in a more Elasticsearch-friendly way. E.g., go from 'are there any policies on python documentation conventions for the search team?' to 'team search python documentation convention'
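One way to approximate that rewrite, as a sketch only: strip common filler words from the prompt and search for what remains. The stopword list here is a tiny hypothetical stand-in (NLTK or spaCy ship fuller ones), the helper names are made up, and this version neither reorders nor singularizes the keywords.

```python
import re

# Deliberately small, hypothetical stopword list used only for this sketch.
STOPWORDS = {
    "a", "an", "the", "is", "are", "there", "any", "on", "for", "of", "in",
    "to", "what", "our", "do", "we", "have", "and", "or",
}


def extract_keywords(prompt):
    """Lowercase the prompt, drop stopwords and duplicates, keep word order."""
    keywords = []
    for word in re.findall(r"[a-z0-9]+", prompt.lower()):
        if word not in STOPWORDS and word not in keywords:
            keywords.append(word)
    return keywords


def build_cql(prompt):
    """Build a CQL text search over pages from the prompt's keywords."""
    keywords = extract_keywords(prompt)
    if not keywords:
        raise ValueError("prompt contains no searchable keywords")
    # Only [a-z0-9] tokens survive extraction, so no quote escaping is needed.
    return 'type = page AND text ~ "{}"'.format(" ".join(keywords))
```

For the example prompt above, the keywords come out as policies, python, documentation, conventions, search, team. Whether this actually improves relevance on a given Confluence instance would still need to be checked against the web UI results.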

Also a note on the periodic re-index comment: you don't need to reindex the whole space, cql lets you fetch all the new and/or updated pages given a time period, so you can just refresh the stale documents periodically, which would save on compute, time and tokens.
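As a sketch of that refresh query: the helper below builds a CQL string selecting pages in one space modified since the last sync. The function name is hypothetical, and the `lastmodified` field and the date format are written from memory of the CQL documentation, so verify them (and how your instance interprets time zones) before relying on this.

```python
from datetime import datetime


def stale_pages_cql(space_key, last_sync):
    """Build a CQL query for pages in a space modified at or after last_sync.

    Assumption: CQL accepts dates in "yyyy-MM-dd HH:mm" form for lastmodified.
    """
    stamp = last_sync.strftime("%Y-%m-%d %H:%M")
    return 'space = "{}" AND type = page AND lastmodified >= "{}"'.format(
        space_key, stamp
    )


# Example: everything in the (made-up) DOCS space changed since 1 May 2023, 09:30.
query = stale_pages_cql("DOCS", datetime(2023, 5, 1, 9, 30))
```

Only the pages returned by this query would then need to be re-parsed and re-embedded.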

zywilliamli avatar May 09 '23 02:05 zywilliamli

Yeah, I'm seeing better outcomes with the web ui. If I type in "what is our __________ policy?" into the web ui search, the top result is the page that delineates that policy, and it appears to search across all spaces I have access to.

Using CQL, with query = "text~'what is our ________ policy?'", I get totally irrelevant results. Is it the fuzzy search?

Jflick58 avatar May 15 '23 23:05 Jflick58

thanks, that's good to know. i'll ask around about what the difference in ranking implementation between the two is, and how to get the regular search relevance through the api

zywilliamli avatar May 20 '23 09:05 zywilliamli

Any progress on this? I'm interested in exploring this feature

dcieslak19973 avatar Jun 11 '23 16:06 dcieslak19973

Hi, did you look at the pagination of the data load for Confluence?

v2br avatar Jul 24 '23 05:07 v2br

I would use LlamaIndex and LangChain with a vector DB like Pinecone; then you have full semantic search and don't need to worry about CQL and keywords. The fly in the ointment, which is needed as an extended function, is refreshing changed docs and upserting either instantly on change or at least on a timer, so that you have a data source sync for your docstore. I have the first part working fine, just not the refreshing capability yet, as that's a lot harder.
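A minimal sketch of the bookkeeping such a refresh might need, assuming each fetched page is a dict carrying an `id` and a `version` number and that the indexer remembers which version of each page it last embedded. The function name and the data shapes are hypothetical, and the actual vector store upsert and delete calls are left out.

```python
def plan_sync(indexed_versions, remote_pages):
    """Decide which pages to (re-)embed and which to drop from the index.

    indexed_versions: {page_id: version} for pages already in the vector store.
    remote_pages: iterable of dicts with at least "id" and "version" keys.
    """
    remote = {page["id"]: page for page in remote_pages}
    # New pages and pages whose version changed need to be upserted.
    to_upsert = [
        page for page_id, page in remote.items()
        if indexed_versions.get(page_id) != page["version"]
    ]
    # Pages that no longer exist remotely should be removed from the index.
    to_delete = [page_id for page_id in indexed_versions if page_id not in remote]
    return to_upsert, to_delete
```

Run on a timer, this keeps the docstore in step with the space while embedding only what changed. Note that detecting deletions this way requires listing every page ID in the space, not just the recently modified ones.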

ochapple avatar Sep 05 '23 23:09 ochapple

Hi, @tonyphoang! I'm helping the LangChain team manage their backlog and am marking this issue as stale.

From what I understand, the issue is inquiring about the availability of Atlassian Confluence support similar to Llama Hub in langchain. There have been discussions and examples provided by contributors, along with suggestions for enhancements and challenges with CQL search relevance. The issue remains unresolved at this time.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

dosubot[bot] avatar Dec 06 '23 17:12 dosubot[bot]