Confluence loader only returns partial `pageContent`
Checked other resources
- [X] I added a very descriptive title to this issue.
- [X] I searched the LangChain.js documentation with the integrated search.
- [X] I used the GitHub search to find a similar question and didn't find it.
- [x] I am sure that this is a bug in LangChain.js rather than my code.
- [X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
const loader = new ConfluencePagesLoader({
  baseUrl: 'xxx',
  spaceKey: 'xxx',
  username: 'xxx',
  accessToken: 'xxx',
});
const docs = await loader.load();
Error Message and Stack Trace (if applicable)
N/A
Description
When I use this code to load a Confluence page that contains SQL code blocks, the returned `pageContent` drops the contents of each SQL block and keeps only the `sql` label, for example:
{
  pageContent: 'VIEW ALL\nsql\nASSIGN TO AN ACCOUNT\nThe account must already exist.\nsql\n'
}
I'm not sure whether this is a limitation (or feature) of the Atlassian API.
System Info
platform: windows
node version: v22.2.0
yarn version: 1.22.22
Hey @jazelly! I'm here to help you with any issues you're facing. I can assist with bug fixes, answer questions, and guide you on becoming a contributor. Let me know how I can assist you!
The Confluence loader in LangChain.js returns partial `pageContent` and drops SQL code blocks because it converts each page's HTML body to plain text with the `htmlToText` function. That conversion discards formatting and can also discard content, including code blocks [1].
One possible workaround is to pass custom options to `htmlToText` so that `<pre>` blocks are kept during conversion. Here is an example configuration:
const { htmlToText } = require('html-to-text');

const options = {
  preserveNewlines: true,
  // Custom formatters go under `formatters` and are mapped to tags
  // through `selectors` (html-to-text v8 and later).
  formatters: {
    preBlock: function (elem, walk, builder, formatOptions) {
      builder.openBlock({ leadingLineBreaks: 1 });
      walk(elem.children, builder);
      builder.closeBlock({ trailingLineBreaks: 1 });
    },
  },
  selectors: [{ selector: 'pre', format: 'preBlock' }],
};

const text = htmlToText('<pre><code class="language-sql">SELECT * FROM users;</code></pre>', options);
console.log(text);
This configuration should keep the text inside `<pre>` elements during the conversion process [2].
Additionally, ensure that the `expand` parameter is set to `body.storage` when making the API request, so that the response includes the full page content, including code blocks [1].
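As a rough illustration of the `expand` advice, the sketch below builds a Confluence content-listing URL that requests the storage-format body. The helper name `buildContentUrl`, its parameters, and the example host are hypothetical and not part of LangChain.js; the `/rest/api/content` path and the `spaceKey`, `start`, `limit`, and `expand` query parameters are assumed to match the Confluence REST API your instance exposes.

```typescript
// Hypothetical helper (not a LangChain.js API): builds a content-listing URL
// that asks Confluence to include each page's storage-format body.
function buildContentUrl(
  baseUrl: string,
  spaceKey: string,
  start = 0,
  limit = 25
): string {
  // Strip trailing slashes so the path is not doubled up.
  const url = new URL(`${baseUrl.replace(/\/+$/, "")}/rest/api/content`);
  url.searchParams.set("spaceKey", spaceKey);
  url.searchParams.set("start", String(start));
  url.searchParams.set("limit", String(limit));
  // Assumption: page bodies are only returned when explicitly expanded.
  url.searchParams.set("expand", "body.storage");
  return url.toString();
}

// Example host is a placeholder.
const contentUrl = buildContentUrl("https://example.atlassian.net/wiki", "DOCS");
console.log(contentUrl);
```

Inspecting the raw `body.storage.value` returned by such a request is a quick way to check whether the code is missing from the API response or lost later during text conversion.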
To continue talking to Dosu, mention @dosu.
@dosu how do I preserve the HTML response instead of the plain-text response when using `load()`? It looks like this option is not exposed the way it is in the Python version.
To preserve the HTML response instead of converting it to plain text when using the `load()` function in the Confluence loader, you need to modify the `createDocumentFromPage` method. Here is how you can do it:
- Modify the `createDocumentFromPage` method: Update the method to use the HTML content directly instead of converting it to plain text.
private createDocumentFromPage(page: ConfluencePage): Document {
  // Use the HTML content directly
  const htmlContent = page.body.storage.value;

  // Generate the URL
  const pageUrl = `${this.baseUrl}/spaces/${this.spaceKey}/pages/${page.id}`;

  // Return a langchain document with HTML content
  return new Document({
    pageContent: htmlContent,
    metadata: {
      id: page.id,
      status: page.status,
      title: page.title,
      type: page.type,
      url: pageUrl,
      version: page.version?.number,
      updated_by: page.version?.by?.displayName,
      updated_at: page.version?.when,
    },
  });
}
- Ensure the `load` method uses the updated `createDocumentFromPage`: The `load` method will automatically use the updated `createDocumentFromPage` method to preserve the HTML content.
public async load(options?: {
  start?: number;
  limit?: number;
}): Promise<Document[]> {
  try {
    const pages = await this.fetchAllPagesInSpace(
      options?.start,
      options?.limit
    );
    return pages.map((page) => this.createDocumentFromPage(page));
  } catch (error) {
    console.error("Error:", error);
    return [];
  }
}
By making these changes, the load() function will preserve the HTML response instead of converting it to plain text [1].
I'm pretty sure html-to-text does not support this: the data returned by Atlassian is XML rather than plain HTML, and html-to-text is unable to process it. I propose using xmldom to parse it instead.
I think the Python version can handle this because it uses BeautifulSoup.
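To make the XML point concrete, here is a minimal sketch of pulling code-block text out of a storage-format body. It is not the xmldom approach proposed above: it swaps in plain regular expressions so that it runs without dependencies, and a real XML parser would be more robust. It assumes code blocks arrive as an `ac:structured-macro` named `code` whose text sits in a CDATA section inside `ac:plain-text-body`; the `extractCodeBlocks` helper, the `CodeBlock` type, and the sample markup are hypothetical.

```typescript
// Hypothetical shape for an extracted code block.
interface CodeBlock {
  language: string | null;
  code: string;
}

// Sketch only: regex-based extraction, standing in for a proper XML parser.
// Assumes non-nested code macros with a single CDATA body each.
function extractCodeBlocks(storageXml: string): CodeBlock[] {
  const macroRe =
    /<ac:structured-macro\b[^>]*\bac:name="code"[^>]*>([\s\S]*?)<\/ac:structured-macro>/g;
  const languageRe =
    /<ac:parameter\b[^>]*\bac:name="language"[^>]*>([^<]*)<\/ac:parameter>/;
  const bodyRe =
    /<ac:plain-text-body>\s*<!\[CDATA\[([\s\S]*?)\]\]>\s*<\/ac:plain-text-body>/;

  const blocks: CodeBlock[] = [];
  let match: RegExpExecArray | null;
  while ((match = macroRe.exec(storageXml)) !== null) {
    const inner = match[1];
    const body = bodyRe.exec(inner);
    if (!body) continue; // macro without a CDATA body: nothing to extract
    const language = languageRe.exec(inner);
    blocks.push({
      language: language ? language[1].trim() : null,
      code: body[1],
    });
  }
  return blocks;
}

// Sample markup is an assumption about what the storage format looks like.
const sample =
  "<h2>VIEW ALL</h2>" +
  '<ac:structured-macro ac:name="code">' +
  '<ac:parameter ac:name="language">sql</ac:parameter>' +
  "<ac:plain-text-body><![CDATA[SELECT * FROM users;]]></ac:plain-text-body>" +
  "</ac:structured-macro>";

console.log(extractCodeBlocks(sample));
```

If the real payload matches this shape, the extracted text could be spliced back into `pageContent` before or after the plain-text conversion, which would keep the SQL that is currently lost.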