Issue: Unable to load xml files

Open michelle-wheatley opened this issue 1 year ago • 2 comments

Issue you'd like to raise.

I am trying to create a FAISS index with xml files that I have downloaded. However, there doesn't seem to be a loader available for this. Are there any workarounds, or plans to add in a loader for xml files that can't be loaded with MWDumpLoader?

Suggestion:

Highlight a workaround for loading xml files in the documentation or add a document loader for them

michelle-wheatley avatar May 17 '23 14:05 michelle-wheatley

🤖 AI-generated response by Steercode - chat with Langchain codebase Disclaimer: SteerCode Chat may provide inaccurate information about the Langchain codebase.

Currently, there is no built-in loader for XML files other than MediaWiki XML dump files. However, you can create a custom loader for your specific XML files by extending the BaseLoader class and implementing the load method for your use case.

Here's a simple example of a custom XML loader that you can use as a starting point:

import xml.etree.ElementTree as ET
from typing import List, Optional

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class CustomXMLLoader(BaseLoader):
    def __init__(self, file_path: str, encoding: Optional[str] = None):
        # BaseLoader takes no constructor arguments, so store the settings here.
        self.file_path = file_path
        self.encoding = encoding

    def load(self) -> List[Document]:
        # encoding=None lets the parser use the encoding declared in the XML file.
        parser = ET.XMLParser(encoding=self.encoding)
        root = ET.parse(self.file_path, parser=parser).getroot()

        docs = []
        for elem in root:
            # Placeholder extraction: the "text" and "title" child tags are
            # examples only; replace them with the tags your files use.
            text = elem.findtext("text", default="")
            metadata = {"source": self.file_path, "title": elem.findtext("title", default="")}
            docs.append(Document(page_content=text, metadata=metadata))

        return docs

You'll need to modify the example to suit your specific XML structure and the data you want to extract.
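
As a rough illustration of that adaptation step, here is a stdlib-only sketch of the per-element extraction, run against an inline sample instead of a file. The `<record>`, `<title>`, and `<text>` tags, the sample data, and the `extract_records` helper are all made up for this example; the returned `(text, metadata)` pairs are only stand-ins for what you would pass to `Document(page_content=..., metadata=...)`.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Hypothetical sample; real files will use different tag names.
SAMPLE_XML = """<records>
  <record><title>Intro</title><text>First body.</text></record>
  <record><title>Usage</title><text>Second body.</text></record>
  <record><title>Empty</title></record>
</records>"""


def extract_records(xml_text: str) -> List[Tuple[str, Dict[str, str]]]:
    """Return (page_content, metadata) pairs, one per child of the root."""
    root = ET.fromstring(xml_text)
    records = []
    for elem in root:
        # findtext returns the default when the child tag is missing,
        # unlike elem.find(...).text, which raises AttributeError on None.
        text = elem.findtext("text", default="").strip()
        if not text:
            continue  # skip elements that carry no body text
        records.append((text, {"title": elem.findtext("title", default="")}))
    return records


records = extract_records(SAMPLE_XML)
for text, metadata in records:
    print(metadata["title"], "->", text)
```

Skipping empty elements is a design choice for this sketch, on the assumption that empty documents are not useful in an index; drop the check if you would rather keep them.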

There are no immediate plans to add a generic XML loader to the LangChain codebase, but contributions are always welcome if you'd like to submit a pull request with an implementation that could benefit other users.

You can continue the conversation at https://steercode.com/

votrumar avatar May 18 '23 09:05 votrumar

The generated response doesn't address the need for semantic XML element names or attributes to weigh heavily in the precision of the model. For example, the DITA XML OASIS standard for technical documentation has elements to distinguish between different aspects of a procedure, such as prerequisites, steps, troubleshooting notes, and examples. A troubleshooting topic file will have semantic tags for describing a condition, followed by cause/remedy pairs.

I'd also think that any XML schemas defining a document's syntax could be ingested to help create the most accurate model.
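
One way to keep that semantic information when loading, sketched here with only the Python standard library, is to walk the tree and record each element's tag and path alongside its text. The sample is a simplified, DITA-like snippet that has not been validated against the DITA schema, and `iter_semantic_chunks` is a hypothetical helper, not part of LangChain. Whether prefixing the tag to the text actually improves retrieval or classification is an assumption that would need testing.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, Tuple

# Simplified, DITA-like sample; not validated against the DITA schema.
SAMPLE = """<troubleshooting id="t1">
  <title>Printer offline</title>
  <condition>The printer shows as offline.</condition>
  <troubleSolution>
    <cause>The USB cable is loose.</cause>
    <remedy>Reseat the cable.</remedy>
  </troubleSolution>
</troubleshooting>"""


def iter_semantic_chunks(
    elem: ET.Element, path: str = ""
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (text, metadata) for every element that has leading text of its own.

    Tail text (text after a child element in mixed content) is ignored here.
    """
    current = f"{path}/{elem.tag}"
    text = (elem.text or "").strip()
    if text:
        # Prefix the tag so the semantic role is still visible in plain text,
        # and keep the tag, path, and attributes as metadata for filtering.
        yield f"[{elem.tag}] {text}", {"tag": elem.tag, "path": current, **elem.attrib}
    for child in elem:
        yield from iter_semantic_chunks(child, current)


chunks = list(iter_semantic_chunks(ET.fromstring(SAMPLE)))
for text, metadata in chunks:
    print(metadata["path"], "|", text)
```

Each pair could then be wrapped in a `Document`, so that a cause and its remedy stay distinguishable after the XML markup is gone.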

kirkilj avatar May 21 '23 17:05 kirkilj

Hi, @michelle-wheatley! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were having trouble loading XML files into a FAISS index and were looking for a workaround or plans to add a loader for XML files. One user suggested creating a custom loader by extending the BaseLoader class and provided an example. Another user mentioned the need for semantic XML element names and attributes to improve model precision and suggested ingesting XML schemas for accurate model creation.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

dosubot[bot] avatar Sep 12 '23 16:09 dosubot[bot]

I used this approach with Bedrock and the Claude v1 model:

from pathlib import Path

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_community.document_loaders import UnstructuredXMLLoader
from langchain_community.llms import Bedrock

# `bedrock` (a Bedrock runtime client) and `modelId` are assumed to be defined earlier.
template = """

Given a list of classes, classify the document only into one of these classes. Skip any preamble text and just give the class name.
Do not add any other details apart from the class name.

<classes>CAR, BIKE, PLANE</classes>
<document>{doc_text}</document>
<classification>"""
prompt = PromptTemplate(template=template, input_variables=["doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id=modelId)
llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)

for path in Path("files_samples/").glob("**/*.xml"):
    loader = UnstructuredXMLLoader(str(path))
    document = loader.load()

    # Skip documents that are too large to send in a single prompt.
    print("doc size is: ", len(document[0].page_content))
    if len(document[0].page_content) > 40000:
        print("document size is too big")
        continue

    class_name = llm_chain.run(document[0].page_content)

    print(f"The object is:  {class_name}")
    print("\n")

zdarova avatar Jan 10 '24 13:01 zdarova