langchain
Xlsx loader
How can we load an xlsx file directly in langchain, just like the CSV loader? I could not find anything in the documentation.
Hi @Kashif-Raza6 I built a new XLSXLoader for loading .xlsx files. Please try it out and if it works I will create PR. Let me know if you have any issues, feel free to post the XLSX file so I can test on my end as well.
It uses openpyxl, so if you haven't installed it yet, install it with `pip install openpyxl`.
```python
from typing import Dict, List, Optional

from openpyxl import load_workbook

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class XLSXLoader(BaseLoader):
    """Loads an XLSX file into a list of documents.

    Each document represents one row of the XLSX file. Every row is converted
    into a key/value pair and outputted to a new line in the document's
    page_content.

    The source for each document loaded from xlsx is set to the value of the
    `file_path` argument for all documents by default. You can override this
    by setting the `source_column` argument to the name of a column in the
    XLSX file. The source of each document will then be set to the value of
    the column with the name specified in `source_column`.

    Output Example:
        .. code-block:: txt

            column1: value1
            column2: value2
            column3: value3
    """

    def __init__(
        self,
        file_path: str,
        source_column: Optional[str] = None,
        sheet_name: Optional[str] = None,
        encoding: Optional[str] = None,
    ):
        self.file_path = file_path
        self.source_column = source_column
        self.sheet_name = sheet_name
        self.encoding = encoding

    def load(self) -> List[Document]:
        docs = []
        wb = load_workbook(filename=self.file_path, read_only=True, data_only=True)
        # Use the named sheet if given, otherwise the active one.
        ws = wb[self.sheet_name] if self.sheet_name else wb.active
        # Treat the first row as column headers.
        headers = [cell.value for cell in ws[1]]
        for i, row in enumerate(ws.iter_rows(min_row=2)):
            row_values = [cell.value for cell in row]
            row_dict = dict(zip(headers, row_values))
            content = "\n".join(
                f"{k.strip()}: {v.strip()}"
                for k, v in row_dict.items()
                if v is not None
            )
            if self.source_column is not None:
                source = row_dict[self.source_column]
            else:
                source = self.file_path
            metadata = {"source": source, "row": i}
            docs.append(Document(page_content=content, metadata=metadata))
        return docs
```
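For clarity, the core row-to-document conversion the loader performs can be sketched with plain Python and hypothetical sample data (no spreadsheet or langchain install needed). Each data row is zipped against the header row, and non-empty cells become `key: value` lines in the page content:

```python
# Hypothetical sample data mimicking a worksheet: the first row of the sheet
# supplies the headers, later rows supply the data (all strings here, as the
# loader above assumes).
headers = ["column1", "column2", "column3"]
rows = [
    ["value1 ", " value2", "value3"],
]

docs = []
for i, row_values in enumerate(rows):
    # Pair each cell with its header, as XLSXLoader.load does.
    row_dict = dict(zip(headers, row_values))
    # One "key: value" line per non-empty cell.
    content = "\n".join(
        f"{k.strip()}: {v.strip()}" for k, v in row_dict.items() if v is not None
    )
    docs.append({"page_content": content, "metadata": {"source": "example.xlsx", "row": i}})

print(docs[0]["page_content"])
# column1: value1
# column2: value2
# column3: value3
```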
It returns an error: `'NoneType' object has no attribute 'strip'`. How do I resolve it?
Hello, xlsx is supported in the Unstructured library as of release 0.6.7. Can we please add support for xlsx in langchain as well?
See: https://github.com/Unstructured-IO/unstructured/issues/587#issuecomment-1555018550
> it returns an error: 'NoneType' object has no attribute 'strip', how to resolve it?
The input data (.xlsx) you are using might contain a non-text (non-string) value. Better to convert `v` to a string before calling `.strip()` and rerun it. It worked for me.
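A minimal sketch of that fix, using a hypothetical row dict in place of real spreadsheet data: wrapping both the header key and the cell value in `str()` makes `.strip()` safe for numeric cells (and for `None` header cells, which is where the `'NoneType'` error comes from).

```python
# Hypothetical row as XLSXLoader would build it: note the non-string value 30
# and the empty cell. Coercing with str() avoids AttributeError on .strip().
row_dict = {"name": "Alice", "age": 30, "note": None}

content = "\n".join(
    f"{str(k).strip()}: {str(v).strip()}"
    for k, v in row_dict.items()
    if v is not None  # skip empty cells, as in the original loader
)

print(content)
# name: Alice
# age: 30
```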
Hey @pachgadehardik, can you explain what `v` refers to in your comment?
This can be closed now that xlsx files are supported through Unstructured:
https://python.langchain.com/docs/integrations/document_loaders/excel
Hi, @Kashif-Raza6! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue was about finding a way to directly load xlsx files in langchain. It seems that a user named manuel-soria has built a new XLSXLoader for loading .xlsx files and has provided code for it. Additionally, another user named ptkinvent suggests adding support for xlsx in langchain, as it is already supported in the Unstructured library.
I'm happy to inform you that the issue has been resolved. XLSX files can now be directly loaded in langchain through the new XLSXLoader built by manuel-soria. Support for xlsx files has been added to langchain, as it is already supported in the Unstructured library.
If this issue is still relevant to the latest version of the LangChain repository, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself. If no further action is taken, the issue will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository! Let me know if you have any further questions or concerns.