Xlsx loader

Open Kashif-Raza6 opened this issue 1 year ago • 3 comments

How can we load an xlsx file directly in langchain, just like with the CSV loader? I could not find this in the documentation.

Kashif-Raza6 avatar Mar 21 '23 17:03 Kashif-Raza6

Hi @Kashif-Raza6, I built a new XLSXLoader for loading .xlsx files. Please try it out, and if it works I will create a PR. Let me know if you have any issues, and feel free to post the XLSX file so I can test on my end as well.

It uses openpyxl, so if you haven't installed it yet, install it with pip install openpyxl.

from openpyxl import load_workbook
from typing import Dict, List, Optional
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class XLSXLoader(BaseLoader):
   """Loads an XLSX file into a list of documents.

   Each document represents one row of the XLSX file. Every row is converted into a
   key/value pair and outputted to a new line in the document's page_content.

   The source for each document loaded from xlsx is set to the value of the
   'file_path' argument for all documents by default.
   You can override this by setting the 'source_column' argument to the
   name of a column in the XLSX file.
   The source of each document will then be set to the value of the column
   with the name specified in 'source_column'.

   Output Example:
       .. code-block:: txt

           column1: value1
           column2: value2
           column3: value3
   """

   def __init__(
           self,
           file_path: str,
           source_column: Optional[str] = None,
           sheet_name: Optional[str] = None,
           encoding: Optional[str] = None,
   ):
      self.file_path = file_path
      self.source_column = source_column
      self.sheet_name = sheet_name
      self.encoding = encoding

   def load(self) -> List[Document]:
      docs = []

      wb = load_workbook(filename=self.file_path, read_only=True, data_only=True)
      ws = wb[self.sheet_name] if self.sheet_name else wb.active

      headers = [cell.value for cell in ws[1]]

      for i, row in enumerate(ws.iter_rows(min_row=2)):
         row_values = [cell.value for cell in row]
         row_dict = dict(zip(headers, row_values))

         content = "\n".join(f"{k.strip()}: {v.strip()}" for k, v in row_dict.items() if v is not None)
         if self.source_column is not None:
            source = row_dict[self.source_column]
         else:
            source = self.file_path
         metadata = {"source": source, "row": i}
         doc = Document(page_content=content, metadata=metadata)
         docs.append(doc)

      return docs

manuel-soria avatar Mar 22 '23 13:03 manuel-soria

It returns an error: 'NoneType' object has no attribute 'strip'. How can I resolve it?

yawudede avatar Apr 13 '23 07:04 yawudede

Hello, xlsx is supported in the Unstructured library as of release 0.6.7. Can we please add support for xlsx in langchain as well?

See: https://github.com/Unstructured-IO/unstructured/issues/587#issuecomment-1555018550

ptkinvent avatar May 19 '23 23:05 ptkinvent

> It returns an error: 'NoneType' object has no attribute 'strip'. How can I resolve it?

The .xlsx file you are using might contain a non-text (non-string) value. It is better to convert 'v' to a string and rerun it. That worked for me.
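For anyone hitting the same error, here is a minimal sketch of that fix, pulled out of the loader so it runs without openpyxl. The helper name row_to_content is hypothetical and not part of the loader posted above. It casts both the header 'k' and the cell value 'v' to str before calling .strip(). Since the original line already skips cells where v is None, a 'NoneType' error most likely comes from an empty header cell, so this sketch assumes it is acceptable to skip columns that have no header.

```python
from typing import Any, List, Sequence


def row_to_content(headers: Sequence[Any], row_values: Sequence[Any]) -> str:
    """Format one spreadsheet row as 'header: value' lines (hypothetical helper)."""
    lines: List[str] = []
    for k, v in zip(headers, row_values):
        # Empty cells and columns without a header both come back as None; skip them.
        if k is None or v is None:
            continue
        # Cast to str before stripping: cells may hold ints, floats, dates, and so on.
        lines.append(f"{str(k).strip()}: {str(v).strip()}")
    return "\n".join(lines)


if __name__ == "__main__":
    headers = ["name", "age", None]
    row = [" Alice ", 30, "value in a column with no header"]
    print(row_to_content(headers, row))
```

Inside the loader, the same idea would replace the line that builds content, for example by using str(k).strip() and str(v).strip() in the f-string.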

pachgadehardik avatar May 30 '23 07:05 pachgadehardik

Hey @pachgadehardik, can you explain 'v' in your comment?

Omkar19202 avatar Aug 21 '23 10:08 Omkar19202

This can be closed now that xlsx files are supported through unstructured:

https://python.langchain.com/docs/integrations/document_loaders/excel
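A rough sketch of what that looks like, assuming a recent install where the loader is importable from langchain_community.document_loaders (older releases exposed it from langchain.document_loaders) and that the unstructured package with its Excel extras is installed. The wrapper function load_excel_documents and the file name are placeholders, not part of langchain.

```python
from typing import Any, List


def load_excel_documents(path: str, mode: str = "elements") -> List[Any]:
    """Load an .xlsx file into langchain Documents (hypothetical wrapper)."""
    # Imported lazily so this module can be imported even if langchain is missing.
    # Assumption: a recent langchain; older versions used langchain.document_loaders.
    from langchain_community.document_loaders import UnstructuredExcelLoader

    # mode="elements" keeps each detected element (for example a table) as its own Document.
    loader = UnstructuredExcelLoader(path, mode=mode)
    return loader.load()


if __name__ == "__main__":
    # "example.xlsx" is a placeholder path.
    for doc in load_excel_documents("example.xlsx"):
        print(doc.metadata.get("source"), doc.page_content[:80])
```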

ptkinvent avatar Aug 26 '23 23:08 ptkinvent

Hi, @Kashif-Raza6! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue was about finding a way to directly load xlsx files in langchain. It seems that a user named manuel-soria has built a new XLSXLoader for loading .xlsx files and has provided code for it. Additionally, another user named ptkinvent suggests adding support for xlsx in langchain, as it is already supported in the Unstructured library.

I'm happy to inform you that the issue appears to be resolved. As ptkinvent noted, xlsx files can now be loaded in langchain through the Excel loader built on the Unstructured library.

If this issue is still relevant to the latest version of the LangChain repository, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself. If no further action is taken, the issue will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository! Let me know if you have any further questions or concerns.

dosubot[bot] avatar Nov 25 '23 16:11 dosubot[bot]