
Unable to load .eml email files from DirectoryLoader with class UnstructuredEmailLoader: 'utf-8' codec can't decode byte 0x92 in position

Open sanasz91mdev opened this issue 1 year ago • 1 comments

System Info

LangChain version: 0.0.162
Platform: Windows
Python version: 3.11.3

I am trying to load all .eml files from a directory using DirectoryLoader with loader_cls=UnstructuredEmailLoader to build an index, but I get an error from the load function.

error: 'utf-8' codec can't decode byte 0x92 in position 141: invalid start byte

Code:

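import os

from langchain.document_loaders import DirectoryLoader, UnstructuredEmailLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# `embeddings` is assumed to be defined elsewhere; it is passed to FAISS below
def load_data():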
    if os.path.exists("./test"):
        # Load documents from data directory
        print('loading docs from directory ...')
        loader = DirectoryLoader('./test',loader_cls=UnstructuredEmailLoader)
        raw_documents = loader.load()
        print('loaded docs')
        #Splitting documents into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
        )
        documents = text_splitter.split_documents(raw_documents)
        print(len(documents))
        # Print each chunk for inspection
        for x in documents:
            print(x)
        # Creating index and saving it to disk
        print("Creating index")
        db = FAISS.from_documents(documents, embeddings)
        db.save_local("./index")
    else:
        raise Exception("No data or index found")

The following is the code I am using to read an email:

def get_email(data):
    # TODO: add try/except
    raw_email_string = data[0][1].decode('utf-8')
    # Getting the email 
    email_obj = Email(raw_email_string)
    return email_obj

The problem is that there are over 5,000 emails, and the exception does not even say which file is causing the issue.

How can I fix this?

Who can help?

@hwchase17 @eyurtsev @vowelparrot

Information

  • [ ] The official example notebooks/scripts
  • [ ] My own modified scripts

Related Components

  • [ ] LLMs/Chat Models
  • [ ] Embedding Models
  • [ ] Prompts / Prompt Templates / Prompt Selectors
  • [ ] Output Parsers
  • [X] Document Loaders
  • [ ] Vector Stores / Retrievers
  • [ ] Memory
  • [ ] Agents / Agent Executors
  • [X] Tools / Toolkits
  • [ ] Chains
  • [ ] Callbacks/Tracing
  • [ ] Async

Reproduction

Use code to test issue.

Expected behavior

  1. It should load all .eml files within the provided directory.
  2. In case of an exception, it should at least print the filename.
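
One way to get the behavior in item 2 is to load the files one at a time instead of handing the whole directory to a single call, so each failure can be tied to a path. The sketch below is only an illustration: `load_each` and its `load_one` parameter are hypothetical names, not part of LangChain, and `load_one` stands in for whatever function loads a single file.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def load_each(
    directory: str,
    load_one: Callable[[Path], list],
    pattern: str = "*.eml",
) -> Tuple[list, List[Tuple[Path, Exception]]]:
    """Load every matching file separately and record which ones fail.

    `load_one` is a hypothetical callback that returns a list of
    documents for a single path and may raise on bad input.
    """
    docs: list = []
    failures: List[Tuple[Path, Exception]] = []
    for path in sorted(Path(directory).rglob(pattern)):
        try:
            docs.extend(load_one(path))
        except Exception as exc:
            # Keep going, but remember and report the offending file.
            failures.append((path, exc))
            print(f"failed to load {path}: {exc}")
    return docs, failures
```

With the loader from this issue, the callback could be something like `lambda p: UnstructuredEmailLoader(str(p)).load()`, and the returned `failures` list then names every file that raised.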

sanasz91mdev, May 11 '23 08:05

The problem is that some emails contain characters that cannot be decoded as UTF-8. I changed the unstructured code in core.py:

import quopri

def replace_mime_encodings(text: str, encoding: str = "utf-8") -> str:
    """Replaces MIME encodings with their equivalent characters in the specified encoding.

    Example
    -------
    5 w=E2=80=99s -> 5 w’s
    """
    decoded = quopri.decodestring(text.encode())
    try:
        return decoded.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to Windows-1252 when the bytes are not valid UTF-8
        return decoded.decode("cp1252")

And that worked. But should I ask Unstructured to create a PR to fix this? There will be many emails that do not pass the existing method.
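
To make the failure concrete: byte 0x92 is the right single quotation mark (’) in Windows-1252, but it is not a valid start byte in UTF-8, which is what the error message reports. The sketch below runs the same fallback idea on two made-up sample strings; `decode_qp` is a hypothetical helper for illustration, not the unstructured API.

```python
import quopri


def decode_qp(text: str, fallback: str = "cp1252") -> str:
    """Decode quoted-printable text, trying UTF-8 first, then a fallback."""
    raw = quopri.decodestring(text.encode("utf-8"))
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes such as 0x92 are not valid UTF-8; try the legacy encoding.
        return raw.decode(fallback)


# Apostrophe encoded as UTF-8 (three bytes: E2 80 99)
utf8_sample = decode_qp("5 w=E2=80=99s")
# Same apostrophe encoded as Windows-1252 (single byte: 92)
cp1252_sample = decode_qp("5 w=92s")
```

Both samples decode to the same text. Note that cp1252 leaves a few byte values undefined, so this fallback can still raise on some inputs.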

sanasz91mdev, May 11 '23 11:05

I had the same issue. It would be great to create a PR with a solution.

kafkasl, May 25 '23 09:05

@sanasz91mdev this has been resolved in this Unstructured PR. This issue can be closed.

kafkasl, Jun 23 '23 09:06

Hi, @sanasz91mdev! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you encountered an error when trying to load .eml email files using the UnstructuredEmailLoader class. You mentioned that you found a workaround by modifying the code, but you're unsure if you should create a pull request to fix the issue. Another user, @kafkasl, suggested creating a PR and mentioned that the issue has already been resolved in a pull request in the Unstructured repository.

Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and cooperation. We appreciate your contribution to the LangChain community!

dosubot[bot], Sep 22 '23 16:09