langchain Unable to load .eml email files from DirectoryLoader with class UnstructuredEmailLoader: 'utf-8' codec can't decode byte 0x92 in position

System Info

LangChain version: 0.0.162
Platform: Windows
Python version: 3.11.3

I am trying to load all .eml files from a directory using DirectoryLoader with loader_cls=UnstructuredEmailLoader in order to build an index, but I get an error when calling load().

error: 'utf-8' codec can't decode byte 0x92 in position 141: invalid start byte

Code:

import os

from langchain.document_loaders import DirectoryLoader, UnstructuredEmailLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# NOTE: `embeddings` is defined elsewhere in my script (not shown here).

def load_data():
    if os.path.exists("./test"):
        # Load documents from data directory
        print('loading docs from directory ...')
        loader = DirectoryLoader('./test', loader_cls=UnstructuredEmailLoader)
        raw_documents = loader.load()  # the UnicodeDecodeError is raised here
        print('loaded docs')
        # Splitting documents into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
        )
        documents = text_splitter.split_documents(raw_documents)
        print(len(documents))
        # Printing each chunk for inspection
        for x in documents:
            print(x)
        # Creating index and saving it to disk
        print("Creating index")
        db = FAISS.from_documents(documents, embeddings)
        db.save_local("./index")
    else:
        raise Exception("No data or index found")

The following is the code I am using to read an email:

def get_email(data):
    # TODO: add try/except
    raw_email_string = data[0][1].decode('utf-8')
    # Getting the email 
    email_obj = Email(raw_email_string)
    return email_obj

The problem is that there are over 5000 emails, and the exception does not even say which file is causing the issue.

How can I fix this?

Who can help?

@hwchase17 @eyurtsev @vowelparrot

Information

[ ] The official example notebooks/scripts
[ ] My own modified scripts

Related Components

[ ] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[X] Document Loaders
[ ] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[X] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[ ] Async

Reproduction

Run the code above against a directory of .eml files to reproduce the issue.

Expected behavior

It should load all .eml email files within the provided directory.
In case of an exception, it should at least print the filename.
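
Until the loader reports filenames itself, the offending files can be located outside LangChain. The following is a minimal sketch, not part of LangChain or unstructured: it assumes the failure comes from a text part whose decoded bytes are not valid UTF-8, and the helper name find_undecodable_emails is hypothetical. It uses only the Python standard library to parse each .eml file and try the same UTF-8 decode.

```python
import email
from email import policy
from pathlib import Path


def find_undecodable_emails(directory, encoding="utf-8"):
    """Return (filename, error message) pairs for .eml files containing a
    text part that cannot be decoded with `encoding`.

    Hypothetical helper for diagnosis only; it approximates, but does not
    reproduce, what the loader does internally.
    """
    failures = []
    for path in sorted(Path(directory).glob("*.eml")):
        msg = email.message_from_bytes(path.read_bytes(), policy=policy.default)
        for part in msg.walk():
            if part.get_content_maintype() != "text":
                continue
            # decode=True undoes quoted-printable/base64 and returns raw bytes
            payload = part.get_payload(decode=True)
            if payload is None:
                continue
            try:
                payload.decode(encoding)
            except UnicodeDecodeError as exc:
                failures.append((path.name, str(exc)))
                break  # one report per file is enough
    return failures
```

Calling find_undecodable_emails("./test") and printing the result lists each failing file alongside the decode error, so the problematic emails can be inspected or moved aside before running DirectoryLoader.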

May 11 '23 08:05 sanasz91mdev

The problem is that some emails contain characters that can't be decoded as UTF-8. I changed the unstructured code in core.py:

def replace_mime_encodings(text: str, encoding: str = "utf-8") -> str:
    """Replaces MIME encodings with their equivalent characters in the specified encoding.

    Example
    -------
    5 w=E2=80-99s -> 5 w’s
    """
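    decoded = quopri.decodestring(text.encode())  # requires `import quopri`
    try:
        return decoded.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to Windows-1252, where byte 0x92 is a right single quote
        return decoded.decode("cp1252")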

And that worked. But should I ask Unstructured to create a PR to fix this? There will be many emails that do not pass the existing method.
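
For illustration, the fallback idea can be written as a standalone function. This is only a sketch under my own assumptions, not the fix that went into unstructured: the name decode_quoted_printable and the encodings parameter are hypothetical. It tries each encoding in order and, because cp1252 itself leaves a few bytes undefined (0x81, for example), ends with a lossy decode instead of raising.

```python
import quopri


def decode_quoted_printable(text, encodings=("utf-8", "cp1252")):
    """Decode quoted-printable text, trying each encoding in turn.

    Hypothetical helper; the encoding order is an assumption that suits
    mail written by Windows clients.
    """
    raw = quopri.decodestring(text.encode())
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded and mark the rest with U+FFFD
    return raw.decode(encodings[0], errors="replace")
```

With this, both "5 w=E2=80=99s" (UTF-8) and "5 w=92s" (Windows-1252) decode to the same text with a right single quotation mark. The trade-off is that a cp1252 fallback never fails on most byte sequences, so text in some other legacy encoding would be decoded silently into the wrong characters.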

May 11 '23 11:05 sanasz91mdev

I had the same issue. It would be great to create a PR with a solution.

May 25 '23 09:05 kafkasl

@sanasz91mdev this has been resolved in this unstructured PR. This issue can be closed.

Jun 23 '23 09:06 kafkasl

Hi, @sanasz91mdev! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you encountered an error when trying to load .eml email files using the UnstructuredEmailLoader class. You mentioned that you found a workaround by modifying the code, but you're unsure if you should create a pull request to fix the issue. Another user, @kafkasl, suggested creating a PR and mentioned that the issue has already been resolved in a pull request in the Unstructured repository.

Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and cooperation. We appreciate your contribution to the LangChain community!

Sep 22 '23 16:09 dosubot[bot]
