langchain
Unable to load .eml email files from DirectoryLoader with class UnstructuredEmailLoader: 'utf-8' codec can't decode byte 0x92 in position
System Info
LangChain version: 0.0.162; Platform: Windows; Python version: 3.11.3
I am trying to load all .eml files from a directory using DirectoryLoader with loader_cls=UnstructuredEmailLoader in order to build an index, but the load() call raises an error.
error: 'utf-8' codec can't decode byte 0x92 in position 141: invalid start byte
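For context, byte 0x92 is the right single quotation mark (’) in Windows-1252 (cp1252), and on its own it is not a valid start byte in UTF-8, which matches the message above. A minimal standard-library check, where the sample bytes are made up for illustration and are not taken from the failing emails:

```python
# Hypothetical sample: "it's" written with a Windows-1252 curly apostrophe (byte 0x92).
raw = b"it\x92s"

# Decoding as UTF-8 fails because 0x92 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# The same bytes decode cleanly when interpreted as Windows-1252.
as_cp1252 = raw.decode("cp1252")

print(utf8_error)
print(as_cp1252)
```

This suggests, but does not prove, that the affected emails were written in a Windows-1252 mail client rather than being corrupt.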
Code:
import os

from langchain.document_loaders import DirectoryLoader, UnstructuredEmailLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# Note: `embeddings` is assumed to be defined elsewhere in the script.

def load_data():
    if os.path.exists("./test"):
        # Load documents from the data directory
        print('loading docs from directory ...')
        loader = DirectoryLoader('./test', loader_cls=UnstructuredEmailLoader)
        raw_documents = loader.load()
        print('loaded docs')
        # Split documents into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
        )
        documents = text_splitter.split_documents(raw_documents)
        print(len(documents))
        # Print each chunk for inspection
        for x in documents:
            print(x)
        # Create the index and save it to disk
        print("Creating index")
        db = FAISS.from_documents(documents, embeddings)
        db.save_local("./index")
    else:
        raise Exception("No data or index found")
The following is the code I am using to read an email:
def get_email(data):
    # TODO: add try/except around the decode
    raw_email_string = data[0][1].decode('utf-8')
    # Getting the email
    email_obj = Email(raw_email_string)
    return email_obj
The problem is that there are over 5,000 emails, and the exception does not even say which file is causing the issue.
How can I fix this?
Who can help?
@hwchase17 @eyurtsev @vowelparrot
Information
- [ ] The official example notebooks/scripts
- [ ] My own modified scripts
Related Components
- [ ] LLMs/Chat Models
- [ ] Embedding Models
- [ ] Prompts / Prompt Templates / Prompt Selectors
- [ ] Output Parsers
- [X] Document Loaders
- [ ] Vector Stores / Retrievers
- [ ] Memory
- [ ] Agents / Agent Executors
- [X] Tools / Toolkits
- [ ] Chains
- [ ] Callbacks/Tracing
- [ ] Async
Reproduction
Run the code above against a directory of .eml files to reproduce the issue.
Expected behavior
- Should load all .eml files within the provided directory.
- In case of an exception, it should at least print the filename.
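One way to get the second behavior without changing the library is to load files one at a time and record which ones fail. The sketch below is an illustration only: `load_each` and its `load_one` parameter are hypothetical names, not part of LangChain, and the loader is passed in as a callable so the helper stays independent of any particular loader class.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def load_each(
    directory: str,
    load_one: Callable[[Path], list],
    pattern: str = "*.eml",
) -> Tuple[list, List[Tuple[Path, Exception]]]:
    """Load matching files one by one, collecting failures instead of aborting."""
    documents: list = []
    failures: List[Tuple[Path, Exception]] = []
    for path in sorted(Path(directory).rglob(pattern)):
        try:
            documents.extend(load_one(path))
        except Exception as exc:  # broad on purpose: report the file and continue
            print(f"failed to load {path}: {exc}")
            failures.append((path, exc))
    return documents, failures
```

With LangChain, `load_one` could be something like `lambda p: UnstructuredEmailLoader(str(p)).load()`, so that the returned `failures` list names every .eml file that raised, along with its exception.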
The problem is that some emails contain characters that can't be decoded as UTF-8. I changed the unstructured code in core.py:
import quopri

def replace_mime_encodings(text: str, encoding: str = "utf-8") -> str:
    """Replaces MIME encodings with their equivalent characters in the specified encoding.
    Example
    -------
    5 w=E2=80-99s -> 5 w’s
    """
    try:
        return quopri.decodestring(text.encode()).decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to Windows-1252 when the bytes are not valid UTF-8
        return quopri.decodestring(text.encode()).decode("cp1252")
And that worked. But should I ask Unstructured to create a PR to fix this? There will be many emails that fail in the existing method.
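A slightly more general version of this workaround, as a standalone sketch: `decode_quoted_printable` is a hypothetical helper name, not a function in unstructured, and the candidate encoding list is an assumption. It tries each encoding in turn and, if none fits, substitutes U+FFFD for undecodable bytes rather than raising.

```python
import quopri
from typing import Sequence


def decode_quoted_printable(
    text: str,
    encodings: Sequence[str] = ("utf-8", "cp1252"),
) -> str:
    """Decode quoted-printable text, trying each candidate encoding in order."""
    raw = quopri.decodestring(text.encode())
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding, replacing bad bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace")


print(decode_quoted_printable("5 w=E2=80=99s"))  # apostrophe encoded as UTF-8
print(decode_quoted_printable("5 w=92s"))        # apostrophe encoded as Windows-1252
```

Trying UTF-8 first matters: almost any byte sequence decodes under cp1252, so putting it first would silently mangle emails that really are UTF-8.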
I had the same issue. It would be great to create a PR with a solution.
@sanasz91mdev this has been resolved in a PR in the unstructured repository. This issue can be closed.
Hi, @sanasz91mdev! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, you encountered an error when trying to load .eml email files using the UnstructuredEmailLoader class. You mentioned that you found a workaround by modifying the code, but you're unsure if you should create a pull request to fix the issue. Another user, @kafkasl, suggested creating a PR and mentioned that the issue has already been resolved in a pull request in the Unstructured repository.
Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.
Thank you for your understanding and cooperation. We appreciate your contribution to the LangChain community!