Garbled Text When Using Outlook Files in *.msg Format (non-Unicode)

Open MinnyKuan opened this issue 5 months ago • 3 comments

When a message is saved from Outlook in the plain (*.msg) format rather than the Unicode (*.msg) format, the text loaded by UnstructuredFileLoader comes out as garbled characters.

Like this:

[Document(page_content='\u0a0d\u0a0d톰宥\ueab8\ue6ae\u0a0dꆬ쪰\uedb7\ue9a4ㄨ㌱约ꐲ㋫\ue9a4꘩쉢ꖾꙂ쎳꾺뫇Ꟗ꩑ꓷꖧ외끗ꗏ엾ꛩꑐꆯൃഊ\u200a\u0a0d\u0a0d\ue2a9謁\ue8a4ꆦ\u0a0d\u0a0d떤约亱쒱캥疡욼\ueca6人涱皡䆡䎨䢤튬뎦䶱疡\ue2a9謁잧릸皡䆡人涱ꆬ쪰亱\uf3a9箲\uf5b3\ue2a9墥튩뎦꒤謁잧릸䎡\u0a0d\u0a0d\u0a0d\u0a0d\ue2a9謁잧릸撬\u0a0d펭䢤䶱疡\ue2a9謁잧릸皡亱\uf3a9ꐱ㋫ꐹ⣩䂤ꤩ곳낡뫊꿴뚸\ua97d곱롤ꇟൃഊ\u200a\u0a0d\u0a0d꒤謁\ue2bb直\ueab8\ue6ae\u0a0d꒤謁傦꾤랶\uf3a9炤约嶩ꆬ쪰늵\uf4a7斫䆡첾뮥䢤퇃侧틃\uf3a9箲\uf5b3疡涱纫䦧ꮴ䊳皡뇃\ue2bb䆡侹즮\uf8b5傦\uf1a9\uf3b1䎡\u0a0d䂡낡붤墥傦꾤亱侫撯謁떶䆡璥澵熳뺪\ue2bb直䎡\u0a0d\u0a0d䂡낡瞩솴疸떹䢤ﮭꎤ\ue3a8ꆬ쪰\uedb7톤疡厯侧宥墽謁떶皡Ꞥ\ue2a9謁\ueab8\ue6ae䎡\u0a0d\u0a0dഠഊ갊낡곊뢢ෟ먊꧖띥ꩼ끁뇈뵍ㅵ㠸㠸\ue0c2ഴഊ', metadata={'source': 'C:\\Users\\10908306\\Desktop\\email格式\\Yearend_party_notUnicode.msg', 'file_directory': 'C:\\Users\\10908306\\Desktop\\email格式', 'filename': 'Yearend_party_notUnicode.msg', 'languages': ['kor'], 'sent_to': ['䡗⁑敗晬牡⽥䡗⽑楗瑳潲 (None)'], 'subject': '榡溽킳炤约嶩檡〲㌲夠慥\u2d72湥\u2064慐瑲⁹人涱ꆬ쪰\uecbf殪䎡', 'filetype': 'application/vnd.ms-outlook', 'category': 'Title'})]

However, when the same message is saved in the Outlook Unicode format, the content is read correctly. 😂 The following is my code:

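# Assumes the LangChain community package; adjust the import to your LangChain version.
from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader(
    file_path="mydocument_file_path",  # path to the .msg file
    mode="elements",
    content_source="text/html",
    strategy="fast",
)

# Parse the file into a list of Document objects, one per element.
document_elements = loader.load()
document_elements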

Is there a way to read .msg files that were saved in the non-Unicode format? 🤔
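
One observation on the output above: '\u0a0d' is what the two bytes 0D 0A (a CRLF line ending) look like when they are read as a single UTF-16-LE code unit. That suggests the non-Unicode file stores its text in a legacy code page and the bytes are being decoded as UTF-16-LE. The following is a minimal sketch of reversing that on an already garbled string, not a fix inside unstructured itself. The helper name `repair_utf16_mojibake` is hypothetical, and the choice of `cp950` (Traditional Chinese) is an assumption about this particular file.

```python
def repair_utf16_mojibake(garbled: str, codepage: str = "cp950") -> str:
    """Undo text that was stored in a legacy code page but decoded as UTF-16-LE.

    `codepage` is an assumption: cp950 is the Windows Traditional Chinese
    code page; other files may need cp936, cp932, cp1252, and so on.
    """
    # Recover the raw bytes; surrogatepass keeps any lone surrogates intact.
    raw = garbled.encode("utf-16-le", errors="surrogatepass")
    # Re-decode the bytes with the assumed legacy code page.
    return raw.decode(codepage, errors="replace")


# Short fragment copied from the page_content shown above.
fragment = "\u0a0dㄨ㌱约ꐲ㋫\ue9a4"
print(repr(repair_utf16_mojibake(fragment)))
```

If the repaired text is readable, the same helper could be applied to each `page_content` and to the `subject` metadata as a stopgap. Text with an odd number of bytes may lose its final byte during the original mis-decoding, so the result should be treated as best effort.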


unstructured: 0.12.2
python: 3.11.4

MinnyKuan, Jan 31 '24 08:01