bug/Wrongly detected fileType for exported documents
Describe the bug
I have a document exported from Confluence that downloads as a .doc file. Partitioning this file fails because the file type cannot be detected. This occurs when the file contents are passed as an in-memory byte stream (similar to how the unstructured Python client SDK uploads files), not when the opened file is passed directly.
Partitioning fails with the message "unstructured.partition.common.UnsupportedFileFormatError: Partitioning is not supported for the FileType.UNK file type." when using unstructured directly, and with "expected str, bytes or os.PathLike object, not int" when using the client SDK.
To Reproduce
from io import BytesIO
from unstructured.partition.auto import partition

with open("Test.doc", "rb") as f:
    # Not sending the open file directly, but wrapping its bytes in BytesIO,
    # similar to how the client SDK sends the stream as an uploaded file.
    elements = partition(file=BytesIO(f.read()))
Expected behavior
The extension should be detected as .doc and partitioning should return the document's elements.
Screenshots
NA
Environment Info
Python version: 3.10.15
unstructured version: 0.17.5
unstructured-inference version: 0.8.10
Additional context
The .doc format exported by Confluence does not contain magic bytes, so OLE file detection fails and every other detection step fails as well. When detection reaches the final filename-based check, it fails because the file name returned is random and is sometimes an integer.
Since this is the last attempt to identify the extension, could the metadata file name be used before this check? This PR proposes a potential fix: https://github.com/Unstructured-IO/unstructured/pull/3786
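The fallback suggested here can be sketched independently of unstructured. This is a minimal illustration, not the library's actual code: `guess_extension`, its parameters, and `IntNamedStream` are hypothetical names. It assumes the last detection step derives the extension from the file object's `name`, which breaks when that attribute is an integer, and shows a caller-supplied metadata file name being tried first.

```python
import os
from io import BytesIO
from typing import Any, Optional


def guess_extension(file: Any, metadata_filename: Optional[str] = None) -> str:
    """Hypothetical helper: return a lowercase extension like ".doc", or ""."""
    # Try the caller-supplied metadata file name first, then the stream's own name.
    for name in (metadata_filename, getattr(file, "name", None)):
        # Some file objects expose an integer as `.name`; os.path.splitext()
        # raises TypeError for that, so only string names are considered.
        if isinstance(name, str):
            ext = os.path.splitext(name)[1].lower()
            if ext:
                return ext
    return ""


class IntNamedStream(BytesIO):
    """Stand-in for a stream whose `name` is an integer (assumed scenario)."""

    name = 7


stream = IntNamedStream(b"example bytes")
print(guess_extension(stream, metadata_filename="Test.doc"))  # .doc
print(repr(guess_extension(stream)))  # ''
```

With this ordering, an integer `name` no longer raises; it simply yields no extension unless a metadata file name is available.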
Summary
This patch enables detection of Confluence-exported legacy .doc files by inspecting the OLE file structure and searching for a "Confluence" marker. If the marker is found, the file type is set to FileType.CONFLUENCE_DOC. These files are routed to the DOC partitioner for now.
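As a rough illustration of the marker check described above, the sketch below scans the first 4096 bytes of a stream for the "Confluence" marker and, separately, tests for the OLE compound-file signature. `has_ole_signature`, `has_confluence_marker`, and `_peek` are hypothetical helpers, not part of unstructured, and the sample bytes are made up for the demonstration rather than taken from a real export.

```python
from io import BytesIO
from typing import BinaryIO

# Signature at the start of every OLE compound file (legacy .doc/.xls/.ppt).
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def _peek(file: BinaryIO, size: int) -> bytes:
    """Read up to `size` bytes from the start and restore the stream position."""
    position = file.tell()
    file.seek(0)
    head = file.read(size)
    file.seek(position)
    return head


def has_ole_signature(file: BinaryIO) -> bool:
    return _peek(file, len(OLE_SIGNATURE)) == OLE_SIGNATURE


def has_confluence_marker(file: BinaryIO, limit: int = 4096) -> bool:
    return b"Confluence" in _peek(file, limit)


# Made-up stand-in for an export that lacks the OLE signature.
sample = BytesIO(b"Exported from Confluence\r\n<html><body>Report</body></html>")
print(has_ole_signature(sample))      # False
print(has_confluence_marker(sample))  # True
```

Because the marker check only needs raw bytes, it can run even when the OLE signature is absent, which is the situation the issue describes.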
In filetype.py add
@property
def is_confluence_doc(self) -> bool:
    """Detects Confluence-exported legacy .doc files by extension, mime type, and marker."""
    if self._ctx.extension == ".doc" and self._ctx.mime_type == "application/msword":
        try:
            with self._ctx.open() as f:
                head = f.read(4096)
                if b"Confluence" in head:
                    return True
        except Exception:
            pass
    return False

@lazyproperty
def _file_type(self) -> FileType:
    # ... (existing detection logic)
    if self.is_confluence_doc:
        return FileType.CONFLUENCE_DOC
    file_type = FileType.from_mime_type(self.mime_type)
    return file_type if file_type != FileType.UNK else None

@lazyproperty
def _ole_file_type(self) -> FileType | None:
    with self._ctx.open() as f:
        ole = OleFileIO(f)
        streams = ole.listdir(streams=True)
        stream_names = {"/".join(s) for s in streams}
        if "WordDocument" in stream_names:
            # Check for Confluence marker in first 4096 bytes
            with self._ctx.open() as f2:
                head = f2.read(4096)
                if b"Confluence" in head:
                    return FileType.CONFLUENCE_DOC
            return FileType.DOC
        elif "PowerPoint Document" in stream_names:
            return FileType.PPT
        elif "Workbook" in stream_names:
            return FileType.XLS
        elif "__properties_version1.0" in stream_names:
            return FileType.MSG
    return None
In model.py add:
CONFLUENCE_DOC = (
    "confluence_doc",
    "confluence_doc",
    cast(list[str], []),
    None,
    [".doc"],
    "application/msword",
    cast(list[str], []),
)
In test_unstructured/file_utils/test_filetype.py add:
def test_it_detects_confluence_doc_filetype():
    confluence_doc_path = example_doc_path("confluence/confluence_report.doc")
    file_type = detect_filetype(confluence_doc_path)
    assert file_type == FileType.CONFLUENCE_DOC
Note
This PR has not been pushed because some tests fail due to the missing oxmsg dependency, which is not available on PyPI and is likely deprecated. If your project has migrated to extract-msg or another supported package, consider updating or skipping these tests accordingly. A sample confluence_report.doc is also still needed for the test above.