bug/Wrongly detected fileType for exported documents

Open srisudarsan opened this issue 10 months ago • 1 comment

Describe the bug

I have a document exported from Confluence, which is downloaded as a .doc file. Partitioning this file fails because the file extension cannot be detected. This occurs when the file content is sent as bytes wrapped in a new stream (similar to how the unstructured Python client SDK does this), not when the opened file stream is passed directly.

Partitioning fails with the message "unstructured.partition.common.UnsupportedFileFormatError: Partitioning is not supported for the FileType.UNK file type." when using unstructured directly, and with "expected str, bytes or os.PathLike object, not int" when using the client SDK.

To Reproduce

from io import BytesIO
from unstructured.partition.auto import partition

with open("Test.doc", "rb") as f:
    # Not directly sending the stream but sending it as wrapped bytes,
    # like how the client SDK sends the stream as an uploaded file
    elements = partition(file=BytesIO(f.read()))

Expected behavior

The extension should be detected as .doc and partition() should return the partitioned elements.

Screenshots

NA

Environment Info

Python version: 3.10.15
unstructured version: 0.17.5
unstructured-inference version: 0.8.10

Additional context

The .doc format exported by Confluence does not contain magic bytes, so OLE file detection fails and all other detection steps fail as well. When reaching line , it fails because the file name returned is random and is sometimes an integer.
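One way to check the magic-bytes observation on a given export is to compare the start of the stream with the OLE2 compound-file signature. The following is a minimal, stdlib-only sketch; `has_ole_signature` is a hypothetical helper name used for illustration and is not part of unstructured.

```python
from io import BytesIO

# Eight-byte signature at the start of every OLE2 compound file
# (the container format of legacy .doc/.xls/.ppt/.msg files).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def has_ole_signature(stream) -> bool:
    """Return True if the seekable binary `stream` starts with the OLE2 signature.

    Hypothetical helper for diagnosis only. The stream position is restored
    afterwards so the same stream can still be handed to a partitioner.
    """
    pos = stream.tell()
    try:
        stream.seek(0)
        return stream.read(len(OLE_MAGIC)) == OLE_MAGIC
    finally:
        stream.seek(pos)
```

Calling `has_ole_signature(BytesIO(data))` on the bytes of the exported file should indicate whether signature-based OLE detection has anything to work with.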

Since this is the last attempt to identify the extension, can we use the metadata file name before doing this check? This PR proposes a potential fix: https://github.com/Unstructured-IO/unstructured/pull/3786
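The idea of consulting a file name hint before the final check could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the linked PR: `extension_from_name_hint` and its `metadata_filename` parameter are hypothetical names. The `isinstance` guard matters because some file objects expose an integer file descriptor as `name`, and `os.path.splitext()` raises a `TypeError` on an integer.

```python
import os
from io import BytesIO
from typing import IO, Optional


def extension_from_name_hint(
    file: IO[bytes], metadata_filename: Optional[str] = None
) -> Optional[str]:
    """Return a lowercase extension such as ".doc", or None if no usable name exists.

    Hypothetical helper: prefers the caller-supplied name, then falls back to
    the stream's own `name` attribute when that attribute is a string.
    """
    for candidate in (metadata_filename, getattr(file, "name", None)):
        # Skip missing names and integer file descriptors; only a string
        # can be passed to os.path.splitext() safely.
        if isinstance(candidate, str):
            ext = os.path.splitext(candidate)[1].lower()
            if ext:
                return ext
    return None
```

With a bare `BytesIO` there is no `name` attribute at all, so the helper returns `None` unless the caller passes a name explicitly, for example `extension_from_name_hint(BytesIO(data), "Test.doc")`.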

srisudarsan, Apr 06 '25 16:04

Summary

This patch enables detection of Confluence-exported legacy .doc files by inspecting the OLE file structure and searching for a "Confluence" marker. If the marker is found, the file type is set to FileType.CONFLUENCE_DOC. These files are routed to the DOC partitioner for now.

In filetype.py add:

@property
def is_confluence_doc(self) -> bool:
    """Detects Confluence-exported legacy .doc files by extension, mime type, and marker."""
    if self._ctx.extension == ".doc" and self._ctx.mime_type == "application/msword":
        try:
            with self._ctx.open() as f:
                head = f.read(4096)
                if b"Confluence" in head:
                    return True
        except Exception:
            pass
    return False


@lazyproperty
def _file_type(self) -> FileType | None:
    # ... (existing detection logic)
    if self.is_confluence_doc:
        return FileType.CONFLUENCE_DOC
    file_type = FileType.from_mime_type(self.mime_type)
    return file_type if file_type != FileType.UNK else None

@lazyproperty
def _ole_file_type(self) -> FileType | None:
    with self._ctx.open() as f:
        ole = OleFileIO(f)
        streams = ole.listdir(streams=True)
        stream_names = {"/".join(s) for s in streams}

    if "WordDocument" in stream_names:
        # Check for Confluence marker in first 4096 bytes
        with self._ctx.open() as f2:
            head = f2.read(4096)
            if b"Confluence" in head:
                return FileType.CONFLUENCE_DOC
        return FileType.DOC
    elif "PowerPoint Document" in stream_names:
        return FileType.PPT
    elif "Workbook" in stream_names:
        return FileType.XLS
    elif "__properties_version1.0" in stream_names:
        return FileType.MSG
    return None

In model.py add:


CONFLUENCE_DOC = (
    "confluence_doc",
    "confluence_doc",
    cast(list[str], []),
    None,
    [".doc"],
    "application/msword",
    cast(list[str], []),
)

In test_unstructured/file_utils/test_filetype.py add:

def test_it_detects_confluence_doc_filetype():
    confluence_doc_path = example_doc_path("confluence/confluence_report.doc")
    file_type = detect_filetype(confluence_doc_path)
    assert file_type == FileType.CONFLUENCE_DOC

Note

This PR has not been pushed because some tests fail due to the missing oxmsg dependency, which is not available on PyPI and is likely deprecated. If your project has migrated to extract-msg or another supported package, consider updating or skipping these tests accordingly. A sample confluence_report.doc is also still needed for the new test.

RoughneckCoder, Aug 05 '25 14:08