fix(odt): fix disk-space leak in partition_odt()

Open scanny opened this issue 1 year ago • 0 comments

Remedy disk-space leak where partition_odt() would leave an on-disk copy of each .odt file passed as a file-like object.

partition_odt() creates a temporary file to which it writes each source document provided as a file-like object. This file is never deleted, so disk consumption grows without bound.

The convert_and_partition_docx() function used to convert ODT->DOCX uses pandoc (a command-line program) to do the conversion. Because this command-line program operates in a different memory space, the source file cannot be passed as an in-memory object and needs to be on the filesystem. When the ODT source-document is passed as a file-like object, it is written to disk so the conversion program has access to it. It is not deleted afterward.

Fix this by writing the temporary source ODT file in a TemporaryDirectory and writing the conversion-target DOCX file to that same location. The directory and its contents are automatically removed when partition_odt() completes.
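For reviewers, here is a minimal sketch of the pattern described above. It is not the code in this PR: the names convert_odt_stream(), _run_pandoc(), and the injectable convert parameter are hypothetical and exist only to illustrate the idea, and the pandoc invocation assumes pandoc is installed and on PATH.

```python
import os
import shutil
import subprocess
import tempfile
from typing import IO, Callable


def _run_pandoc(source_path: str, target_path: str) -> None:
    """Convert an ODT file to DOCX with the pandoc CLI (hypothetical helper)."""
    subprocess.run(
        ["pandoc", source_path, "--from", "odt", "--to", "docx", "--output", target_path],
        check=True,  # raise CalledProcessError if pandoc exits non-zero
    )


def convert_odt_stream(
    file: IO[bytes],
    convert: Callable[[str, str], None] = _run_pandoc,
) -> bytes:
    """Return DOCX bytes for an ODT file-like object, leaving nothing on disk."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        source_path = os.path.join(tmp_dir, "source.odt")
        target_path = os.path.join(tmp_dir, "source.docx")

        # The converter runs as a separate process, so it needs a real file.
        with open(source_path, "wb") as f:
            shutil.copyfileobj(file, f)

        convert(source_path, target_path)

        # Read the result before the directory is removed on exit.
        with open(target_path, "rb") as f:
            return f.read()
    # tmp_dir, source.odt, and source.docx are all gone at this point.
```

Because both the source and the target live in the same temporary directory, a single context manager covers cleanup of every file, including when the conversion raises.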

While we're in there, improve the factoring of partition_odt().

  • Move convert_and_partition_docx() out of partition.docx, where its only caller was partition_odt(), and into partition.odt as _convert_odt_to_docx(). Decouple file conversion from the partition_docx() call on the converted file, since making that call is naturally partition_odt()'s responsibility.
  • Improve docstrings, typing, and comments.
  • All tests pass both before and after.

scanny, May 16 '24 16:05