docx: prefer document.core_properties.modified to filesystem last-modified

Open scanny opened this issue 2 years ago • 3 comments

The docx format includes Dublin Core metadata in its core.xml "part". This metadata reliably includes a modified timestamp in ISO 8601 form, e.g. 2023-09-14T04:12:00Z. Because this timestamp is contained in the document, it survives file-copy and other operations that can change the filesystem timestamp.

This makes it an inherently more reliable source of the document "last-modified" date.
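For illustration, here is a minimal standard-library sketch of reading that timestamp straight from the package. The function name `read_core_modified` is hypothetical, and the part path `docProps/core.xml` is the conventional location rather than a guaranteed one (strictly, the package relationships decide where the core-properties part lives). The issue title suggests python-docx's `document.core_properties.modified` as the more likely route in practice.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import Optional

DCTERMS_NS = "http://purl.org/dc/terms/"
# Assumption: the core-properties part sits at its conventional path.
CORE_PART = "docProps/core.xml"


def read_core_modified(docx_file) -> Optional[datetime]:
    """Return the dcterms:modified timestamp from a .docx, or None if absent.

    `docx_file` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        try:
            xml_bytes = zf.read(CORE_PART)
        except KeyError:
            # No core-properties part at the conventional path.
            return None

    root = ET.fromstring(xml_bytes)
    elem = root.find(f"{{{DCTERMS_NS}}}modified")
    if elem is None or not elem.text:
        return None

    text = elem.text.strip()
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        # Unparseable value: treat it as missing rather than failing.
        return None
```

Because the value is read from inside the zip archive, copying or re-downloading the file does not change the result.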

Proposed: Recover this date from .docx documents and use it in preference to the filesystem date.

At present there is a two-level preference filter, with None as the fallback:

  1. If a metadata_last_modified value is received, the present document was converted from some other, non-docx format, and the source-file partitioner determined the last-modified date before initiating conversion. Use this value when present.
  2. Use the filesystem timestamp of the current file if available.
  3. Otherwise use None

This proposal is to insert a new step between steps 1 and 2 that reads last-modified from the Dublin Core metadata in the .docx document. This timestamp will not reliably survive the conversion process, so metadata_last_modified is still a better source for documents converted from other formats.
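The proposed ordering could be sketched roughly as follows. This is not the library's actual code: `choose_last_modified` and its parameters are hypothetical names, `core_modified` stands for whatever value is recovered from the document's Dublin Core metadata, and returning an ISO 8601 string is an assumption about the desired output form.

```python
import os
from datetime import datetime, timezone
from typing import Optional


def choose_last_modified(
    metadata_last_modified: Optional[str],
    core_modified: Optional[datetime],
    file_path: Optional[str],
) -> Optional[str]:
    """Pick a last-modified value using the proposed preference order."""
    # 1. Value passed in by the source-file partitioner (converted documents).
    if metadata_last_modified:
        return metadata_last_modified

    # 2. (proposed) Timestamp stored in the document's core properties.
    if core_modified is not None:
        return core_modified.isoformat()

    # 3. Filesystem timestamp of the current file, if there is one.
    if file_path and os.path.isfile(file_path):
        mtime = os.path.getmtime(file_path)
        return datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()

    # 4. Nothing available.
    return None
```

Keeping the caller-supplied value first preserves the existing behavior for converted documents, where the embedded timestamp may reflect the conversion rather than the original edit.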

scanny avatar Sep 19 '23 00:09 scanny

@scanny - Is this one still relevant?

MthwRobinson avatar May 13 '24 13:05 MthwRobinson

@MthwRobinson I'd say it's a product question. I expect doing this would improve the last-modified date in the metadata, and we haven't worked on or completed this ticket yet. However, I haven't seen any complaints about last-modified not being good enough as it is, so maybe this is one to let go of and reconsider later when someone actually asks for better .metadata.last_modified.

scanny avatar May 13 '24 17:05 scanny

Yeah let's keep this on the backlog. I think this is a good one to do if we have bandwidth. Thanks!

MthwRobinson avatar May 13 '24 18:05 MthwRobinson