docx: prefer document.core_properties.modified to filesystem last-modified
The docx format includes Dublin Core metadata in its core.xml "part". This metadata reliably includes a modified timestamp in ISO 8601 form, e.g. 2023-09-14T04:12:00Z. Because this timestamp is contained in the document, it survives file-copy and other operations that can change the filesystem timestamp.
This makes it an inherently more reliable source of the document "last-modified" date.
Proposed: Recover this date from .docx documents and use it in preference to the filesystem date.
At present there is a two-level preference filter:
- If a metadata_last_modified value is received, it means the present document was converted from some other, non-docx format and the last-modified date is determined by the source-file partitioner before initiating conversion. Use this when present.
- Use the filesystem timestamp of the current file if available.
- Otherwise use None.
This proposal is to insert a new step between steps 1 and 2 that reads last-modified from the Dublin Core metadata in the .docx document. This timestamp will not reliably survive the conversion process, so metadata_last_modified is still a better source for documents converted from other formats.
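To make the proposed ordering concrete, here is a minimal sketch using only the Python standard library. The function names (docx_core_modified, filesystem_modified, pick_last_modified) are hypothetical and not part of any existing partitioner code. It assumes the core properties part lives at the conventional docProps/core.xml path inside the .docx zip, and it returns ISO 8601 strings as an illustrative choice. With python-docx the same value should be available as document.core_properties.modified.

```python
import os
import xml.etree.ElementTree as ET
import zipfile
from datetime import datetime, timezone
from typing import Optional

# Dublin Core "modified" element, in ElementTree's {namespace}tag notation.
DCTERMS_MODIFIED = "{http://purl.org/dc/terms/}modified"


def docx_core_modified(path: str) -> Optional[str]:
    """Return the Dublin Core modified timestamp as ISO 8601, or None."""
    try:
        with zipfile.ZipFile(path) as zf:
            # Assumption: core properties are at the conventional location.
            root = ET.fromstring(zf.read("docProps/core.xml"))
    except (OSError, KeyError, zipfile.BadZipFile, ET.ParseError):
        return None
    text = (root.findtext(DCTERMS_MODIFIED) or "").strip()
    if not text:
        return None
    try:
        # Older Pythons reject a trailing "Z" in fromisoformat().
        parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    except ValueError:
        return None
    return parsed.isoformat()


def filesystem_modified(path: str) -> Optional[str]:
    """Return the filesystem mtime as an ISO 8601 UTC string, or None."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return None
    return datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()


def pick_last_modified(
    metadata_last_modified: Optional[str], path: Optional[str]
) -> Optional[str]:
    """Apply the proposed preference order and return the first hit."""
    # 1. Value supplied by the source-file partitioner before conversion.
    if metadata_last_modified is not None:
        return metadata_last_modified
    if path is None:
        return None
    # 2. (proposed) Dublin Core timestamp stored inside the document.
    # 3. Filesystem timestamp; None if that is unavailable too.
    return docx_core_modified(path) or filesystem_modified(path)
```

Each reader returns None on any failure, so the chain degrades to the existing behavior for files that lack a core properties part.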
@scanny - Is this one still relevant?
@MthwRobinson I'd say it's a product question. I expect doing this would improve the last-modified date in the metadata, and we haven't worked on or completed this ticket yet. However, I haven't seen any complaints about last-modified not being good enough as it is, so maybe this is one to let go of and reconsider later when someone actually asks for better .metadata.last_modified.
Yeah let's keep this on the backlog. I think this is a good one to do if we have bandwidth. Thanks!