
docx metadata generation strategy

Open Yue-Rain opened this issue 1 year ago • 1 comments

Platform: Linux
Python version: 3.10
unstructured version: 0.14.9

'metadata': {
    'category_depth': 0,
    'file_directory': 'path',
    'filename': 'file name',
    'last_modified': '2024-07-06T17:25:17',
    'languages': ['eng'],
    'parent_id': 'ae30a862210d112b2a17a108813c8394',
    'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
}

As I understand it, parent_id is generated from the Word outline level. For example, if a document has five chapters and the first chapter has three subchapters, those three subchapters should all share the same parent_id, namely the element_id of the first chapter. However, the results of my experiment seem to differ from this. How does the algorithm work? Thank you!

Yue-Rain, Jul 07 '24 03:07

parent_id is set in this function, which is applied to the element stream after partitioning: https://github.com/Unstructured-IO/unstructured/blob/1ce01c3254804530c9e82be7c82f3458dd5fca85/unstructured/partition/common.py#L233-L278
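
For readers who want a feel for how a post-partitioning pass can assign parents, here is a minimal stack-based sketch. It is not a copy of the linked function, which remains the authority. The `Element` class, the `CHILD_CATEGORIES` rule set, and `assign_parent_ids` are hypothetical names made up for illustration. The sketch assumes that an element becomes a child of the nearest preceding element that either has the same category at a shallower `category_depth`, or has a category that is allowed to contain it.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical rule set: categories that may be nested under a given parent category.
CHILD_CATEGORIES = {
    "Title": {"NarrativeText", "ListItem", "Table"},
}


@dataclass
class Element:
    # Simplified stand-in for a partitioned element (not the library's class).
    element_id: str
    category: str
    category_depth: int = 0
    parent_id: Optional[str] = None


def assign_parent_ids(elements: List[Element]) -> List[Element]:
    """Walk the element stream once, keeping a stack of candidate parents."""
    stack: List[Element] = []
    for el in elements:
        while stack:
            top = stack[-1]
            # Same category: only a shallower element can be the parent.
            same_kind_shallower = (
                top.category == el.category
                and top.category_depth < el.category_depth
            )
            # Different category: consult the (assumed) rule set.
            allowed_child = (
                top.category != el.category
                and el.category in CHILD_CATEGORIES.get(top.category, set())
            )
            if same_kind_shallower or allowed_child:
                el.parent_id = top.element_id
                break
            # The top of the stack cannot be a parent of this element; discard it.
            stack.pop()
        stack.append(el)
    return elements


# One chapter with three subchapters, followed by a second chapter.
doc = assign_parent_ids([
    Element("ch1", "Title", 0),
    Element("s1", "Title", 1),
    Element("p1", "NarrativeText"),
    Element("s2", "Title", 1),
    Element("s3", "Title", 1),
    Element("ch2", "Title", 0),
])
for el in doc:
    print(el.element_id, el.parent_id)
```

Under these assumptions the three subchapters share the first chapter as their parent only when their `category_depth` is greater than the chapter's. If every heading is reported with `category_depth` 0, as in the metadata above, a scheme like this cannot nest one heading under another, which may be worth checking when the results look unexpected.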

scanny, Jul 07 '24 18:07

Closing as inactive, assumed resolved.

scanny, Dec 18 '24 07:12