DOCX metadata generation strategy
Platform: Linux, Python version: 3.10, unstructured version: 0.14.9
'metadata': { 'category_depth': 0, 'file_directory': 'path', 'filename': 'file name', 'last_modified': '2024-07-06T17:25:17', 'languages': ['eng'], 'parent_id': 'ae30a862210d112b2a17a108813c8394', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document' }
My understanding is that parent_id is derived from the Word outline level. For example, if a document has five chapters and the first chapter has three subchapters, those three subchapters should share the same parent_id, namely the element_id of the first chapter. However, the results of my experiment differ from this. How does the algorithm actually work? Thank you!
parent_id is set by this function, which is applied to the element stream after partitioning: https://github.com/Unstructured-IO/unstructured/blob/1ce01c3254804530c9e82be7c82f3458dd5fca85/unstructured/partition/common.py#L233-L278
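For readers who want a feel for how a post-partitioning pass like this can assign parents, here is a rough, self-contained sketch of a stack-based approach. It is not the library's code: the function name assign_parent_ids, the CHILD_RULES table, and the dict-based elements with "id", "category" and "depth" keys are all hypothetical stand-ins, and the linked source remains the authority on the real rules. The idea is that each element looks back for the nearest earlier element that is either the same category at a shallower depth, or a category allowed to contain it.

```python
from typing import Optional

# Hypothetical rule table (an assumption for this sketch): which categories
# may be children of which parent category.
CHILD_RULES = {"Title": {"NarrativeText", "ListItem", "Table"}}


def assign_parent_ids(elements: list[dict]) -> list[dict]:
    """Assign a parent_id to each element in document order using a stack."""
    stack: list[dict] = []
    for el in elements:
        parent_id: Optional[str] = None
        while stack:
            top = stack[-1]
            # Same category, but the candidate parent is shallower (e.g. Heading 1 over Heading 2).
            same_cat_shallower = (
                top["category"] == el["category"] and top["depth"] < el["depth"]
            )
            # Different category, and the rule table allows this nesting.
            allowed_child = (
                top["category"] != el["category"]
                and el["category"] in CHILD_RULES.get(top["category"], set())
            )
            if same_cat_shallower or allowed_child:
                parent_id = top["id"]
                break
            stack.pop()  # this candidate cannot be a parent; look further back
        el["parent_id"] = parent_id
        stack.append(el)
    return elements


# Illustrative input only: two chapters, two subchapters, one paragraph.
doc = [
    {"id": "ch1", "category": "Title", "depth": 0},
    {"id": "s1", "category": "Title", "depth": 1},
    {"id": "p1", "category": "NarrativeText", "depth": 0},
    {"id": "s2", "category": "Title", "depth": 1},
    {"id": "ch2", "category": "Title", "depth": 0},
]
result = {el["id"]: el["parent_id"] for el in assign_parent_ids(doc)}
```

Under these assumptions, both subchapters get the first chapter's id as their parent, while the paragraph's parent is the nearest preceding heading (the subchapter) rather than the chapter. If headings all end up with the same depth, as with category_depth 0 in the metadata above, a scheme like this would give them no heading parent at all, which is one way results can differ from the outline-level expectation.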
Closing as inactive, assumed resolved.