bug(json): partition_json() does not preserve original element_id or metadata

Open scanny opened this issue 1 year ago • 0 comments

Summary

The contract of partition_json() is to "rehydrate" elements that were serialized to a JSON array of element objects. However, it changes element_id and certain metadata fields (file_directory, filename, filetype, and last_modified in the example below) from their original values.

To Reproduce

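from test_unstructured.unit_utils import example_doc_path  # test helper in the unstructured repo
from unstructured.partition.json import partition_json
from unstructured.staging.base import elements_from_json, elements_to_dicts

file_path = example_doc_path("simple.json")
original_elements = elements_from_json(file_path)
partitioned_elements = partition_json(file_path)

# Fails: element_id and several metadata fields differ between the two.
assert elements_to_dicts(partitioned_elements) == elements_to_dicts(original_elements)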

produces:

    [
        {
  -         'element_id': 'a06d2d9e65212d4aa955c3ab32950ffa',
  +         'element_id': 'dbc05298f7937a62027af643bd1c3c87',
            'metadata': {
                'category_depth': 0,
  -             'file_directory': 'unstructured/example-docs',
  +             'file_directory': '/Users/scanny/src/unstructured/example-docs',
  -             'filename': 'simple.docx',
  ?                                 ^ ^^
  +             'filename': 'simple.json',
  ?                                 ^^ ^
  -             'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  +             'filetype': 'application/json',
                'languages': [
                    'eng',
                ],
  -             'last_modified': '2024-07-06T16:44:51',
  ?                                        ^ ^  ^^^^^
  +             'last_modified': '2024-07-08T23:06:02',
  ?                                        ^ ^^^^  ^^
            },
            'text': 'These are a few of my favorite things:',
            'type': 'Title',
        },
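
To list just the fields that differ, without reading the full assertion diff, a small helper along these lines could be used. It is only a sketch: it works on plain dicts shaped like the output above, uses nothing but the standard library, and `changed_fields` is a hypothetical name, not part of unstructured. The sample values are copied from the diff, with the metadata abbreviated.

```python
from typing import Any


def changed_fields(
    original: dict[str, Any], rehydrated: dict[str, Any], prefix: str = ""
) -> list[str]:
    """Return dotted paths of keys whose values differ between two element dicts.

    A key that is missing on one side is treated the same as a key set to None.
    """
    paths: list[str] = []
    for key in sorted(set(original) | set(rehydrated)):
        path = f"{prefix}{key}"
        a, b = original.get(key), rehydrated.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse into nested dicts such as "metadata".
            paths.extend(changed_fields(a, b, prefix=f"{path}."))
        elif a != b:
            paths.append(path)
    return paths


# First element of the diff above, with metadata abbreviated.
original = {
    "element_id": "a06d2d9e65212d4aa955c3ab32950ffa",
    "metadata": {"category_depth": 0, "filename": "simple.docx", "languages": ["eng"]},
    "text": "These are a few of my favorite things:",
    "type": "Title",
}
rehydrated = {
    "element_id": "dbc05298f7937a62027af643bd1c3c87",
    "metadata": {"category_depth": 0, "filename": "simple.json", "languages": ["eng"]},
    "text": "These are a few of my favorite things:",
    "type": "Title",
}

print(changed_fields(original, rehydrated))
# → ['element_id', 'metadata.filename']
```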

Expected behavior

Because partition_json() is the mechanism for step-wise processing of documents using the API (as elements_from_json() is when using the open-source unstructured library directly), identifiers and metadata should be unchanged from their serialized state. Note that step-wise processing includes chunking as a separate step, perhaps after filtering elements from the original payload or enhancing metadata.
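
The expected invariant can be sketched with the standard library alone. In this illustration plain dicts stand in for unstructured's element objects, and `filter_by_type` and `rehydrate` are hypothetical names, not library functions. The first element uses values from the diff above; the second is made up for the example. The point is only that an intermediate step (here, filtering) followed by rehydration leaves every surviving element exactly as it was serialized.

```python
import json
from typing import Any

# Serialized payload, as it might arrive from an earlier processing step.
payload = json.dumps([
    {
        "element_id": "a06d2d9e65212d4aa955c3ab32950ffa",
        "metadata": {"filename": "simple.docx", "languages": ["eng"]},
        "text": "These are a few of my favorite things:",
        "type": "Title",
    },
    {
        "element_id": "hypothetical-id-0002",  # made up for illustration
        "metadata": {"filename": "simple.docx", "languages": ["eng"]},
        "text": "A made-up list item.",
        "type": "ListItem",
    },
])


def filter_by_type(serialized: str, keep: set[str]) -> str:
    """Intermediate step: drop unwanted elements and re-serialize the rest."""
    elements: list[dict[str, Any]] = json.loads(serialized)
    return json.dumps([e for e in elements if e["type"] in keep])


def rehydrate(serialized: str) -> list[dict[str, Any]]:
    """Deserialize only; no ids are recomputed and no metadata is rewritten."""
    return json.loads(serialized)


before = {e["element_id"]: e for e in rehydrate(payload)}
after = rehydrate(filter_by_type(payload, keep={"Title"}))

# Every surviving element is identical to its serialized form.
assert all(e == before[e["element_id"]] for e in after)
print([e["element_id"] for e in after])
```

A later chunking step could then rely on element_id and metadata values matching those produced by the original partitioning run.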

Additional context

Much of the problem appears to be caused by double post-processing. partition_json() uses the same @process_metadata() and @add_metadata_with_filetype() decorators that other partitioners do, but since it is not actually a partitioner, those metadata and id post-processing steps are not needed and cause these and perhaps other unwelcome behaviors.
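
The effect can be illustrated with a toy decorator. This is a sketch of the general mechanism only, not the actual implementation of @process_metadata() or @add_metadata_with_filetype(): `add_file_metadata` and `rehydrate` are hypothetical names, and plain dicts stand in for elements. A decorator that stamps file-derived metadata is appropriate for a function that reads a source document, but when it wraps a function that merely deserializes, it overwrites values that were already correct.

```python
import functools
import json
from typing import Any, Callable

ElementDicts = list[dict[str, Any]]


def add_file_metadata(func: Callable[[str, str], ElementDicts]) -> Callable[[str, str], ElementDicts]:
    """Hypothetical post-processing decorator: stamp metadata from the file being read."""

    @functools.wraps(func)
    def wrapper(filename: str, payload: str) -> ElementDicts:
        elements = func(filename, payload)
        for element in elements:
            # Overwrites whatever the serialized element already carried.
            element["metadata"]["filename"] = filename
            element["metadata"]["filetype"] = "application/json"
        return elements

    return wrapper


def rehydrate(filename: str, payload: str) -> ElementDicts:
    """Deserialize only; the filename argument is unused here."""
    return json.loads(payload)


# The same function with partitioner-style post-processing applied on top.
rehydrate_with_postprocessing = add_file_metadata(rehydrate)

payload = json.dumps([
    {
        "element_id": "a06d2d9e65212d4aa955c3ab32950ffa",
        "metadata": {
            "filename": "simple.docx",
            "filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        },
        "text": "These are a few of my favorite things:",
        "type": "Title",
    }
])

plain = rehydrate("simple.json", payload)
processed = rehydrate_with_postprocessing("simple.json", payload)

print(plain[0]["metadata"]["filename"])      # simple.docx
print(processed[0]["metadata"]["filename"])  # simple.json
```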

Jul 09 '24 06:07 scanny