unstructured
bug(json): partition() places entire JSON file into text of single element when `metadata_filename` has .html extension
Describe the bug
When partitioning a JSON file using partition() and providing a metadata_filename argument that has a .html extension, the result is a single element with the entire JSON file contents as its text.
To Reproduce
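from test_unstructured.unit_utils import example_doc_path
from unstructured.partition.auto import partition

file_path = example_doc_path("simple.json")
with open(file_path, "rb") as f:
    elements = partition(file=f, metadata_filename="simple.html")
print(f"{elements}")
print(f"{elements[0].text}")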
produces:
[<unstructured.documents.elements.NarrativeText object at 0x371953bb0>]
[
    {
        "element_id": "a06d2d9e65212d4aa955c3ab32950ffa",
        "metadata": {
            "category_depth": 0,
            "file_directory": "unstructured/example-docs",
            "filename": "simple.docx",
            "filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "languages": [
                "eng"
            ],
            "last_modified": "2024-07-06T16:44:51"
        },
        "text": "These are a few of my favorite things:",
        "type": "Title"
    },
    {
        "element_id": "b334c93e9b1cbca3b6f6d78ce8bc2484",
        "metadata": {
...
Expected behavior
The same output as elements_from_json("simple.json"). The metadata_filename argument should be ignored.
Additional context
- Because this behavior does not occur when using partition_json(), I believe it is an artifact of detect_filetype() somehow using metadata_filename for disambiguation. Because the original filename for serialized elements was definitely not something.json, that's not going to work for JSON files.
Turns out other weird things happen when the metadata_filename has different extensions, like simple.docx. So it would appear the file-type is getting mis-identified and the file is sent to the wrong partitioner.
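To make the suspected mechanism concrete, here is a toy sketch of how a filename hint can override what the bytes actually contain. This is not the code in detect_filetype(); every name below (EXTENSION_TO_TYPE, looks_like_serialized_elements, detect_by_name_first, detect_by_content_first) is hypothetical and exists only for illustration.

```python
import json
from pathlib import Path
from typing import Optional

# Hypothetical extension map, for illustration only; not the library's table.
EXTENSION_TO_TYPE = {".html": "html", ".docx": "docx", ".json": "json"}


def looks_like_serialized_elements(content: bytes) -> bool:
    """Return True when the content parses as a JSON list of objects."""
    try:
        data = json.loads(content)
    except ValueError:  # also covers JSONDecodeError and UnicodeDecodeError
        return False
    return isinstance(data, list) and all(isinstance(item, dict) for item in data)


def detect_by_name_first(content: bytes, metadata_filename: Optional[str]) -> str:
    """Suspected failure mode: the filename hint wins over the content."""
    if metadata_filename:
        ext = Path(metadata_filename).suffix.lower()
        if ext in EXTENSION_TO_TYPE:
            return EXTENSION_TO_TYPE[ext]
    return "json" if looks_like_serialized_elements(content) else "text"


def detect_by_content_first(content: bytes, metadata_filename: Optional[str]) -> str:
    """Expected behavior: JSON content is recognized regardless of the hint."""
    if looks_like_serialized_elements(content):
        return "json"
    return detect_by_name_first(content, metadata_filename)


payload = b'[{"type": "Title", "text": "These are a few of my favorite things:"}]'
print(detect_by_name_first(payload, "simple.html"))     # html
print(detect_by_content_first(payload, "simple.html"))  # json
```

With the name-first ordering, the payload is routed as HTML (or DOCX, for simple.docx) even though it is valid serialized-elements JSON, which would match the symptoms above if the real detector behaves similarly.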
Fixed months ago, probably in #3429.