unstructured
Problems when parsing Chinese PDF documents
Hi, when I use partition_type(file=io.BytesIO(file.file.read()), languages=["chi_sim"]) to parse Chinese PDF documents, I find that each paragraph is split so that every line of text becomes its own element. Another problem is that the element type isn't accurate: it should be UncategorizedText but is actually Title.
Hi @WangJiaxin-x - do you have an example document available that we could use to replicate this? Thanks!
OK, let me give an example. I'm attaching two documents: one is the raw PDF file, the other is the JSON I get with the code below. The bug is that some elements should be Title but are UncategorizedText in the result. Also, the result shows that some paragraphs aren't recognized as paragraphs. As you can see in the JSON, each line of a paragraph is recognized as a separate element, so one paragraph is split into many elements. I don't think that is a good result.
Hoping for your reply. Thanks!
import json
from typing import Iterable, Optional

from unstructured.documents.elements import Element
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json, _fix_metadata_field_precision, elements_to_dicts

elements = partition_pdf(filename=r"C:\Users\A\Desktop\test.pdf",
                         languages=["chi_sim"])  # bytes -> BinaryIO


def elements_to_json_chi(
    elements: Iterable[Element],
    filename: Optional[str] = None,
    indent: int = 4,
    encoding: str = "utf-8",
) -> Optional[str]:
    """Saves a list of elements to a JSON file if filename is specified.

    Otherwise, return the list of elements as a string.
    """
    # -- serialize `elements` as a JSON array (str) --
    precision_adjusted_elements = _fix_metadata_field_precision(elements)
    element_dicts = elements_to_dicts(precision_adjusted_elements)
    # -- ensure_ascii=False writes Chinese characters as-is instead of \uXXXX escapes --
    json_str = json.dumps(element_dicts, ensure_ascii=False, indent=indent, sort_keys=True)

    if filename is not None:
        with open(filename, "w", encoding=encoding) as f:
            f.write(json_str)
        return None
    return json_str


elements_to_json_chi(elements, filename="./test_json.json")
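To make the two problems easier to see in the output, here is a minimal sketch that tallies element types and lists the Title texts. It assumes each serialized element is a dict with "type" and "text" keys; the helper name summarize_elements and the sample data are hypothetical and are not taken from the attached test_json.json.

```python
from collections import Counter


def summarize_elements(element_dicts):
    """Return (counts per element type, texts of elements typed as Title)."""
    counts = Counter(d.get("type", "<missing>") for d in element_dicts)
    titles = [d.get("text", "") for d in element_dicts if d.get("type") == "Title"]
    return counts, titles


# Hypothetical stand-in for the list loaded from ./test_json.json with json.load().
sample = [
    {"type": "Title", "text": "第一章"},
    {"type": "UncategorizedText", "text": "这是一段正文的第一行，"},
    {"type": "UncategorizedText", "text": "这是同一段正文的第二行。"},
]

counts, titles = summarize_elements(sample)
print(counts)
print(titles)
```

A large number of short UncategorizedText entries compared with the number of paragraphs in the PDF would be consistent with the line-by-line splitting described above.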
This also shows that the elements_to_json function in the unstructured.staging.base package has an encoding problem with Chinese text. I think the ensure_ascii parameter of json.dump should be False.
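A small standard-library sketch of the difference, using a hypothetical sample element rather than real output. It only shows how json.dumps behaves with and without ensure_ascii; it makes no claim about how elements_to_json is implemented.

```python
import json

# Hypothetical sample element, not taken from the attached test_json.json.
element_dicts = [{"type": "Title", "text": "中文标题"}]

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(element_dicts)

# ensure_ascii=False: Chinese characters are written as-is.
readable = json.dumps(element_dicts, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode to the same data; only the written form differs.
assert json.loads(escaped) == json.loads(readable)
```

The escaped form is still valid JSON and loads back correctly, so the difference mainly affects how readable the saved file is when opened directly.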
test.pdf
test_json.json
Thank you for the example! We're tracking this and will investigate as soon as we can.
OK, thanks for your help!