Problems when parsing Chinese PDF documents

Open WangJiaxin-x opened this issue 9 months ago • 4 comments

Hi, when I use partition_type(file=io.BytesIO(file.file.read()),languages=["chi_sim"]) to parse Chinese PDF documents, I found that paragraph text is split so that each line becomes a separate element. Another problem is that the element type isn't accurate: it should be UncategorizedText but is actually Title.
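One way to make the miscategorization concrete is to tally how many elements land in each category. The snippet below is only a minimal sketch: it assumes the elements have already been converted to dicts with "type" and "text" keys (the shape of unstructured's JSON output), and both the sample records and the helper name summarize_types are hypothetical, not taken from the attached files.

```python
from collections import Counter

# Hypothetical sample records, shaped like unstructured's JSON output
# (each element is a dict with at least "type" and "text" keys).
sample_elements = [
    {"type": "Title", "text": "第一章 概述"},
    {"type": "Title", "text": "本文介绍了一种新的方法,"},
    {"type": "UncategorizedText", "text": "该方法可以用于处理中文文档。"},
]


def summarize_types(element_dicts):
    """Count how many elements fall into each category."""
    return Counter(d["type"] for d in element_dicts)


type_counts = summarize_types(sample_elements)
for category, count in type_counts.most_common():
    print(f"{category}: {count}")
```

A body-text line showing up under Title, as in the second sample record, is the kind of result described above.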

WangJiaxin-x avatar May 10 '24 08:05 WangJiaxin-x

Hi @WangJiaxin-x - do you have an example document available that we could use to replicate this? Thanks!

MthwRobinson avatar May 10 '24 12:05 MthwRobinson

OK, let me give an example. I am attaching two documents: one is the raw PDF file, and the other is the JSON output that I get from the code below. The bug is that some elements that should be Title are UncategorizedText in the result. The result also shows that some paragraphs are not recognized: as you can see in the JSON, each line within a paragraph is recognized as a separate element, so one paragraph is split into many elements. I don't think that is a good result. Hoping for your reply, thanks!

import json
from typing import Iterable, Optional

from unstructured.documents.elements import Element
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json, _fix_metadata_field_precision, elements_to_dicts

elements = partition_pdf(filename=r"C:\Users\A\Desktop\test.pdf",
                         languages=["chi_sim"])  # bytes -> BinaryIO


def elements_to_json_chi(
        elements: Iterable[Element],
        filename: Optional[str] = None,
        indent: int = 4,
        encoding: str = "utf-8",
) -> Optional[str]:
    """Saves a list of elements to a JSON file if filename is specified.

    Otherwise, return the list of elements as a string.
    """
    # -- serialize `elements` as a JSON array (str) --
    precision_adjusted_elements = _fix_metadata_field_precision(elements)
    element_dicts = elements_to_dicts(precision_adjusted_elements)
    json_str = json.dumps(element_dicts, ensure_ascii=False, indent=indent, sort_keys=True)

    if filename is not None:
        with open(filename, "w", encoding=encoding) as f:
            f.write(json_str)
        return None

    return json_str


elements_to_json_chi(elements, filename="./test_json.json")

Also, this shows that the function elements_to_json in the package unstructured.staging.base has an encoding problem with Chinese text: I think the ensure_ascii parameter passed to json.dumps should be False.

Attachments: test.pdf, test_json.json
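For context on the ensure_ascii point, here is a minimal standard-library sketch of the difference. It does not use unstructured at all, and the record is a made-up example. With the default ensure_ascii=True, json.dumps writes non-ASCII characters as \uXXXX escapes; the output is still valid JSON and decodes to the same data, but it is hard to read for Chinese text.

```python
import json

# Made-up record for illustration only.
record = {"text": "中文文档", "type": "Title"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record, sort_keys=True)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(record, ensure_ascii=False, sort_keys=True)

print(escaped)
print(readable)

# Both strings decode back to the same data.
assert json.loads(escaped) == json.loads(readable) == record
```

When writing the readable form to disk, the file should be opened with an explicit encoding such as utf-8, as the function above does.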

idiotTest avatar May 11 '24 03:05 idiotTest

Thank you for the example! We're tracking this and will investigate as soon as we can.

MthwRobinson avatar May 13 '24 12:05 MthwRobinson

OK, thanks for your help!

idiotTest avatar May 14 '24 08:05 idiotTest