
bug/group_bullet_paragraph causes problems by returning a list

Open rchen19 opened this issue 4 months ago • 0 comments

Describe the bug

Passing unstructured.cleaners.core.group_bullet_paragraph in UnstructuredBaseLoader's post_processors breaks loading. group_bullet_paragraph returns a List[str], but the unstructured.documents.elements.Text.apply() method checks the cleaner output and raises an error if it is not a str, see here:

if not isinstance(cleaned_text, str):  # pyright: ignore[reportUnnecessaryIsInstance]
    raise ValueError("Cleaner produced a non-string output.")
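The failure can be seen in isolation with a minimal sketch. Here apply_cleaners is a simplified stand-in for the check quoted above, not the library's actual code, and list_returning_cleaner is a hypothetical cleaner that only mimics the List[str] return type of group_bullet_paragraph:

```python
from typing import Callable, List, Union

CleanerOutput = Union[str, List[str]]


def apply_cleaners(text: str, *cleaners: Callable[[str], CleanerOutput]) -> str:
    """Simplified stand-in for the quoted check (assumption, not library code)."""
    cleaned_text: CleanerOutput = text
    for cleaner in cleaners:
        cleaned_text = cleaner(cleaned_text)
    # Same guard as in the snippet above: anything but a str is rejected.
    if not isinstance(cleaned_text, str):
        raise ValueError("Cleaner produced a non-string output.")
    return cleaned_text


def list_returning_cleaner(text: str) -> List[str]:
    """Hypothetical cleaner that returns a list, like group_bullet_paragraph."""
    return [line.strip() for line in text.split("\n") if line.strip()]


try:
    apply_cleaners("• first item\n• second item", list_returning_cleaner)
    failure_message = None
except ValueError as exc:
    # The list output trips the isinstance guard.
    failure_message = str(exc)
```

A cleaner that returns a str (for example str.strip) passes through the same guard without error.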

To Reproduce

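# Import path assumed; older LangChain releases expose this under langchain.document_loaders
from langchain_community.document_loaders import UnstructuredFileLoader
from unstructured.cleaners.core import group_bullet_paragraph

loader = UnstructuredFileLoader(
    "some_file_that_has_bullet_points.pdf",
    mode="elements",
    pdf_infer_table_structure=True,
    skip_infer_table_types=["jpg", "png", "xls", "xlsx"],
    show_progress=True,
    post_processors=[group_bullet_paragraph],  # returns List[str], which triggers the error
)
docs = loader.load()  # raises ValueError: Cleaner produced a non-string output.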

Expected behavior

The list of strings should be joined into a single string. I propose replacing:

if not isinstance(cleaned_text, str):  # pyright: ignore[reportUnnecessaryIsInstance]
    raise ValueError("Cleaner produced a non-string output.")

with something like:

if isinstance(cleaned_text, list):
    cleaned_text = " ".join(cleaned_text)
if not isinstance(cleaned_text, str):  # pyright: ignore[reportUnnecessaryIsInstance]
    raise ValueError("Cleaner produced a non-string output.")
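Until something like this lands, a possible workaround is to wrap the cleaner so it always returns a str before handing it to post_processors. The sketch below is an assumption, not part of either library: as_str_cleaner is a hypothetical helper, and fake_bullet_grouper is a stand-in that only mimics the List[str] return type of group_bullet_paragraph.

```python
from typing import Callable, List, Union


def as_str_cleaner(
    cleaner: Callable[[str], Union[str, List[str]]], sep: str = " "
) -> Callable[[str], str]:
    """Wrap a cleaner so that a list result is joined into one string."""

    def wrapped(text: str) -> str:
        result = cleaner(text)
        if isinstance(result, list):
            return sep.join(result)
        return result

    return wrapped


def fake_bullet_grouper(text: str) -> List[str]:
    """Hypothetical stand-in returning one string per bullet."""
    return [part.strip() for part in text.split("•") if part.strip()]


# The wrapped cleaner now yields a single str instead of a list.
joined = as_str_cleaner(fake_bullet_grouper)("• first item • second item")
```

With the real cleaner, the call would presumably look like post_processors=[as_str_cleaner(group_bullet_paragraph)]; this has not been verified against the loader. Passing sep="\n" keeps each bullet on its own line instead of collapsing them with spaces.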

rchen19, Feb 13 '24 23:02