data-prep-kit [Feature] Html2ParquetTransform support output_format

Search before asking

[X] I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

Running through "RAG with Data Prep Kit" focusing on Step-2:

https://github.com/IBM/data-prep-kit/blob/6a06d8763ede388a93e956635af530df3494a9c8/examples/notebooks/rag/rag_1A_dpk_process_python.ipynb

Step-2: Process Input Documents (RAG stage 1, 2 & 3)

This code uses DPK to:

    Extract text from PDFs (RAG stage-1)
    Perform de-duplication (RAG stage-1)
    Split the documents into chunks (RAG stage-2)
    Vectorize the chunks (RAG stage-3)

In "Extract text from PDFs" (RAG stage-1), the call to pdf2parquet_transform_python accepts three output options via pdf2parquet_contents_types: markdown, text, and json.

https://github.com/IBM/data-prep-kit/blob/6a06d8763ede388a93e956635af530df3494a9c8/transforms/language/pdf2parquet/dpk_pdf2parquet/transform.py#L67

class pdf2parquet_contents_types(str, enum.Enum):
    MARKDOWN = "text/markdown"
    TEXT = "text/plain"
    JSON = "application/json"

In the next step, "Split the documents into chunks" (RAG stage-2), the call to doc_chunk accepts three options for chunking_type: dl_json, li_markdown, and li_token_text, with dl_json as the default.

https://github.com/IBM/data-prep-kit/blob/6a06d8763ede388a93e956635af530df3494a9c8/transforms/language/doc_chunk/README.md?plain=1#L66

Thus the example code uses Stage-1 output = json and Stage-2 chunking_type = dl_json.
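To make the Stage-1/Stage-2 alignment concrete, here is a minimal sketch of the pairing as I understand it. The mapping table and the helper `chunking_type_for` are hypothetical and inferred only from the option names above; they are not part of DPK.

```python
import enum


class pdf2parquet_contents_types(str, enum.Enum):
    # Copied from the pdf2parquet enum quoted above.
    MARKDOWN = "text/markdown"
    TEXT = "text/plain"
    JSON = "application/json"


# Hypothetical pairing of Stage-1 contents types with Stage-2 chunking_type
# values, guessed from the option names (an assumption, not a DPK table).
CHUNKING_TYPE_FOR = {
    pdf2parquet_contents_types.JSON: "dl_json",
    pdf2parquet_contents_types.MARKDOWN: "li_markdown",
    pdf2parquet_contents_types.TEXT: "li_token_text",
}


def chunking_type_for(contents_type: str) -> str:
    """Return the chunking_type assumed to match a Stage-1 contents type."""
    return CHUNKING_TYPE_FOR[pdf2parquet_contents_types(contents_type)]


print(chunking_type_for("application/json"))  # dl_json
```

If this pairing is right, changing the Stage-1 output format means changing the Stage-2 chunking_type to match.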

When attempting to change Stage-1 to extract text from HTML using html2parquet_transform_python, there are only two options for html2parquet_output_format: markdown and txt.

https://github.com/IBM/data-prep-kit/blob/6a06d8763ede388a93e956635af530df3494a9c8/transforms/language/html2parquet/dpk_html2parquet/transform.py#L168

class html2parquet_output_format(str, enum.Enum):
    MARKDOWN = "markdown"
    TEXT = "txt"
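A small sketch of what this means in practice, assuming the transform resolves the requested format by looking up the enum member by value (that lookup, and the helper names `parse_output_format` and `is_supported`, are my assumptions for illustration):

```python
import enum


class html2parquet_output_format(str, enum.Enum):
    # Copied from the html2parquet enum quoted above.
    MARKDOWN = "markdown"
    TEXT = "txt"


def parse_output_format(value: str) -> html2parquet_output_format:
    """Hypothetical helper: look up an enum member by its string value."""
    return html2parquet_output_format(value)


def is_supported(value: str) -> bool:
    """Return True if the value maps to a member of the enum."""
    try:
        parse_output_format(value)
    except ValueError:
        # Enum lookup by value raises ValueError for unknown values.
        return False
    return True


print(is_supported("markdown"))  # True
print(is_supported("json"))      # False
```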

However, html2parquet_transform_python reports that it uses Trafilatura, and Trafilatura also supports JSON output:

https://trafilatura.readthedocs.io/en/latest/usage-python.html

Output

By default, the output is in plain text (TXT) format without metadata. The following additional formats are available:

    CSV
    HTML (from version 1.11 onwards)
    JSON
    Markdown (from version 1.9 onwards)
    XML and XML-TEI (following the guidelines of the Text Encoding Initiative)

To specify the output format, use one of the following strings: "csv", "json", "html", "markdown", "txt", "xml", "xmltei".

I will attempt to change Stage-2 to work on markdown/text, to align with the currently supported output formats of html2parquet_transform_python.

However, it seems html2parquet_transform_python could allow html2parquet_output_format: json, which would pass through to Trafilatura, which already supports JSON. This would allow the flow of Stage-2 and beyond in the RAG example(s) to be maintained, since they default to JSON.

Could you please consider adding JSON support in html2parquet_output_format (similar to the code below, plus whatever other downstream changes may be required), to align with the pdf2parquet_contents_types options and with the options Trafilatura already supports?

class html2parquet_output_format(str, enum.Enum):
    MARKDOWN = "markdown"
    TEXT = "txt"
    JSON = "json"
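A rough sketch of the pass-through I have in mind, under stated assumptions: `convert_html`, `TRAFILATURA_FORMATS` and `fake_extract` are hypothetical names for illustration and are not taken from the DPK source. The extractor is passed in as a callable so the sketch runs without Trafilatura installed; in real use it would presumably be `trafilatura.extract`, which takes an `output_format` keyword.

```python
import enum
from typing import Callable, Optional


class html2parquet_output_format(str, enum.Enum):
    MARKDOWN = "markdown"
    TEXT = "txt"
    JSON = "json"  # proposed addition


# Format strings listed in the Trafilatura documentation quoted above.
TRAFILATURA_FORMATS = {"csv", "json", "html", "markdown", "txt", "xml", "xmltei"}


def convert_html(
    html: str,
    output_format: str,
    extract: Callable[..., Optional[str]],
) -> Optional[str]:
    """Validate the requested format, then hand it to the extractor unchanged."""
    fmt = html2parquet_output_format(output_format)  # ValueError if unknown
    if fmt.value not in TRAFILATURA_FORMATS:
        raise ValueError(f"extractor does not support {fmt.value!r}")
    return extract(html, output_format=fmt.value)


def fake_extract(html: str, output_format: str = "txt") -> str:
    # Stand-in for the real extractor; it only echoes the format it received.
    return f"[{output_format}] {len(html)} chars"


print(convert_html("<p>hi</p>", "json", fake_extract))
```

Since the enum value is passed through unchanged, adding JSON should not need a translation table, though downstream code that assumes markdown or text contents may still need changes.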

Thank you for your consideration and thank you for a great software tool.

Are you willing to submit a PR?

[X] Yes I am willing to submit a PR!

Jan 02 '25 16:01 1337stn

Thanks @1337stn. I do think this will be needed and would like @shahrokhDaijavad and @sungeunan-ibm to weigh in. But I think you should proceed with a PR. Thanks

Jan 08 '25 12:01 touma-I

I think this is a good suggestion. Supporting JSON as an additional output format for the html2parquet transform and making it consistent with the pdf2parquet transform output formats is a nice addition, and the work is straightforward since Trafilatura already supports this. @1337stn I also think you should proceed with a PR. Thanks.

Jan 08 '25 15:01 shahrokhDaijavad

@1337stn is there a PR in progress for this?

Feb 11 '25 07:02 agoyal26

@agoyal26 sorry, not yet. I will submit one. Sorry for the delay.

Apr 02 '25 00:04 1337stn

@agoyal26 PR submitted

https://github.com/data-prep-kit/data-prep-kit/pull/1187

Apr 06 '25 21:04 1337stn
