[Feature] Html2ParquetTransform support output_format_value json
Search before asking
- [X] I searched the issues and found no similar issues.
Component
Transforms/Other
Feature
Running through "RAG with Data Prep Kit" focusing on Step-2:
https://github.com/IBM/data-prep-kit/blob/6a06d8763ede388a93e956635af530df3494a9c8/examples/notebooks/rag/rag_1A_dpk_process_python.ipynb
Step-2: Process Input Documents (RAG stage 1, 2 & 3)
This code uses DPK to
Extract text from PDFs (RAG stage-1)
Performs de-dupes (RAG stage-1)
split the documents into chunks (RAG stage-2)
vectorize the chunks (RAG stage-3)
In Extract text from PDFs (RAG stage-1), calling pdf2parquet_transform_python offers three options for the output type, pdf2parquet_contents_types: markdown, text, json
https://github.com/IBM/data-prep-kit/blob/6a06d8763ede388a93e956635af530df3494a9c8/transforms/language/pdf2parquet/dpk_pdf2parquet/transform.py#L67
class pdf2parquet_contents_types(str, enum.Enum):
    MARKDOWN = "text/markdown"
    TEXT = "text/plain"
    JSON = "application/json"
The next step, split the documents into chunks (RAG stage-2), calls doc_chunk, which offers three options for chunking_type: dl_json, li_markdown, li_token_text, with default = dl_json
https://github.com/IBM/data-prep-kit/blob/6a06d8763ede388a93e956635af530df3494a9c8/transforms/language/doc_chunk/README.md?plain=1#L66
Thus the example code uses Stage-1 output = json and Stage-2 chunking_type = dl_json.
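The pairing the example relies on can be sketched as a small lookup. The mapping below is an assumption inferred only from the option names listed above (JSON content feeding dl_json, Markdown feeding li_markdown, plain text feeding li_token_text); `CHUNKING_FOR_CONTENT` and `chunking_type_for` are hypothetical names and not part of DPK.

```python
import enum


class pdf2parquet_contents_types(str, enum.Enum):
    # Values copied from the pdf2parquet enum quoted above.
    MARKDOWN = "text/markdown"
    TEXT = "text/plain"
    JSON = "application/json"


# Hypothetical lookup: which doc_chunk chunking_type fits each Stage-1 output.
# Inferred from the option names only; not taken from the DPK source.
CHUNKING_FOR_CONTENT = {
    pdf2parquet_contents_types.JSON: "dl_json",
    pdf2parquet_contents_types.MARKDOWN: "li_markdown",
    pdf2parquet_contents_types.TEXT: "li_token_text",
}


def chunking_type_for(contents_type: pdf2parquet_contents_types) -> str:
    """Return the chunking_type assumed to match a Stage-1 contents type."""
    return CHUNKING_FOR_CONTENT[contents_type]


if __name__ == "__main__":
    # The RAG example keeps the defaults: JSON out of Stage-1, dl_json in Stage-2.
    print(chunking_type_for(pdf2parquet_contents_types.JSON))
```

Under this reading, switching Stage-1 to Markdown or plain text means switching Stage-2 away from its dl_json default as well.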
When attempting to change Stage-1 to Extract text from HTML using html2parquet_transform_python, there are only two options for html2parquet_output_format: markdown, txt
https://github.com/IBM/data-prep-kit/blob/6a06d8763ede388a93e956635af530df3494a9c8/transforms/language/html2parquet/dpk_html2parquet/transform.py#L168
class html2parquet_output_format(str, enum.Enum):
MARKDOWN = "markdown"
TEXT = "txt"
However, html2parquet_transform_python is documented as using Trafilatura, and Trafilatura also supports JSON output:
https://trafilatura.readthedocs.io/en/latest/usage-python.html
Output
By default, the output is in plain text (TXT) format without metadata. The following additional formats are available:
CSV
HTML (from version 1.11 onwards)
JSON
Markdown (from version 1.9 onwards)
XML and XML-TEI (following the guidelines of the Text Encoding Initiative)
To specify the output format, use one of the following strings: "csv", "json", "html", "markdown", "txt", "xml", "xmltei".
I will attempt to change Stage-2 to work on markdown/text, to align with the currently supported output formats of html2parquet_transform_python.
However, it seems html2parquet_transform_python could allow html2parquet_output_format: json, which would pass through to Trafilatura, which already supports JSON. This would keep the flow of Stage-2 and beyond in the RAG example(s) unchanged, since they default to JSON.
Could you please consider adding JSON support to html2parquet_output_format (similar to the code below, plus whatever other downstream changes may be required), to align with the pdf2parquet_contents_types options and with the formats Trafilatura already supports?
class html2parquet_output_format(str, enum.Enum):
    MARKDOWN = "markdown"
    TEXT = "txt"
    JSON = "json"
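A minimal sketch of how the new member could pass through to the extractor, assuming the transform forwards the enum value as Trafilatura's `output_format` string. `extract_content` and its `extract_fn` parameter are hypothetical names used for illustration; the actual html2parquet code path may differ.

```python
import enum
from typing import Callable, Optional

# Format strings listed in the Trafilatura documentation quoted above.
TRAFILATURA_FORMATS = {"csv", "json", "html", "markdown", "txt", "xml", "xmltei"}


class html2parquet_output_format(str, enum.Enum):
    MARKDOWN = "markdown"
    TEXT = "txt"
    JSON = "json"  # proposed addition


def extract_content(
    html: str,
    output_format: html2parquet_output_format,
    extract_fn: Callable[..., Optional[str]],
) -> Optional[str]:
    """Hypothetical helper: forward the enum value to a Trafilatura-style extractor."""
    # Guard against enum members the extractor does not understand.
    if output_format.value not in TRAFILATURA_FORMATS:
        raise ValueError(f"unsupported output format: {output_format.value}")
    # The enum value is passed unchanged as the extractor's output_format keyword.
    return extract_fn(html, output_format=output_format.value)
```

In practice `extract_fn` would be `trafilatura.extract`, which accepts an `output_format` keyword; taking it as a parameter keeps the sketch runnable without the library installed.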
Thank you for your consideration and thank you for a great software tool.
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Thanks @1337stn. I do think this will be needed and would like @shahrokhDaijavad and @sungeunan-ibm to weigh in. But I think you should proceed with a PR. Thanks
I think this is a good suggestion. Supporting JSON as an additional output format for the html2parquet transform and making it consistent with the pdf2parquet transform output formats is a nice addition, and the work is straightforward since Trafilatura already supports this. @1337stn I also think you should proceed with a PR. Thanks.
@1337stn is there a PR in progress for this?
@agoyal26 sorry, not yet. I will submit one. Sorry for the delay.
@agoyal26 PR submitted
https://github.com/data-prep-kit/data-prep-kit/pull/1187