
[Feature] Html2ParquetTransform support output_format_value json #908

Open 1337stn opened this issue 8 months ago • 2 comments

Why are these changes needed?

dpk_html2parquet.transform_python could accept html2parquet_output_format: json and pass it through to Trafilatura, which already supports JSON output. That would keep Stage-2 and later stages of the RAG example(s) working as they are, since they default to JSON.

Adding JSON as an output format for html2parquet_output_format would also make the html2parquet transform consistent with the pdf2parquet_output_format options and with the output formats Trafilatura already supports.
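A minimal sketch of the pass-through idea, not the transform's actual code: the helper names resolve_output_format and html_to_text are hypothetical, and the set of accepted values is only what this thread mentions (markdown as the default, txt, and the proposed json). The only library call assumed is trafilatura.extract with its output_format argument.

```python
# Formats named in this thread: markdown (default), txt, and the proposed json.
SUPPORTED_FORMATS = ("markdown", "txt", "json")


def resolve_output_format(value=None):
    """Normalize and validate an html2parquet_output_format value (hypothetical helper)."""
    fmt = (value or "markdown").strip().lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported output format {value!r}; expected one of {SUPPORTED_FORMATS}"
        )
    return fmt


def html_to_text(html, output_format=None):
    """Pass the validated format straight through to Trafilatura (hypothetical helper)."""
    # Imported lazily so the validation above can be used without Trafilatura installed.
    import trafilatura

    return trafilatura.extract(html, output_format=resolve_output_format(output_format))
```

With output_format set to json, Trafilatura returns a JSON string (or None when nothing could be extracted), so a downstream stage that expects JSON would parse the result with json.loads after checking for None.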

Related issue number (if any).

None

1337stn, Apr 06 '25 21:04

Hi, @1337stn. Thank you for your contribution. Now that you have added the JSON output option, can I ask you one (actually two!) favor(s) for the sake of the completeness of your work, before we merge it?

  1. Please modify the README file to say that, in addition to the default markdown and the second txt output options, there is now a third output option, JSON.
  2. To show that this option works, please add a cell to the notebook file (https://github.com/data-prep-kit/data-prep-kit/blob/dev/transforms/language/html2parquet/html2parquet.ipynb) that exercises this output json format, after exercising the default markdown. It would be fantastic if you could also add another cell that would exercise the txt option too (the original developers should have done this!). Thank you again.

shahrokhDaijavad, Apr 07 '25 15:04

@shahrokhDaijavad I have completed the requested changes. Sorry for the delay; I fell down a rabbit hole at work.

  1. README file now includes json as an option
  2. notebook has cells added to demo txt and json output options
  3. notebook had a typo, and it now seems to require lxml_html_clean, as my system complained without it
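For anyone hitting the same complaint, a small sketch of a pre-flight check that could sit at the top of the notebook. The helper name missing_modules is hypothetical, and the assumption that the error comes from lxml_html_clean not being installed is based only on the note above.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: lxml_html_clean is the only extra module the notebook needs.
missing = missing_modules(["lxml_html_clean"])
if missing:
    print("Missing modules:", ", ".join(missing), "(try: pip install lxml_html_clean)")
```

This only reports whether the module can be found; it does not install anything.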

1337stn, Jul 04 '25 15:07