[Feature] Add support to process HTML file format

Open Bytes-Explorer opened this issue 1 year ago • 3 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

Feature

We would like to add the ability to read HTML files and convert them to Parquet files, which can then go through other processing modules such as dedup and filtering.

A library that could be used for this: https://trafilatura.readthedocs.io/en/latest/
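
To make the request concrete, here is a minimal sketch of the HTML-to-Parquet flow described above. It is only an illustration under stated assumptions, not the data-prep-kit implementation: the function names (`plain_text`, `html_to_rows`, `write_parquet`) and the column names (`filename`, `contents`) are hypothetical. The default extractor is a standard-library stand-in so the sketch runs without extra dependencies; `trafilatura.extract` could be passed as `extractor` instead, and writing the file assumes `pyarrow` is installed.

```python
from html.parser import HTMLParser
from typing import Callable, Iterable, Optional


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def plain_text(html: str) -> str:
    """Stand-in extractor (stdlib only): strips tags, keeps visible text."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def html_to_rows(
    docs: Iterable[tuple[str, str]],
    extractor: Callable[[str], Optional[str]] = plain_text,
) -> list[dict]:
    """Turn (filename, html) pairs into row dicts; column names are assumed."""
    rows = []
    for name, html in docs:
        text = extractor(html)
        if text:  # skip documents where nothing could be extracted
            rows.append({"filename": name, "contents": text})
    return rows


def write_parquet(rows: list[dict], path: str) -> None:
    """Write the rows to a Parquet file (assumes pyarrow is installed)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), path)
```

With trafilatura installed, `html_to_rows(docs, extractor=trafilatura.extract)` would swap in its main-content extraction, which returns `None` when it finds nothing and is therefore skipped by the check above.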

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Bytes-Explorer avatar May 21 '24 15:05 Bytes-Explorer

I believe this is satisfied by the html2parquet transform: https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/html2parquet. Although I wonder whether it should be moved to language?

daw3rd avatar Sep 13 '24 16:09 daw3rd

@daw3rd The Python version of this has been merged, so we can close the issue. What remains is the Ray version, which @sungeunan-ibm is working on. As for moving it from universal to language, let's do that after the Ray version is finished.

shahrokhDaijavad avatar Sep 13 '24 18:09 shahrokhDaijavad

@shahrokhDaijavad @daw3rd I would close this issue and open a new one for the Ray version. I agree with David that this should be moved to the language folder.

Bytes-Explorer avatar Sep 18 '24 04:09 Bytes-Explorer