data-prep-kit
[Feature] Add support to process HTML file format
Search before asking
- [X] I searched the issues and found no similar issues.
Component
Tools/ingest2parquet
Feature
We would like to add the ability to read HTML files and convert them to parquet files, which can then go through other processing modules such as dedup and filtering.
A library that could be used for this: https://trafilatura.readthedocs.io/en/latest/
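To make the request concrete, here is a minimal sketch of the HTML-to-parquet flow, assuming `trafilatura` and `pyarrow` are installed. It is not the project's implementation: the function names and the `filename` / `contents` column names are hypothetical and chosen only for illustration.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Optional


def extract_with_trafilatura(html: str) -> Optional[str]:
    """Extract the main text of an HTML page.

    trafilatura is imported lazily so the rest of the sketch can be
    exercised with a different extractor.
    """
    import trafilatura

    return trafilatura.extract(html)


def html_files_to_rows(
    paths: Iterable[Path],
    extract: Callable[[str], Optional[str]] = extract_with_trafilatura,
) -> List[Dict[str, str]]:
    """Read each HTML file and build one row per file with extractable text."""
    rows = []
    for path in paths:
        html = Path(path).read_text(encoding="utf-8", errors="replace")
        text = extract(html)
        if text:  # skip pages where nothing could be extracted
            # Column names are hypothetical, not a documented schema.
            rows.append({"filename": Path(path).name, "contents": text})
    return rows


def write_rows_to_parquet(rows: List[Dict[str, str]], out_path: str) -> None:
    """Write the rows to a single parquet file using pyarrow."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    # An explicit schema keeps the output well-defined even for zero rows.
    schema = pa.schema([("filename", pa.string()), ("contents", pa.string())])
    table = pa.Table.from_pylist(rows, schema=schema)
    pq.write_table(table, out_path)
```

A possible usage, with hypothetical input and output paths, would be `write_rows_to_parquet(html_files_to_rows(Path("input").glob("*.html")), "output.parquet")`. Passing the extractor as a parameter keeps the file handling independent of the extraction library.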
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
I believe this is satisfied by the html2parquet transform: https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/html2parquet. I wonder, though, whether it should be moved to language?
@daw3rd The Python version of this has been merged, so we can close the issue. What remains is the Ray version, which @sungeunan-ibm is working on. As for moving it from universal to language, let's do that after the Ray version is finished.
@shahrokhDaijavad @daw3rd I would close this issue and open a new one for the Ray version. I agree with David that this should be moved to the language folder.