data-prep-kit
data-prep-kit copied to clipboard
Html2parquet Makefile added
Why are these changes needed?
Added a Makefile to enable tests
Related issue number (if any).
[Bug] html2parquet needs a Makefile to enable tests #513 https://github.com/IBM/data-prep-kit/issues/513
Thank you, @sungeunan-ibm for doing this so quickly! Please resolve the conflicts.
@sungeunan-ibm please resolve the conflicts and make sure the ci/cd tests are running after that. Thanks!
@sungeunan-ibm It will be easier for me to fix the CI/CD issues that this PR is running into and let you focus on the ray version of the transform. If you can confirm that all the Unit Tests for your python transform are passing, then I will take it from here to fix the remaining issues with this PR. I will tag you if needed. Please create a new PR for the Ray implementation and keep it in draft mode until all the Unit Test for the Ray code has passed. Thanks for all your help. This is great work! cc: @daw3rd @shahrokhDaijavad
@sungeunan-ibm please sign your commits using git commit -s ... Note that git desktop does not seem to support signing.
@sungeunan-ibm For now let's ignore the kfp test failures and focus on the html2parquet test failures. You can run make test-src and make test-image in the python directory to run your tests.
The unit test runs with make test-src, but there’s an issue with the date_acquired column, which uses datetime.now().isoformat(). This causes timestamp changes with each test run, which is causing mismatches with the expected output. Should we remove the date_acquired column, or is there a way to run the test without it?
Thanks, @sungeunan-ibm @daw3rd Do we care about the timestamps not matching? Do other transforms behave the same way?
The unit test runs with make test-src, but there’s an issue with the
date_acquiredcolumn, which usesdatetime.now().isoformat(). This causes timestamp changes with each test run, which is causing mismatches with the expected output. Should we remove the date_acquired column, or is there a way to run the test without it?
Thanks @sungeunan-ibm I will check to see if the framework allow us to exclude the time column from the validation step. I think we had a similar requirement from another transform and seems like a reasonable thing for the framework to do. I am not if @daw3rd had a chance to address that but I will investigate.
@sungeunan-ibm As you can see, @touma-I has now added issue #559 as a priority. This would take precedence over the Ray-scalable version of this transform. If you can resolve the conflict shown, I think we can live with the timestamp issue. We can then merge this branch and then you can look into adding markdown as the default option for what will be in the content column of the parquet file, with text being the non-default option (assuming Trafilatura gives this).
Still shows a conflict. @daw3rd Can you please help? It seems easy enough, but I don't want to make a mistake with the .make.versions file.
@sungeunan-ibm it looks like maybe you need to update the test-data/expected test data to match the new date-acquired column. The launcher-based tests can be made to ignore selected columns (see latest noop tests). However, the non-launcher pure transform tests do not allow this per issue #564. If you're running into that, you may need to disable the non-transform tests - the launcher tests are a super set really so should be sufficient for now.
@daw3rd I agree, and I want to disable the non-transform tests. Is this the correct way to do it? Is this how you would disable it in the Makefile #test-src:: .transforms.test-src?
@daw3rd I agree, and I want to disable the non-transform tests. Is this the correct way to do it? Is this how you would disable it in the Makefile
#test-src:: .transforms.test-src? @sungeunan-ibm
No, just delete the test/test_html2parquet.py. We don't want to turn off all tests, just the non-launcher test.
See https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/test/test_pdf2parquet_python.py for an example of ignoring columns in the launcher-based test.