data-prep-kit icon indicating copy to clipboard operation
data-prep-kit copied to clipboard

Html2parquet Makefile added

Open sungeunan-ibm opened this issue 1 year ago • 4 comments

Why are these changes needed?

Added a Makefile to enable tests

Related issue number (if any).

[Bug] html2parquet needs a Makefile to enable tests #513 https://github.com/IBM/data-prep-kit/issues/513

sungeunan-ibm avatar Aug 20 '24 23:08 sungeunan-ibm

Thank you, @sungeunan-ibm for doing this so quickly! Please resolve the conflicts.

shahrokhDaijavad avatar Aug 20 '24 23:08 shahrokhDaijavad

@sungeunan-ibm please resolve the conflicts and make sure the ci/cd tests are running after that. Thanks!

daw3rd avatar Aug 21 '24 12:08 daw3rd

@sungeunan-ibm It will be easier for me to fix the CI/CD issues that this PR is running into and let you focus on the ray version of the transform. If you can confirm that all the Unit Tests for your python transform are passing, then I will take it from here to fix the remaining issues with this PR. I will tag you if needed. Please create a new PR for the Ray implementation and keep it in draft mode until all the Unit Test for the Ray code has passed. Thanks for all your help. This is great work! cc: @daw3rd @shahrokhDaijavad

touma-I avatar Aug 21 '24 23:08 touma-I

@sungeunan-ibm please sign your commits using git commit -s ... Note that git desktop does not seem to support signing.

daw3rd avatar Aug 22 '24 20:08 daw3rd

@sungeunan-ibm For now let's ignore the kfp test failures and focus on the html2parquet test failures. You can run make test-src and make test-image in the python directory to run your tests.

daw3rd avatar Aug 27 '24 21:08 daw3rd

The unit test runs with make test-src, but there’s an issue with the date_acquired column, which uses datetime.now().isoformat(). This causes timestamp changes with each test run, which is causing mismatches with the expected output. Should we remove the date_acquired column, or is there a way to run the test without it?

sungeunan-ibm avatar Aug 30 '24 00:08 sungeunan-ibm

Thanks, @sungeunan-ibm @daw3rd Do we care about the timestamps not matching? Do other transforms behave the same way?

shahrokhDaijavad avatar Aug 30 '24 15:08 shahrokhDaijavad

The unit test runs with make test-src, but there’s an issue with the date_acquired column, which uses datetime.now().isoformat(). This causes timestamp changes with each test run, which is causing mismatches with the expected output. Should we remove the date_acquired column, or is there a way to run the test without it?

Thanks @sungeunan-ibm I will check to see if the framework allow us to exclude the time column from the validation step. I think we had a similar requirement from another transform and seems like a reasonable thing for the framework to do. I am not if @daw3rd had a chance to address that but I will investigate.

touma-I avatar Aug 30 '24 18:08 touma-I

@sungeunan-ibm As you can see, @touma-I has now added issue #559 as a priority. This would take precedence over the Ray-scalable version of this transform. If you can resolve the conflict shown, I think we can live with the timestamp issue. We can then merge this branch and then you can look into adding markdown as the default option for what will be in the content column of the parquet file, with text being the non-default option (assuming Trafilatura gives this).

shahrokhDaijavad avatar Aug 30 '24 18:08 shahrokhDaijavad

Still shows a conflict. @daw3rd Can you please help? It seems easy enough, but I don't want to make a mistake with the .make.versions file.

shahrokhDaijavad avatar Aug 30 '24 23:08 shahrokhDaijavad

@sungeunan-ibm it looks like maybe you need to update the test-data/expected test data to match the new date-acquired column. The launcher-based tests can be made to ignore selected columns (see latest noop tests). However, the non-launcher pure transform tests do not allow this per issue #564. If you're running into that, you may need to disable the non-transform tests - the launcher tests are a super set really so should be sufficient for now.

daw3rd avatar Sep 03 '24 19:09 daw3rd

@daw3rd I agree, and I want to disable the non-transform tests. Is this the correct way to do it? Is this how you would disable it in the Makefile #test-src:: .transforms.test-src?

sungeunan-ibm avatar Sep 04 '24 22:09 sungeunan-ibm

@daw3rd I agree, and I want to disable the non-transform tests. Is this the correct way to do it? Is this how you would disable it in the Makefile #test-src:: .transforms.test-src? @sungeunan-ibm

No, just delete the test/test_html2parquet.py. We don't want to turn off all tests, just the non-launcher test.

See https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/test/test_pdf2parquet_python.py for an example of ignoring columns in the launcher-based test.

daw3rd avatar Sep 09 '24 19:09 daw3rd