data-prep-kit [Bug] pdf2parquet is now failing ci/cd builds

Search before asking

[X] I searched the issues and found no similar issues.

Component

Transforms/Other

What happened + What you expected to happen

CI/CD builds are now failing for pdf2parquet in at least two unrelated PRs, and I can reproduce the failure locally on my Mac M1.

https://github.com/IBM/data-prep-kit/actions/runs/10594844975/job/29359345957?pr=545
https://github.com/IBM/data-prep-kit/actions/runs/10599667381/job/29378015278?pr=548

Reproduction script

cd transforms/language/pdf2parquet/python
make test-src

Anything else

E [21] Peng Zhang, Can Li, Liang Qiao, Zhanzhan Cheng, Shiliang Pu, Yi Niu, and Fei Wu. Vsr: A unified framework for document layout analysis combining vision, semantics and relations, 2021.
E
E [22] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , KDD, pages 774-782. ACM, 2018.
E
E [23] Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data , 6(1):60, 2019."]]
E num_pages: [[9]]
E num_tables: [[5]]
E num_doc_elements: [[147]]
E ext: [["pdf"]]
E hash: [["313bb7ef50bea94a1ef5ae4417f45923cb4ac383d49ba781c874eb9bfbc06be0"]]
E size: [[41244]]
E source_filename: [["2206.01062.pdf"]]
E assert <pyarrow.lib....\n 148\n ]\n] == <pyarrow.lib....\n 147\n ]\n]
E
E Use -v to get more diff

../../../../../data-processing-lib/python/src/data_processing/test_support/abstract_test.py:135: AssertionError
------------------------------------------------------------------------- Captured log call -------------------------------------------------------------------------
INFO pdf2parquet_transform:pdf2parquet_transform.py:286 pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': False, 'do_ocr': False}
INFO data_processing.runtime.execution_configuration:execution_configuration.py:80 pipeline id pipeline_id
INFO data_processing.runtime.execution_configuration:execution_configuration.py:83 code location None
INFO data_processing.data_access.data_access_factory_base17f66004-ebd0-4524-8132-d2a68d6e87bd:data_access_factory.py:195 data factory data_ is using local data access: input_folder - /Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input output_folder - /tmp/pdf2parquet1ug5awwy
INFO data_processing.data_access.data_access_factory_base17f66004-ebd0-4524-8132-d2a68d6e87bd:data_access_factory.py:211 data factory data_ max_files -1, n_sample -1
INFO data_processing.data_access.data_access_factory_base17f66004-ebd0-4524-8132-d2a68d6e87bd:data_access_factory.py:225 data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.zip'], files to checkpoint ['.parquet']
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:46 orchestrator pdf2parquet started at 2024-08-28 14:28:22
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:64 Number of files is 2, source profile {'max_file_size': 4.401191711425781, 'min_file_size': 4.110984802246094, 'total_file_size': 8.512176513671875}
INFO pdf2parquet_transform:pdf2parquet_transform.py:88 Initializing models
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:166 Completed 1 files (50.0%) in 0.193 min
INFO pdf2parquet_transform:pdf2parquet_transform.py:186 Processing archive_doc_filename='2206.00785v1.pdf'
INFO pdf2parquet_transform:pdf2parquet_transform.py:186 Processing archive_doc_filename='2305.03393v1.pdf'
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:166 Completed 2 files (100.0%) in 0.599 min
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:170 Done processing 2 files, waiting for flush() completion.
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:174 done flushing in 0.0 sec
INFO data_processing.runtime.pure_python.transform_launcher:transform_launcher.py:88 Completed execution in 0.663 min, execution result 0
WARNING data_processing.test_support.abstract_test:abstract_test.py:214 Differences in metadata.json being ignored for now.
INFO data_processing.test_support.abstract_test:abstract_test.py:261 Copying file with difference: /tmp/pdf2parquet1ug5awwy/2206.01062.parquet to /tmp/2206.01062.parquet
========================================================================= warnings summary ==========================================================================
test/test_pdf2parquet_python.py::TestPythonPdf2ParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected-ignore_columns0]
test/test_pdf2parquet_python.py::TestPythonPdf2JsonParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected_json-ignore_columns0]
  /Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py:307: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance)
    warnings.warn(f"enable_nested_tensor is True, but self.use_nested_tensor is False because {why_not_sparsity_fast_path}")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================== short test summary info ======================================================================
FAILED test_pdf2parquet_python.py::TestPythonPdf2ParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected-ignore_columns0] - AssertionError: Row 0 of table 0 are not equal
FAILED test_pdf2parquet_python.py::TestPythonPdf2JsonParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected_json-ignore_columns0] - AssertionError: Row 0 of table 0 are not equal
FAILED test_pdf2parquet_python.py::TestPythonPdf2ParquetNoTableTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected_md_no_table-ignore_columns0] - AssertionError: Row 0 of table 0 are not equal
======================================================== 3 failed, 1 passed, 2 warnings in 202.35s (0:03:22) ========================================================
make: *** [.defaults.test-src] Error 1
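
The assertion above only reports that row 0 differs (148 vs. 147 in one column), so a column-by-column comparison may help narrow it down. The following is a rough sketch, not the project's own test code: `load_first_row` and `diff_rows` are hypothetical helper names, and the file paths in the usage comment are assumptions based on the folders shown in the log. It assumes pyarrow is installed in the test venv.

```python
from typing import Any, Dict, Iterable, List

_MISSING = object()  # sentinel so an absent column never compares equal to None


def load_first_row(parquet_path: str) -> Dict[str, Any]:
    """Read row 0 of a parquet file as a plain dict (needs pyarrow)."""
    import pyarrow.parquet as pq  # imported lazily; diff_rows works without it

    return pq.read_table(parquet_path).slice(0, 1).to_pylist()[0]


def diff_rows(
    row_a: Dict[str, Any],
    row_b: Dict[str, Any],
    ignore: Iterable[str] = (),
) -> List[str]:
    """Return the sorted names of columns whose values differ between two rows."""
    columns = (set(row_a) | set(row_b)) - set(ignore)
    return sorted(
        name
        for name in columns
        if row_a.get(name, _MISSING) != row_b.get(name, _MISSING)
    )


# Usage sketch (paths are assumptions taken from the log output above):
#   row_a = load_first_row("test-data/expected/2206.01062.parquet")
#   row_b = load_first_row("/tmp/2206.01062.parquet")
#   print(diff_rows(row_a, row_b))
```

If only `num_doc_elements` (and possibly `contents` or `hash`) shows up in the result, that would point to a change in how the document is parsed rather than in the test harness.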

OS

Ubuntu, MacOS (limited support)

Python

3.11.x

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Aug 28 '24 18:08 daw3rd

@daw3rd this is now solved, right?

Aug 30 '24 08:08 dolfim-ibm

@daw3rd Is this solved? If not, pls provide error msg for @dolfim-ibm to continue investigating.

Sep 04 '24 11:09 Bytes-Explorer

@dolfim-ibm this is still failing when run locally on a Mac M1. Again:

cd transforms/language/pdf2parquet/python
make test-src

...
test_pdf2parquet.py s
test_pdf2parquet_python.py Using temporary output path /tmp/pdf2parquetmy6c416f
12:35:18 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'double_precision': 0}
12:35:18 INFO - pipeline id pipeline_id
12:35:18 INFO - code location None
12:35:18 INFO - data factory data_ is using local data access: input_folder - /Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input output_folder - /tmp/pdf2parquetmy6c416f
12:35:18 INFO - data factory data_ max_files -1, n_sample -1
12:35:18 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.zip'], files to checkpoint ['.parquet']
12:35:18 INFO - orchestrator pdf2parquet started at 2024-09-12 12:35:18
12:35:18 INFO - Number of files is 2, source profile {'max_file_size': 0.3013172149658203, 'min_file_size': 0.2757863998413086, 'total_file_size': 0.5771036148071289}
12:35:18 INFO - Initializing models
README.md: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.49k/3.49k [00:00<00:00, 67.1MB/s]
Fetching 7 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 15.45it/s]
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1303, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1349, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1298, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1058, in _send_output
    self.send(msg)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 996, in send
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1475, in connect
    self.sock = self._context.wrap_socket(self.sock,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 517, in wrap_socket
    return self.sslsocket_class._create(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 1104, in _create
    self.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 1382, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)

...

Sep 12 '24 16:09 daw3rd

@daw3rd This looks like a temporary network issue with your connection. Can you please verify again?

Sep 16 '24 06:09 dolfim-ibm

It works for me. Has to be a network glitch

Sep 16 '24 09:09 blublinsky

@daw3rd Can you pls try again and let the team know?

Sep 17 '24 16:09 Bytes-Explorer

@daw3rd I flagged this as fixed. Since you were the only one seeing this failure, do you mind giving it another try?

Sep 20 '24 14:09 dolfim-ibm

I'm still getting this SSL issue on a Mac M1. Anyone else with an M1 want to try?

Sep 20 '24 14:09 daw3rd

This seems to be a local Mac/SSL/Python configuration issue, as I'm seeing the same error in unrelated Python work on this Mac.

Oct 29 '24 14:10 daw3rd
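
For anyone hitting the same `CERTIFICATE_VERIFY_FAILED` error, a quick way to check the local configuration is to ask the interpreter where its OpenSSL looks for CA certificates. This is a minimal diagnostic sketch using only the standard library; `describe_ca_config` is a hypothetical helper name, and it does not confirm the cause of the failure reported in this thread.

```python
import os
import ssl
from typing import Any, Dict


def describe_ca_config() -> Dict[str, Any]:
    """Report where this Python's OpenSSL looks for CA certificates."""
    paths = ssl.get_default_verify_paths()
    return {
        "openssl_version": ssl.OPENSSL_VERSION,
        # Resolved locations; None when the file or directory does not exist.
        "cafile": paths.cafile,
        "capath": paths.capath,
        # Compiled-in default bundle location, whether or not it exists.
        "openssl_cafile": paths.openssl_cafile,
        "openssl_cafile_exists": bool(paths.openssl_cafile)
        and os.path.exists(paths.openssl_cafile),
        # Environment overrides honoured by OpenSSL, if set.
        "SSL_CERT_FILE": os.environ.get("SSL_CERT_FILE"),
        "SSL_CERT_DIR": os.environ.get("SSL_CERT_DIR"),
    }


if __name__ == "__main__":
    for key, value in describe_ca_config().items():
        print(f"{key}: {value}")
```

If both `cafile` and `capath` come back as `None`, the interpreter has no CA bundle to verify against. The `/Library/Frameworks/Python.framework` paths in the traceback suggest a python.org installer build; those builds ship an `Install Certificates.command` script in the Python application folder that sets up a certifi-based bundle, so running it may be worth trying.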