[Bug] pdf2parquet is now failing ci/cd builds
Search before asking
- [X] I searched the issues and found no similar issues.
Component
Transforms/Other
What happened + What you expected to happen
CI/CD builds are now failing for pdf2parquet in at least two unrelated PRs, and I can reproduce the failure locally on my Mac M1.
- https://github.com/IBM/data-prep-kit/actions/runs/10594844975/job/29359345957?pr=545
- https://github.com/IBM/data-prep-kit/actions/runs/10599667381/job/29378015278?pr=548
Reproduction script
cd transforms/language/pdf2parquet/python
make test-src
Anything else
E [21] Peng Zhang, Can Li, Liang Qiao, Zhanzhan Cheng, Shiliang Pu, Yi Niu, and Fei Wu. Vsr: A unified framework for document layout analysis combining vision, semantics and relations, 2021.
E
E [22] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , KDD, pages 774-782. ACM, 2018.
E
E [23] Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data , 6(1):60, 2019."]]
E num_pages: [[9]]
E num_tables: [[5]]
E num_doc_elements: [[147]]
E ext: [["pdf"]]
E hash: [["313bb7ef50bea94a1ef5ae4417f45923cb4ac383d49ba781c874eb9bfbc06be0"]]
E size: [[41244]]
E source_filename: [["2206.01062.pdf"]]
E assert <pyarrow.lib....\n 148\n ]\n] == <pyarrow.lib....\n 147\n ]\n]
E
E Use -v to get more diff
../../../../../data-processing-lib/python/src/data_processing/test_support/abstract_test.py:135: AssertionError
------------------------------------------------------------------------- Captured log call -------------------------------------------------------------------------
INFO pdf2parquet_transform:pdf2parquet_transform.py:286 pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': False, 'do_ocr': False}
INFO data_processing.runtime.execution_configuration:execution_configuration.py:80 pipeline id pipeline_id
INFO data_processing.runtime.execution_configuration:execution_configuration.py:83 code location None
INFO data_processing.data_access.data_access_factory_base17f66004-ebd0-4524-8132-d2a68d6e87bd:data_access_factory.py:195 data factory data_ is using local data access: input_folder - /Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input output_folder - /tmp/pdf2parquet1ug5awwy
INFO data_processing.data_access.data_access_factory_base17f66004-ebd0-4524-8132-d2a68d6e87bd:data_access_factory.py:211 data factory data_ max_files -1, n_sample -1
INFO data_processing.data_access.data_access_factory_base17f66004-ebd0-4524-8132-d2a68d6e87bd:data_access_factory.py:225 data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.zip'], files to checkpoint ['.parquet']
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:46 orchestrator pdf2parquet started at 2024-08-28 14:28:22
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:64 Number of files is 2, source profile {'max_file_size': 4.401191711425781, 'min_file_size': 4.110984802246094, 'total_file_size': 8.512176513671875}
INFO pdf2parquet_transform:pdf2parquet_transform.py:88 Initializing models
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:166 Completed 1 files (50.0%) in 0.193 min
INFO pdf2parquet_transform:pdf2parquet_transform.py:186 Processing archive_doc_filename='2206.00785v1.pdf'
INFO pdf2parquet_transform:pdf2parquet_transform.py:186 Processing archive_doc_filename='2305.03393v1.pdf'
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:166 Completed 2 files (100.0%) in 0.599 min
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:170 Done processing 2 files, waiting for flush() completion.
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:174 done flushing in 0.0 sec
INFO data_processing.runtime.pure_python.transform_launcher:transform_launcher.py:88 Completed execution in 0.663 min, execution result 0
WARNING data_processing.test_support.abstract_test:abstract_test.py:214 Differences in metadata.json being ignored for now.
INFO data_processing.test_support.abstract_test:abstract_test.py:261 Copying file with difference: /tmp/pdf2parquet1ug5awwy/2206.01062.parquet to /tmp/2206.01062.parquet
========================================================================= warnings summary ==========================================================================
test/test_pdf2parquet_python.py::TestPythonPdf2ParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected-ignore_columns0]
test/test_pdf2parquet_python.py::TestPythonPdf2JsonParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected_json-ignore_columns0]
/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py:307: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance)
warnings.warn(f"enable_nested_tensor is True, but self.use_nested_tensor is False because {why_not_sparsity_fast_path}")
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================== short test summary info ======================================================================
FAILED test_pdf2parquet_python.py::TestPythonPdf2ParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected-ignore_columns0] - AssertionError: Row 0 of table 0 are not equal
FAILED test_pdf2parquet_python.py::TestPythonPdf2JsonParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected_json-ignore_columns0] - AssertionError: Row 0 of table 0 are not equal
FAILED test_pdf2parquet_python.py::TestPythonPdf2ParquetNoTableTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected_md_no_table-ignore_columns0] - AssertionError: Row 0 of table 0 are not equal
======================================================== 3 failed, 1 passed, 2 warnings in 202.35s (0:03:22) ========================================================
make: *** [.defaults.test-src] Error 1
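The assertion output above is truncated, so it can be hard to see which column actually differs between the generated and expected row. A minimal sketch of a column-by-column comparison follows. The helper name `diff_columns` is hypothetical and not part of data-prep-kit, and the sample values are copied from the log above for illustration only (which side is "actual" and which is "expected" is an assumption).

```python
# Hypothetical helper, not part of data-prep-kit: report which columns differ
# between two tables represented as {column_name: list_of_values} dicts.

_MISSING = object()  # sentinel for a column absent from one side


def diff_columns(actual, expected, ignore=()):
    """Return {column: (actual_values, expected_values)} for differing columns."""
    diffs = {}
    for name in sorted(set(actual) | set(expected)):
        if name in ignore:
            continue
        a = actual.get(name, _MISSING)
        e = expected.get(name, _MISSING)
        if a is _MISSING or e is _MISSING or a != e:
            diffs[name] = (
                None if a is _MISSING else a,
                None if e is _MISSING else e,
            )
    return diffs


# Illustrative values taken from the log above; the actual/expected
# assignment is an assumption.
actual = {"num_pages": [9], "num_tables": [5], "num_doc_elements": [148]}
expected = {"num_pages": [9], "num_tables": [5], "num_doc_elements": [147]}

print(diff_columns(actual, expected))
# {'num_doc_elements': ([148], [147])}
```

If pyarrow is installed, the two dicts could be obtained with `pyarrow.parquet.read_table(path).to_pydict()` on the copied file `/tmp/2206.01062.parquet` and its counterpart under `test-data/expected`.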
OS
Ubuntu, MacOS (limited support)
Python
3.11.x
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
@daw3rd this is now solved, right?
@daw3rd Is this solved? If not, pls provide error msg for @dolfim-ibm to continue investigating.
@dolfim-ibm this is still failing when I run it locally on my Mac M1. Again:
cd transforms/language/pdf2parquet/python
make test-src
...
test_pdf2parquet.py s
test_pdf2parquet_python.py Using temporary output path /tmp/pdf2parquetmy6c416f
12:35:18 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'double_precision': 0}
12:35:18 INFO - pipeline id pipeline_id
12:35:18 INFO - code location None
12:35:18 INFO - data factory data_ is using local data access: input_folder - /Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input output_folder - /tmp/pdf2parquetmy6c416f
12:35:18 INFO - data factory data_ max_files -1, n_sample -1
12:35:18 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.zip'], files to checkpoint ['.parquet']
12:35:18 INFO - orchestrator pdf2parquet started at 2024-09-12 12:35:18
12:35:18 INFO - Number of files is 2, source profile {'max_file_size': 0.3013172149658203, 'min_file_size': 0.2757863998413086, 'total_file_size': 0.5771036148071289}
12:35:18 INFO - Initializing models
README.md: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.49k/3.49k [00:00<00:00, 67.1MB/s]
Fetching 7 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 15.45it/s]
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1348, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1303, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1349, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1298, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1058, in _send_output
self.send(msg)
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 996, in send
self.connect()
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1475, in connect
self.sock = self._context.wrap_socket(self.sock,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 517, in wrap_socket
return self.sslsocket_class._create(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 1104, in _create
self.do_handshake()
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 1382, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)
...
@daw3rd This looks like a temporary network issue of your connection. Can you please verify again?
It works for me. Has to be a network glitch
@daw3rd Can you pls try again and let the team know?
@daw3rd I flagged this as fixed. Since you were the only one seeing this failure, do you mind giving it another try?
I'm still getting this SSL issue on my Mac M1. Does anyone else with an M1 want to try?
This seems to be a local mac/ssl/python configuration issue as I'm seeing this on unrelated efforts with python on mac.
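For anyone who wants to check whether their local Python can find a CA bundle at all, here is a small diagnostic sketch using only the standard library. The function name `describe_ca_config` is made up for this comment; it only reports where this interpreter looks for certificates and does not change anything.

```python
import os
import ssl


def describe_ca_config():
    """Collect where this Python/OpenSSL build looks for CA certificates."""
    paths = ssl.get_default_verify_paths()
    report = {
        # cafile/capath are None when the default location does not exist.
        "cafile": paths.cafile,
        "capath": paths.capath,
        "openssl_cafile": paths.openssl_cafile,
        "openssl_capath": paths.openssl_capath,
        # Environment overrides honoured by OpenSSL, if set.
        "SSL_CERT_FILE": os.environ.get("SSL_CERT_FILE"),
        "SSL_CERT_DIR": os.environ.get("SSL_CERT_DIR"),
        "openssl_cafile_exists": os.path.isfile(paths.openssl_cafile),
    }
    # Number of CA certificates loaded into a default context. Certificates
    # in a capath directory are loaded lazily, so 0 here is a hint rather
    # than proof that verification will fail.
    ctx = ssl.create_default_context()
    report["loaded_ca_certs"] = ctx.cert_store_stats()["x509_ca"]
    return report


if __name__ == "__main__":
    for key, value in describe_ca_config().items():
        print(f"{key}: {value}")
```

The traceback above points at `/Library/Frameworks/Python.framework/Versions/3.11`, which looks like a python.org installer build. Those builds ship an `Install Certificates.command` script in the Python application folder. If `cafile` comes back as `None`, running that script may be worth trying, though I have not confirmed that it is the cause here.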