[Bug] Testing Rag notebook with latest release of pdf2Parquet, eDedup and DocID

Open touma-I opened this issue 1 year ago • 6 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

Component

Transforms/universal/doc_id, Transforms/universal/ededup, Transforms/Other, Other

What happened + What you expected to happen

  1. @dolfim-ibm When running the RAG notebook with the latest release of pdf2Parquet, the notebook crashes the first time it downloads the model. Re-running the cell does not reproduce the error: if the model is already in the .EasyOCR folder, the error does not show up. Details of the error can be found in cell 6 of this notebook: https://github.com/IBM/data-prep-kit/blob/t2/examples/notebooks/rag/rag_1A_dpk_process_ray.dev3.error.ipynb

  2. @sujee There are a few changes that need to be made to the notebook for it to work with the new release. Primarily:

     • replace launcher = RayTransformLauncher(EdedupRayTransformConfiguration()) with launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())
     • replace launcher = RayTransformLauncher(DocIDRayTransformConfiguration()) with launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())
     • replace output_df.sample(3) with output_df.sample(len(output_df))

    For a complete reference on the required changes, please see https://github.com/IBM/data-prep-kit/blob/t2/examples/notebooks/rag/rag_1A_dpk_process_ray.dev3.ipynb.
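The substitutions in item 2 can be sketched as a plain string rewrite over a notebook cell's source. This is only an illustration: `RENAMES` and `apply_renames` are hypothetical names, not part of data-prep-kit, and the old/new snippets are copied verbatim from this issue.

```python
# Old -> new snippets, copied from item 2 of this issue.
RENAMES = {
    "RayTransformLauncher(EdedupRayTransformConfiguration())":
        "RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())",
    "RayTransformLauncher(DocIDRayTransformConfiguration())":
        "RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())",
    "output_df.sample(3)": "output_df.sample(len(output_df))",
}


def apply_renames(source: str) -> str:
    """Return the cell source with every old snippet replaced by its new form."""
    for old, new in RENAMES.items():
        source = source.replace(old, new)
    return source


if __name__ == "__main__":
    cell = "launcher = RayTransformLauncher(DocIDRayTransformConfiguration())"
    print(apply_renames(cell))
```

None of the new snippets contains an old snippet as a substring, so running the helper twice leaves the text unchanged.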

Reproduction script

data-prep-kit/examples/notebooks/rag/requirements.txt was modified to temporarily load the various modules from git. Once this issue is resolved or a workaround has been identified, I will create a dev3 release. For now, please use the git repo as follows:

git clone https://github.com/IBM/data-prep-kit.git t2
cd t2/examples/notebooks/rag && git checkout t2
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
./venv/bin/jupyter lab

From the browser, select and run the notebook rag_1A_dpk_process_ray.dev3.ipynb.

cc: @Shahrok

Anything else

@dolfim-ibm : I noticed that pdf2parquet depends on docling==1.7.0 and doc_chunk depends on docling>=1.8.2,<2.0.0. In the requirements for the notebook, I changed pdf2parquet dependency to docling>=1.7.0

@dolfim-ibm : deepsearch-toolkit 1.0.0 requires platformdirs<4.0.0,>=3.5.1, but the ray runtime prefers 4.3.2 .

OS

MacOS (limited support)

Python

3.11.x

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

touma-I avatar Sep 10 '24 07:09 touma-I

@dolfim-ibm : deepsearch-toolkit 1.0.0 requires platformdirs<4.0.0,>=3.5.1, but the ray runtime prefers 4.3.2 .

This was fixed yesterday. A new install should pick up deepsearch-toolkit 1.0.1 directly, which fixes it.

@dolfim-ibm : I noticed that pdf2parquet depends on docling==1.7.0 and doc_chunk depends on docling>=1.8.2,<2.0.0. In the requirements for the notebook, I changed pdf2parquet dependency to docling>=1.7.0

Yes, I think it should be good to go with docling>=1.7.0,<2.0.0.
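To make the two version conflicts in this thread concrete, here is a minimal sketch that checks a version against a comma-separated constraint. It is a deliberately simplified stand-in and not how pip resolves dependencies: it only handles purely numeric versions and the operators that appear here, and `parse` and `satisfies` are hypothetical helper names. The version numbers are the ones quoted above.

```python
import operator

# Two-character operators come first so ">=" is not mistaken for ">".
OPERATORS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def parse(version: str) -> tuple[int, ...]:
    """Turn '1.8.2' into (1, 8, 2); assumes purely numeric components."""
    return tuple(int(part) for part in version.strip().split("."))


def satisfies(version: str, spec: str) -> bool:
    """Check a version against a spec such as '>=1.7.0,<2.0.0'."""
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in OPERATORS.items():
            if clause.startswith(symbol):
                if not compare(parse(version), parse(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


if __name__ == "__main__":
    # The pdf2parquet pin (docling==1.7.0) cannot satisfy the doc_chunk range.
    print(satisfies("1.7.0", ">=1.8.2,<2.0.0"))   # False
    # With the relaxed range, docling 1.8.2 satisfies both transforms.
    print(satisfies("1.8.2", ">=1.7.0,<2.0.0"))   # True
    # platformdirs 4.3.2 falls outside the deepsearch-toolkit 1.0.0 range.
    print(satisfies("4.3.2", "<4.0.0,>=3.5.1"))   # False
```

For real projects, the `packaging` library implements the full version-specifier rules.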

dolfim-ibm avatar Sep 10 '24 07:09 dolfim-ibm

Regarding the models download, I'm able to reproduce it. Can you please try again with the latest version of the branch?

dolfim-ibm avatar Sep 11 '24 15:09 dolfim-ibm

@dolfim-ibm We still have the same problem even when using the latest release. Looking at the changes, I don't see how they would have addressed this problem. Please advise. Thanks

-        num_tables = len(doc.output.tables if doc.output.tables is not None else 0)
-        num_doc_elements = len(
-            doc.output.main_text if doc.output.main_text is not None else 0
-        )
+        num_tables = len(doc.output.tables) if doc.output.tables is not None else 0
+        num_doc_elements = len(doc.output.main_text) if doc.output.main_text is not None else 0
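For readers following the diff, the sketch below reproduces what the change does in isolation. It moves the None check outside `len()`, which avoids a `TypeError` when a field is None; nothing in it touches model downloading. `count_old` and `count_new` are hypothetical names used only for this illustration.

```python
def count_old(items):
    # Pre-change shape: the conditional sits inside len(), so None becomes len(0).
    return len(items if items is not None else 0)


def count_new(items):
    # Post-change shape: len() is only called when items is not None.
    return len(items) if items is not None else 0


if __name__ == "__main__":
    print(count_new(None))          # 0
    print(count_new(["a", "b"]))    # 2
    try:
        count_old(None)
    except TypeError as err:
        # len() of an int is not defined, so the old form fails on None input.
        print("old form raised TypeError:", err)
```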
 

touma-I avatar Sep 11 '24 17:09 touma-I

https://github.com/sujee/data-prep-kit/commit/08024dc3b049ca69bf4ffa84352754867dbd3f79 makes the required changes.

Related: #585

sujee avatar Sep 11 '24 22:09 sujee

@sujee @touma-I I think this is now resolved, can you please confirm?

dolfim-ibm avatar Sep 20 '24 14:09 dolfim-ibm

I have made the necessary changes on my branch and will submit a PR soon.

sujee avatar Sep 20 '24 17:09 sujee