
name 'reader' is not defined in your sample code

Open Manamama opened this issue 1 year ago • 1 comments

I have installed docling 2.8.3 (pip reported "Successfully installed docling-2.8.3") together with the LlamaIndex Docling reader from: https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/readers/llama-index-readers-docling

The first code sample works, but the second one does not:

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.docling import * 
 
dir_reader = SimpleDirectoryReader(
    input_dir="./Moderation probs",
    file_extractor={".pdf": reader},
)
docs = dir_reader.load_data()
print(docs[0].metadata)
# > {'file_path': '/path/to/docs/2408.09869v3.pdf',
# >  'file_name': '2408.09869v3.pdf',
# >  'file_type': 'application/pdf',
# >  'file_size': 5566574,
# >  'creation_date': '2024-10-06',
# >  'last_modified_date': '2024-10-03'}

Result in IPython:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[5], line 6
      1 from llama_index.core import SimpleDirectoryReader
      2 from llama_index.readers.docling import * 
      4 dir_reader = SimpleDirectoryReader(
      5     input_dir="... /Moderation probs",
----> 6     file_extractor={".pdf": reader},
      7 )
      8 docs = dir_reader.load_data()
      9 print(docs[0].metadata)

NameError: name 'reader' is not defined

In [6]: 
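
For anyone landing here with the same NameError: the second snippet appears to rely on a `reader` object created in the preceding snippet, and the wildcard import does not define it. A minimal sketch of a standalone version, assuming `llama-index-readers-docling` and a compatible `llama-index-core` are installed. The helper name `build_dir_reader` is hypothetical, and the imports are deferred into the function so that defining it does not require the packages:

```python
def build_dir_reader(input_dir: str):
    """Return a SimpleDirectoryReader that parses PDFs with Docling.

    Hypothetical helper for illustration; not part of either library.
    """
    # Imported lazily so that merely defining this helper does not fail
    # when the llama-index packages are missing or mismatched.
    from llama_index.core import SimpleDirectoryReader
    from llama_index.readers.docling import DoclingReader

    # This is the object the failing snippet refers to as `reader`.
    reader = DoclingReader()

    return SimpleDirectoryReader(
        input_dir=input_dir,
        file_extractor={".pdf": reader},
    )
```

It would be used as `docs = build_dir_reader("./Moderation probs").load_data()`. If the imports themselves fail, the cause is more likely a version mismatch than a missing name.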

I have also tried pip install llama-index-readers-file, but the problem persists:

Error loading data: `llama-index-readers-file` package not found

Manamama avatar Dec 05 '24 17:12 Manamama

Oh, Perplexity AI found it.

One needs to fix these mismatched package versions:

llama-index                                  0.10.19
llama-index-agent-openai                     0.1.5
llama-index-cli                              0.1.9
llama-index-core                             0.10.68.post1
llama-index-embeddings-azure-openai          0.1.5
llama-index-embeddings-huggingface           0.1.4
llama-index-embeddings-openai                0.1.6
llama-index-indices-managed-llama-cloud      0.1.4
llama-index-legacy                           0.9.48
llama-index-llms-azure-openai                0.1.5
llama-index-llms-huggingface                 0.1.4
llama-index-llms-openai                      0.1.9
llama-index-multi-modal-llms-openai          0.1.4
llama-index-program-openai                   0.1.4
llama-index-question-gen-openai              0.1.3
llama-index-readers-docling                  0.3.0
llama-index-readers-file                     0.1.9
llama-index-readers-llama-parse              0.1.3
llama-index-retrievers-bm25                  0.1.3

via pip install --upgrade llama-index first.

Now:

python test1.py 
Fetching 9 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 30590.55it/s]
Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.

etc. as it should be.

Anyway, please consider mentioning this somewhere in the documentation:

Compatibility Issue with llama-index and Related Packages

Problem

Users may encounter errors related to missing packages or functionality when using DoclingReader or other components of the llama-index ecosystem. Specifically, you might see an error like:

Error loading data: 'llama-index-readers-file' package not found

Cause

This issue often arises due to version mismatches between llama-index, llama-index-core, and related reader packages. For example, having an older version of llama-index while using a newer version of llama-index-core can lead to compatibility problems.

Solution

To resolve this issue, follow these steps:

Upgrade llama-index: Ensure that you have the latest version of llama-index installed. Run the following command in your terminal:

pip install --upgrade llama-index

Verify Installed Packages: After upgrading, check the installed versions of relevant packages to ensure they are compatible. You can do this with:

pip list | grep llama-index

Ensure that the versions of llama-index, llama-index-core, and any reader packages (like llama-index-readers-docling and llama-index-readers-file) are aligned.
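
The verification step can also be done from Python, which may help on platforms where grep is not available. A minimal sketch using only the standard library's `importlib.metadata`. The function name `llama_index_versions` is hypothetical, and it only lists versions; it does not decide whether they are compatible:

```python
from importlib.metadata import distributions


def llama_index_versions(prefix: str = "llama-index") -> dict[str, str]:
    """Map each installed distribution whose name starts with `prefix` to its version."""
    found = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        # Broken installs may carry no name; skip those entries.
        if name and name.lower().replace("_", "-").startswith(prefix):
            found[name] = dist.version
    # Sort by name so the output is easy to compare between environments.
    return dict(sorted(found.items()))


if __name__ == "__main__":
    for name, version in llama_index_versions().items():
        print(f"{name:45} {version}")
```

Running it before and after the upgrade makes it easy to spot which packages actually changed.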

Manamama avatar Dec 05 '24 17:12 Manamama

@Manamama Thanks for reporting, and glad you could figure out the problem. I will close this issue as resolved.

cau-git avatar Dec 18 '24 12:12 cau-git

Yes, but I suggest adding the pip install --upgrade llama-index step somewhere in the script.

Speaking of which: this time I am trying the main pip -v install docling on a fresh MSYS2 install, and I have déjà vu from Termux, where even this last method did not work: https://numpy.org/doc/2.2/building/index.html. See https://github.com/pypdfium2-team/pypdfium2/issues/332, as a hack of a hack (scroll down in that issue for the author's extra tip) may be needed again ...

Manamama avatar Dec 18 '24 13:12 Manamama

Note that the trick in https://github.com/pypdfium2-team/pypdfium2/issues/332#issuecomment-2546357309 is hardcoded for a specific build (android arm64, pdfium 6462). However, you may be able to use the same strategy for other platforms by replacing link/version/platform accordingly.

I'm not sure about msys2, but provided the platform is recognized as windows, doing PDFIUM_BINDINGS=reference pip install -v . on pypdfium2 might work? If not, please file a bug report at pypdfium2 again, with the log and platform info.

mara004 avatar Dec 18 '24 17:12 mara004

If anybody is wondering how to install it in Termux: I have compiled a survival guide, see here: https://github.com/Manamama/Ubuntu_Scripts_1/blob/main/docs/module_patches_in_Termux.md

Manamama avatar Jun 11 '25 09:06 Manamama