name 'reader' is not defined in your sample code
Successfully installed docling-2.8.3
I have installed this: https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/readers/llama-index-readers-docling
The first code sample works, but the second one does not:
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.docling import *
dir_reader = SimpleDirectoryReader(
input_dir="./Moderation probs",
file_extractor={".pdf": reader},
)
docs = dir_reader.load_data()
print(docs[0].metadata)
# > {'file_path': '/path/to/docs/2408.09869v3.pdf',
# > 'file_name': '2408.09869v3.pdf',
# > 'file_type': 'application/pdf',
# > 'file_size': 5566574,
# > 'creation_date': '2024-10-06',
# > 'last_modified_date': '2024-10-03'}
Result in IPython:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[5], line 6
1 from llama_index.core import SimpleDirectoryReader
2 from llama_index.readers.docling import *
4 dir_reader = SimpleDirectoryReader(
5 input_dir="... /Moderation probs",
----> 6 file_extractor={".pdf": reader},
7 )
8 docs = dir_reader.load_data()
9 print(docs[0].metadata)
NameError: name 'reader' is not defined
In [6]:
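For reference, the NameError above comes from the snippet using `reader` without ever defining it. Below is a minimal sketch of a fix, assuming the `DoclingReader` class from llama-index-readers-docling and its default constructor. The helper name `build_dir_reader` is hypothetical, and the imports sit inside the function only so that the sketch can be loaded without the packages installed:

```python
def build_dir_reader(input_dir):
    """Return a SimpleDirectoryReader that parses PDFs with Docling."""
    # Imported lazily: these need llama-index-core and
    # llama-index-readers-docling to be installed.
    from llama_index.core import SimpleDirectoryReader
    from llama_index.readers.docling import DoclingReader

    # This is the definition that the failing snippet was missing.
    reader = DoclingReader()
    return SimpleDirectoryReader(
        input_dir=input_dir,
        file_extractor={".pdf": reader},
    )
```

Calling `build_dir_reader("./Moderation probs").load_data()` should then return the documents, provided the installed llama-index packages are mutually compatible.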
I have tried:
pip install llama-index-readers-file
but same problem:
Error loading data: `llama-index-readers-file` package not found
Oh, Perplexity AI found it.
One needs to fix the following mismatched package versions:
llama-index 0.10.19
llama-index-agent-openai 0.1.5
llama-index-cli 0.1.9
llama-index-core 0.10.68.post1
llama-index-embeddings-azure-openai 0.1.5
llama-index-embeddings-huggingface 0.1.4
llama-index-embeddings-openai 0.1.6
llama-index-indices-managed-llama-cloud 0.1.4
llama-index-legacy 0.9.48
llama-index-llms-azure-openai 0.1.5
llama-index-llms-huggingface 0.1.4
llama-index-llms-openai 0.1.9
llama-index-multi-modal-llms-openai 0.1.4
llama-index-program-openai 0.1.4
llama-index-question-gen-openai 0.1.3
llama-index-readers-docling 0.3.0
llama-index-readers-file 0.1.9
llama-index-readers-llama-parse 0.1.3
llama-index-retrievers-bm25 0.1.3
via pip install --upgrade llama-index first.
Now:
python test1.py
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 30590.55it/s]
Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.
etc. as it should be.
Anyway, please consider mentioning this somewhere in the documentation:
Compatibility Issue with llama-index and Related Packages
Problem
Users may encounter errors related to missing packages or functionality when using DoclingReader or other components of the llama-index ecosystem. Specifically, you might see an error like:
Error loading data: 'llama-index-readers-file' package not found
Cause
This issue often arises due to version mismatches between llama-index, llama-index-core, and related reader packages. For example, having an older version of llama-index while using a newer version of llama-index-core can lead to compatibility problems.
Solution
To resolve this issue, follow these steps:
Upgrade llama-index: Ensure that you have the latest version of llama-index installed. Run the following command in your terminal:
pip install --upgrade llama-index
Verify installed packages: After upgrading, check the installed versions of relevant packages to ensure they are compatible. You can do this with:
pip list | grep llama-index
Ensure that the versions of llama-index, llama-index-core, and any reader packages (like llama-index-readers-docling and llama-index-readers-file) are aligned.
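The verification step can also be done from Python, which may help on shells without grep. Here is a minimal sketch using only the standard library's `importlib.metadata`. The function name `llama_index_versions` is hypothetical, and the script only lists versions, it does not judge whether they are compatible:

```python
from importlib.metadata import distributions


def llama_index_versions(prefix="llama-index"):
    """Map each installed distribution whose name starts with prefix to its version."""
    found = {}
    for dist in distributions():
        name = dist.metadata["Name"]  # None if the metadata lacks a name
        # Normalize underscores so "llama_index_core" also matches.
        if name and name.lower().replace("_", "-").startswith(prefix):
            found[name] = dist.version
    return dict(sorted(found.items()))


# Print one "name version" pair per line, similar to pip list.
for name, version in llama_index_versions().items():
    print(f"{name} {version}")
```

An empty output simply means that no matching packages are installed in the current environment.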
@Manamama Thanks for reporting, and glad you could figure out the problem. I will close this issue as resolved.
Yes, but I suggest adding that pip install --upgrade llama-index step somewhere in the script.
Speaking of which: this time I am toying with the main command, pip -v install docling, on a fresh MSYS2 install, and I have déjà vu from Termux, where even this last method did not work: https://numpy.org/doc/2.2/building/index.html. See https://github.com/pypdfium2-team/pypdfium2/issues/332: a hack of a hack (scroll down in that issue for the author's extra tip) may be needed again ...
Note that the trick in https://github.com/pypdfium2-team/pypdfium2/issues/332#issuecomment-2546357309 is hardcoded for a specific build (android arm64, pdfium 6462). However, you may be able to use the same strategy for other platforms by replacing link/version/platform accordingly.
I'm not sure about msys2, but provided the platform is recognized as windows, doing PDFIUM_BINDINGS=reference pip install -v . on pypdfium2 might work? If not, please file a bug report at pypdfium2 again, with the log and platform info.
If anybody is wondering how to install it in Termux: I have compiled a survival guide, see here: https://github.com/Manamama/Ubuntu_Scripts_1/blob/main/docs/module_patches_in_Termux.md