paper-qa
Document is not recognized as text document: disable document check
Hello,
Sometimes, documents that contain images are not recognized as text documents. For this case, paperqa recommends disabling the document check.
Could not read Auchy les mines (62) - Haisnes-lez-la Bassée.pdf This does not look like a text document: ./Papers/Auchy les mines (62) - Haisnes-lez-la Bassée.pdf. Pass disable_check to ignore this error.
How can I disable document checking to prevent this error?
Best regards.
To disable document checking and prevent the error, set the disable_doc_valid_check parameter to True in your configuration. Here's how you can do it:
from paperqa.settings import ParsingSettings
# Create a ParsingSettings instance with document validity check disabled
parsing_settings = ParsingSettings(disable_doc_valid_check=True)
This will bypass the document validity check, allowing documents with images to be processed without raising an error [1].
To continue talking to Dosu, mention @dosu.
@dosu
the error seems to persist even if
parsing_settings = ParsingSettings(disable_doc_valid_check=True)
is passed.
Hi @DGoettlich - you need to pass the setting:
from paperqa import Settings, Docs
settings = Settings()
settings.parsing.disable_doc_valid_check = True
d = Docs()
d.add(...., settings=settings)
Ensure that the ParsingSettings instance with disable_doc_valid_check=True is correctly passed to the function or module responsible for processing the document. The aadd function in the Docs class is responsible for document validation and will ignore the check if this setting is properly applied [1]. Double-check that the configuration is being used where needed.
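The point about passing the setting can be illustrated with a small, self-contained sketch. The classes and the `add` function below are hypothetical stand-ins written for this example, not paperqa's real implementation; they only mirror the attribute names used in this thread. The sketch shows that a `ParsingSettings` object created on its own has no effect, while a `Settings` object passed to the call does.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParsingSettings:
    # Hypothetical stand-in; only the field name is taken from this thread.
    disable_doc_valid_check: bool = False


@dataclass
class Settings:
    # Hypothetical stand-in that nests the parsing options.
    parsing: ParsingSettings = field(default_factory=ParsingSettings)


def add(text: str, settings: Optional[Settings] = None) -> str:
    """Toy version of an 'add document' call with a validity check."""
    settings = settings or Settings()  # falls back to defaults if nothing is passed
    if not text.strip() and not settings.parsing.disable_doc_valid_check:
        raise ValueError("This does not look like a text document")
    return "added"


# Created but never passed anywhere, so the call below still uses defaults.
orphan = ParsingSettings(disable_doc_valid_check=True)
try:
    outcome = add("")
except ValueError:
    outcome = "rejected"

# Set on a Settings object that is actually handed to the call.
settings = Settings()
settings.parsing.disable_doc_valid_check = True
passed = add("", settings=settings)

print(outcome, passed)
```

If the real call still raises with the setting passed this way, the cause is probably elsewhere, for example in the installed version or in a different code path.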
hi @whitead and sorry for the late reply --
passing
from paperqa import Settings, Docs
settings = Settings()
settings.embedding = embedding_model
settings.parsing.disable_doc_valid_check = True
does not seem to solve the issue; I'm still getting
docs.add(doc_path, docname=doc_name, settings=settings)
  File "/Users/daniel/Library/Caches/pypoetry/virtualenvs/ergot-ZjaYIWwM-py3.12/lib/python3.12/site-packages/paperqa/docs.py", line 252, in add
    return get_loop().run_until_complete(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/Library/Caches/pypoetry/virtualenvs/ergot-ZjaYIWwM-py3.12/lib/python3.12/site-packages/nest_asyncio.py", line 98, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/Users/daniel/Library/Caches/pypoetry/virtualenvs/ergot-ZjaYIWwM-py3.12/lib/python3.12/site-packages/paperqa/docs.py", line 410, in aadd
    raise ValueError(
ValueError: This does not look like a text document: ... Pass disable_check to ignore this error.
Any idea what I might be doing wrong?
Hello @DGoettlich, can you show a minimal repro for the issue? This check happens here: https://github.com/Future-House/paper-qa/blob/main/paperqa/docs.py#L385
Maybe your specific file is also failing on some other check.
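When putting together a repro, it can help to look at the text that was actually extracted from the PDF. The sketch below is a rough, standard-library-only heuristic for judging whether a string "looks like text" by its character entropy. It is an assumption for illustration only: the function names and the 2.5 threshold are made up here and are not claimed to match the check in paperqa's docs.py.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    total = len(chars)
    counts = Counter(chars)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_text(text: str, threshold: float = 2.5) -> bool:
    # The threshold is an illustrative value, not taken from paper-qa.
    return char_entropy(text) > threshold


# Ordinary prose uses many different characters, so its entropy is high.
sample_prose = "Excavation report for the site, with trenches, finds and dating."

# A scanned page often yields almost nothing, or one repeated filler symbol.
sample_scan = "." * 50 + "\n\n"

print(looks_like_text(sample_prose), looks_like_text(sample_scan))
```

A scanned or image-only PDF typically produces an empty or highly repetitive string, which a heuristic of this kind rejects. In that case the file needs OCR before it can be used as a text source, whatever the check setting is.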
I'm closing this issue for now. Feel free to reopen it if the problem remains!