Document is not recognized as text document : disable document check

Open Snikch63200 opened this issue 1 year ago • 4 comments

Hello,

Sometimes, documents that contain images are not recognized as text documents. For this case, paperqa recommends disabling the document check.

Could not read Auchy les mines (62) - Haisnes-lez-la Bassée.pdf This does not look like a text document: ./Papers/Auchy les mines (62) - Haisnes-lez-la Bassée.pdf. Pass disable_check to ignore this error.

How can I disable document checking to prevent this error?

Best regards.

Snikch63200 avatar Oct 09 '24 09:10 Snikch63200

To disable document checking and prevent the error, set the disable_doc_valid_check parameter to True in your configuration. Here's how you can do it:

from paperqa.settings import ParsingSettings

# Create a ParsingSettings instance with document validity check disabled
parsing_settings = ParsingSettings(disable_doc_valid_check=True)

This will bypass the document validity check, allowing documents with images to be processed without raising an error [1].

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Oct 09 '24 09:10 dosubot[bot]

@dosu

the error seems to persist even if

parsing_settings = ParsingSettings(disable_doc_valid_check=True)

is passed.

DGoettlich avatar Oct 10 '24 07:10 DGoettlich

Hi @DGoettlich - you need to pass the setting:

from paperqa import Settings, Docs
settings = Settings()
settings.parsing.disable_doc_valid_check = True

d = Docs()

d.add(...., settings=settings)
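The point is that a settings object only takes effect when the call that reads the document actually receives it. A minimal sketch of that behavior, using toy stand-ins: `ToyParsingSettings`, `ToySettings`, and `toy_add` are hypothetical names invented for illustration, not paper-qa classes, and the text check here is deliberately crude.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyParsingSettings:
    # Hypothetical stand-in for a parsing config; not paper-qa's class.
    disable_doc_valid_check: bool = False


@dataclass
class ToySettings:
    # Hypothetical stand-in for a top-level settings object.
    parsing: ToyParsingSettings = field(default_factory=ToyParsingSettings)


def toy_add(text: str, settings: Optional[ToySettings] = None) -> str:
    """Toy 'add' that falls back to default settings when none are passed."""
    effective = settings if settings is not None else ToySettings()
    # Crude placeholder check: does the input contain any letters at all?
    looks_like_text = any(ch.isalpha() for ch in text)
    if not effective.parsing.disable_doc_valid_check and not looks_like_text:
        raise ValueError("This does not look like a text document.")
    return "added"


settings = ToySettings()
settings.parsing.disable_doc_valid_check = True

try:
    # Settings were built but not passed, so the default check still runs.
    toy_add("\x00\x01\x02")
except ValueError as err:
    print("without settings:", err)

# Passing the settings object is what switches the check off.
print("with settings:", toy_add("\x00\x01\x02", settings=settings))
```

In this toy model, building a settings object on its own changes nothing; only the call that is given `settings=settings` skips the check.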

whitead avatar Oct 10 '24 22:10 whitead

Ensure that the ParsingSettings instance with disable_doc_valid_check=True is correctly passed to the function or module responsible for processing the document. The aadd function in the Docs class is responsible for document validation and will ignore the check if this setting is properly applied [1]. Double-check that the configuration is being used where needed.

dosubot[bot] avatar Oct 10 '24 22:10 dosubot[bot]

hi @whitead and sorry for the late reply --

passing

from paperqa import Settings, Docs
settings = Settings()
settings.embedding = embedding_model
settings.parsing.disable_doc_valid_check = True

does not seem to solve the issue, I'm still getting

docs.add(doc_path, docname=doc_name, settings=settings)

File "/Users/daniel/Library/Caches/pypoetry/virtualenvs/ergot-ZjaYIWwM-py3.12/lib/python3.12/site-packages/paperqa/docs.py", line 252, in add
    return get_loop().run_until_complete(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/daniel/Library/Caches/pypoetry/virtualenvs/ergot-ZjaYIWwM-py3.12/lib/python3.12/site-packages/nest_asyncio.py", line 98, in run_until_complete
    return f.result()
           ^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
File "/Users/daniel/Library/Caches/pypoetry/virtualenvs/ergot-ZjaYIWwM-py3.12/lib/python3.12/site-packages/paperqa/docs.py", line 410, in aadd
    raise ValueError(
ValueError: This does not look like a text document: ... Pass disable_check to ignore this error.

Any idea what I might be doing wrong?

DGoettlich avatar Dec 29 '24 15:12 DGoettlich

Hello @DGoettlich, can you show a minimal repro for the issue? This check happens here: https://github.com/Future-House/paper-qa/blob/main/paperqa/docs.py#L385

Maybe your specific file is also failing on some other check.
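One way to narrow this down before building a repro is to look at the text that was actually extracted from the PDF. The sketch below is a stand-alone heuristic, not paper-qa's code: `diagnose_extracted_text`, `char_entropy`, `MIN_CHARS`, and `ENTROPY_THRESHOLD` are hypothetical names and values chosen for illustration. It separates "no usable text at all" from "text that looks like noise"; whether disabling the validity check bypasses either case in paper-qa would have to be confirmed against docs.py.

```python
import math
from collections import Counter

MIN_CHARS = 10           # assumed minimum length, for illustration only
ENTROPY_THRESHOLD = 2.5  # assumed cutoff in bits per character


def char_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    total = len(chars)
    counts = Counter(chars)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def diagnose_extracted_text(text: str) -> str:
    """Classify extracted text as 'empty', 'too_short', 'low_entropy', or 'ok'."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if len(stripped) < MIN_CHARS:
        return "too_short"
    if char_entropy(stripped) <= ENTROPY_THRESHOLD:
        return "low_entropy"
    return "ok"
```

Run it on the first chunk of text produced by whichever PDF reader you use. A result of "empty" or "too_short" suggests the PDF is image-only (for example a scan) and may need OCR before a text-based pipeline can read it, independent of any validity-check setting.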

maykcaldas avatar Jan 23 '25 00:01 maykcaldas

I'm closing this issue for now. Feel free to reopen it if the problem remains!

maykcaldas avatar Mar 13 '25 16:03 maykcaldas