Text inside PDF fails to be crawled

Open MohamedAmineDHIAB opened this issue 9 months ago • 0 comments

Dear WebWhiz Team,

When I try to create a chatbot and upload the following PDF, I get a 500 error code.

If I add other data files, the chatbot gets created, but it fails to answer my questions about the PDF mentioned above, replying:

I don't know the answer to that

One possible cause is that the data crawler does not support OCR and only retrieves text from PDF files that already contain embedded text. For PDF files that appear at first glance to contain text but have no embedded text layer, the crawler fails to extract any data, which results in the issues described above.
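A minimal sketch of how one might check whether a PDF has an embedded text layer before uploading it. This is not WebWhiz's crawler code. The helper names (`extract_page_texts`, `pages_without_text`, `needs_ocr`) and the `min_chars` threshold are hypothetical, and the extraction step assumes the third-party `pypdf` package is installed.

```python
from typing import List


def extract_page_texts(path: str) -> List[str]:
    """Return the embedded text of each page (empty string if a page has none)."""
    # Assumption: pypdf is installed. It is imported lazily so the
    # helpers below can be used without it.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [(page.extract_text() or "") for page in reader.pages]


def pages_without_text(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 0-based indices of pages whose embedded text is (nearly) empty."""
    # min_chars is an arbitrary illustrative threshold, not a WebWhiz setting.
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]


def needs_ocr(page_texts: List[str], min_chars: int = 20) -> bool:
    """True when no page carries meaningful embedded text (likely a scanned PDF)."""
    return len(pages_without_text(page_texts, min_chars)) == len(page_texts)
```

For example, `needs_ocr(extract_page_texts("my_file.pdf"))` returning `True` would suggest the file contains only page images, so a crawler without OCR would have nothing to index.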

I hope this is helpful for debugging the issue.

I also saw that a similar issue has been reported here: #107

MohamedAmineDHIAB, May 21 '24 09:05