[BUG]: Document processing API is not online (bulk file uploads)
How are you running AnythingLLM?
Docker (local)
What happened?
When uploading files into documents in bulk, I receive the error message "document processing API is not online" at random on different files as they're being uploaded.
In experimentation, I selected 8 PDF files that were all over 300 MB each. One of the 8 failed with the above error. If I wait for the other 7 to complete and then re-upload the one that failed, it uploads successfully.
In small batches this is manageable, as I can pinpoint the one that failed and re-upload it. However, in bulk testing, multiple files fail and it's impossible to keep track of which ones succeeded and which ones failed, so the only solution I've found is to delete all the files and re-upload them 4 to 6 at a time (which takes HOURS when uploading hundreds of documents).
-
It appears as if the API that manages the upload is limited in the number of documents it can process at one time, and/or if it tries to start an upload while the API is busy handling other files, it fails as "not online".
-
If a file fails, the system doesn't appear to try uploading it again. It just errors out, and the user must track which file failed and re-submit it after the queue has finished. This is next to impossible with bulk uploads.
A) It would be nice, when uploading files in bulk or uploading large files, to be able to control how many documents are processed at once. For example, if I am uploading 1,500 PDF files, a setting to limit the processor to no more than 4 documents at a time would help minimize the failures and make it easier to track which files failed on upload.
B) It would be nice if a log file or report were produced after a bulk upload listing which files failed and which were successful. This would make it easier to identify which files need to be re-uploaded.
C) During the upload process, if a file fails because the API is unavailable, have the system automatically try the file again: either move it to the bottom of the queue and retry, or retry automatically and only fail after X attempts. (A rough client-side sketch of A through C is included after this list.)
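For illustration, here is a minimal client-side sketch of A through C as a workaround: uploads are capped at a few at a time, retried with a backoff on failure, and a failure report is printed at the end. It assumes the developer API's `POST /api/v1/document/upload` endpoint with a Bearer API key; the base URL, port, and response handling are assumptions, so adjust them for your instance.

```python
# Hedged sketch: concurrency-limited bulk upload with retries.
# Assumes the developer API endpoint POST /api/v1/document/upload;
# BASE_URL, API_KEY, and the response handling are placeholders/assumptions.
import pathlib
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

BASE_URL = "http://localhost:3001/api/v1"   # assumption: default Docker port
API_KEY = "YOUR_API_KEY"                    # placeholder
MAX_CONCURRENT = 4                          # request A: cap parallel uploads
MAX_ATTEMPTS = 3                            # request C: retry before giving up

def upload(path: pathlib.Path) -> tuple[pathlib.Path, bool, str]:
    """Upload one file, retrying on failure with a short backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            with path.open("rb") as fh:
                resp = requests.post(
                    f"{BASE_URL}/document/upload",
                    headers={"Authorization": f"Bearer {API_KEY}"},
                    files={"file": (path.name, fh)},
                    timeout=600,
                )
            if resp.ok:
                return path, True, ""
            last_error = f"HTTP {resp.status_code}: {resp.text[:200]}"
        except requests.RequestException as exc:
            last_error = str(exc)
        time.sleep(10 * attempt)  # back off before retrying
    return path, False, last_error

files = sorted(p for p in pathlib.Path("./to_upload").iterdir() if p.is_file())
failures = []
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    for future in as_completed(pool.submit(upload, f) for f in files):
        path, ok, error = future.result()
        print(f"{'OK  ' if ok else 'FAIL'} {path.name} {error}")
        if not ok:
            failures.append((path, error))

# Request B: a simple report of which files never made it.
print(f"\n{len(failures)} of {len(files)} files failed:")
for path, error in failures:
    print(f"  {path.name}: {error}")
```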
Thank you.
Are there known steps to reproduce?
Windows running Docker version, upload 100+ large (30mb+) documents into the document manager.
When you upload this many files are you using the built-in CPU embedder or something external like ollama or openai?
Built in embedder.
Then this is likely arising from resource constraints, as the local embedder runs on CPU only and, depending on the document chunk throughput, could be crashing or failing to allocate. It is unrelated to the retry mechanism proposed, but swapping to something like Ollama or OpenAI may alleviate it, since those can run off-machine or use the GPU on the device.
I've switched it over to Ollama and am rebuilding the embeddings now (going to take a while). Once this completes, I will try uploading another batch of PDFs and see what happens. I'll post back on whether this fixed the issue.
Alright, after switching to Ollama, I am still getting "document processing API is not online" during bulk uploads. Granted, there don't seem to be nearly as many of these errors, but in a batch upload of around 900 PDF/TXT files, I've seen the API-offline error come up about 6 times now and counting. The next issue (as described initially): once the upload finishes, I will have to delete everything I just uploaded because I can't isolate which files failed versus which ones were successful. The failures do seem to be related to the number of documents being processed at once / the CollectorApi being busy.
Another item to note: I went through the log files to see if I could isolate an error containing the words "document processing API is not online". Interestingly enough, there is no log entry with this exact phrase. Searching for "not online" produces no results. The only reference in the logs (which I can't fully confirm is for this exact error; it is repeated a few times through the logs for different files) is:
2025-01-01 13:02:15 [backend] info: [CollectorApi] Document Cook_better_food.pdf uploaded processed and successfully. It is now available in documents.
2025-01-01 13:02:15 [backend] info: [CollectorApi] fetch failed
2025-01-01 13:02:15 [backend] info: [CollectorApi] fetch failed
2025-01-01 13:02:15 [backend] info: [CollectorApi] fetch failed
2025-01-01 13:02:15 [backend] info: [TELEMETRY SENT] {"event":"document_uploaded","distinctId":"08fe1348-286a-4313-9d72-f6d357f86f90","properties":{"runtime":"docker"}}
This portion of the log may not be fully relevant to the error I am seeing on the front-end, as the front-end error doesn't correlate to any direct reference in the backend logs that I can see. It would be nice if the error message were changed from "document processing API is not online" to "document processing API is offline", as it would make searching the logs for "offline"-related failures a little easier. Even so, I've gone through the logs line by line (searching for the word "failed") and can't find anything that directly shows this specific error (API is not online) is even happening.
From the front-end, I see 3 different errors at random times.
- Text content was empty for document_Name.pdf (I know what this error means; it isn't important or related to this topic).
- document processing API is not online
- fetch failed (doesn't show the API is not online error, just says fetch failed)
What I am unsure about in the logs: when I see (for example) "2025-01-01 13:02:15 [backend] info: [CollectorApi] fetch failed", is this a log entry for error #3 only, or is it the recorded entry for both #2 and #3?
For this upload experiment, I uploaded 961 files (half PDF, the other half TXT), and 829 were successfully uploaded. That would indicate 132 files failed to process/upload because of one of the 3 errors previously mentioned. I have found no easy method to isolate which files failed due to error #1 versus errors #2 or #3 (which I understand is a separate but related issue from the API-not-online one).
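One possible way to approximate the report requested in B is to diff the upload folder against what the document manager reports after the batch finishes. A rough sketch, assuming the developer API exposes a `GET /api/v1/documents` endpoint returning a folder tree where each document carries its original filename as a `title` field (the endpoint path and response shape are assumptions, so adjust for your instance):

```python
# Hedged sketch: list files that never made it into the document manager.
# Assumes GET /api/v1/documents returns a localFiles folder tree whose items
# carry a "title" with the original filename -- endpoint path and response
# shape are assumptions; adjust for your instance.
import pathlib

import requests

BASE_URL = "http://localhost:3001/api/v1"  # assumption: default Docker port
API_KEY = "YOUR_API_KEY"                   # placeholder

resp = requests.get(
    f"{BASE_URL}/documents",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
resp.raise_for_status()

def collect_titles(node: dict, titles: set) -> None:
    """Walk the folder tree and gather every document's original filename."""
    for item in node.get("items", []):
        if item.get("type") == "folder":
            collect_titles(item, titles)
        elif "title" in item:
            titles.add(item["title"])

uploaded = set()
collect_titles(resp.json().get("localFiles", {}), uploaded)

local = {p.name for p in pathlib.Path("./to_upload").iterdir() if p.is_file()}
missing = sorted(local - uploaded)
print(f"{len(missing)} of {len(local)} files are not in the document manager:")
for name in missing:
    print(f"  {name}")
```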
Did you solve this?
Hello, I have the same problem.
I also have the issue of uploads blocking after sending several files, via the API or through the interface. I tried sending everything at once, and also sending one by one.
I get a "fetch failed" and a "Document Processor Unavailable" in the interface, and the only solution is to restart the Docker container to be able to upload new files.
I'm using the built-in embedder.
2025-03-03 15:20:59 [backend] info: [Event Logged] - api_document_uploaded
2025-03-03 15:21:00 [backend] info: [EncryptionManager] Loaded existing key & salt for encrypting arbitrary data.
2025-03-03 15:21:00 [collector] info: -- Working XXX.PPTX --
2025-03-03 15:21:00 [collector] info: [SUCCESS]: XXX.PPTX converted & ready for embedding.
2025-03-03 15:21:00
2025-03-03 15:21:00 [backend] info: [CollectorApi] Document XXX.PPTX uploaded processed and successfully. It is now available in documents.
2025-03-03 15:21:00 [backend] info: [TELEMETRY SENT] {"event":"document_uploaded","distinctId":"5c7b805e-d7fe-4bf3-8142-889c8ac4708f","properties":{"runtime":"docker"}}
2025-03-03 15:21:00 [backend] info: [Event Logged] - api_document_uploaded
2025-03-03 15:21:01 [backend] info: [EncryptionManager] Loaded existing key & salt for encrypting arbitrary data.
2025-03-03 15:26:02 [backend] info: [CollectorApi] fetch failed
In the API response:
Document processing API is not online. Document XY.xlsx will not be processed automatically.
I tried sending everything at once, and also sending one by one
This would seem to indicate one specific file is the issue, not all of them at the same time. Can you determine which file is the one causing the error? If it is the PPTX file you are using, can you replicate that with the same file consistently? It may just be an issue with PPTX
I tried again with the same file and indeed it worked. After analysis, I think the API gets overloaded and goes offline. I made a small program that resubmits the files in a loop until the API responds. The API seems to stay offline for a period of time; after about ten minutes the files manage to go through before it blocks again.
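A rough equivalent of that loop, assuming the same `POST /api/v1/document/upload` endpoint and Bearer API key as the earlier sketch in this thread (the wait time and base URL are placeholders):

```python
# Hedged sketch of a "retry until the API responds" loop.
# Endpoint, base URL, and wait time are assumptions/placeholders.
import pathlib
import time

import requests

BASE_URL = "http://localhost:3001/api/v1"  # assumption: default Docker port
API_KEY = "YOUR_API_KEY"                   # placeholder

def upload_until_accepted(path: pathlib.Path, wait_seconds: int = 60) -> None:
    """Keep resubmitting one file until the collector accepts it."""
    while True:
        try:
            with path.open("rb") as fh:
                resp = requests.post(
                    f"{BASE_URL}/document/upload",
                    headers={"Authorization": f"Bearer {API_KEY}"},
                    files={"file": (path.name, fh)},
                    timeout=600,
                )
            if resp.ok:
                print(f"accepted: {path.name}")
                return
            print(f"rejected ({resp.status_code}), retrying in {wait_seconds}s")
        except requests.RequestException as exc:
            print(f"collector unreachable ({exc}), retrying in {wait_seconds}s")
        time.sleep(wait_seconds)

for file in sorted(pathlib.Path("./to_upload").iterdir()):
    if file.is_file():
        upload_until_accepted(file)
```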
Well, the collector is a single thread, and if you are uploading documents that require binaries to parse (PPTX, Word, PDF) or have to run OCR (images, scanned PDFs), then what will likely occur is an OOM depending on the machine/container resources. I believe that is the root cause here, since that would crash the collector, which would then become unresponsive.
I suppose this is also possible with many, many large text files, since they need to be opened, read, and then processed. It is just simple IO, but it can still cause issues during processing.