[BUG]: File Parsing Fails for URLs Without Explicit File Extensions
How are you running AnythingLLM?
All versions
What happened?
When attempting to pull and parse a file using either the RAG Modal or @agent mode, the process fails if the URL does not explicitly end with a file extension (e.g., .pdf, .csv). This occurs even when the server responds with a correct Content-Type header that identifies the file type.
Example:
The following URL cannot be processed, even though the server responds with a Content-Type of application/pdf:
https://arxiv.org/pdf/2307.10265
Observed Behavior:
The application logs display the following error:
[2] Error processing single file File extension .10265 not supported for parsing and cannot be assumed as text file type.
This error originates from the file extension guard located at:
https://github.com/Mintplex-Labs/anything-llm/blob/89a01492b51a23150b59732166b90ebdd1843c50/collector/processSingleFile/index.js#L58-L72
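The ".10265" in the error message is consistent with the extension being taken from the text after the last dot in the URL path. The snippet below is a minimal sketch of that behavior, not the collector's actual code: the helper name extensionFromUrl is hypothetical, and it is only an assumption that the collector derives the extension in a comparable way.

```javascript
const path = require("node:path");

// Hypothetical helper (not part of AnythingLLM): derive an "extension"
// from the URL path alone, the way a naive extension guard might.
function extensionFromUrl(url) {
  const { pathname } = new URL(url); // "/pdf/2307.10265"
  // extname returns everything from the last "." in the final path segment.
  return path.posix.extname(pathname).toLowerCase();
}

console.log(extensionFromUrl("https://arxiv.org/pdf/2307.10265")); // ".10265"
console.log(extensionFromUrl("https://example.com/report.pdf")); // ".pdf"
```

Because the arXiv identifier itself contains a dot, its trailing digits are mistaken for a file extension, and the guard rejects it as unsupported.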
Expected Behavior:
The system should pull and parse files from URLs that lack a recognizable file extension whenever the Content-Type header in the server's response clearly identifies the file's MIME type.
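One possible shape for this fallback is sketched below. It is an illustration under stated assumptions rather than a proposed patch: the function resolveExtension and the MIME_TO_EXT table are hypothetical names, and the three MIME types listed are examples only, not the collector's actual list of supported types.

```javascript
const path = require("node:path");

// Assumed, illustrative mapping; the real set of supported types may differ.
const MIME_TO_EXT = new Map([
  ["application/pdf", ".pdf"],
  ["text/csv", ".csv"],
  ["text/plain", ".txt"],
]);
const KNOWN_EXTS = new Set(MIME_TO_EXT.values());

// Hypothetical helper: prefer a recognized extension in the URL path,
// otherwise fall back to the Content-Type header of the response.
function resolveExtension(url, contentType) {
  const urlExt = path.posix.extname(new URL(url).pathname).toLowerCase();
  if (KNOWN_EXTS.has(urlExt)) return urlExt;

  // Drop parameters such as "; charset=utf-8" before the lookup.
  const mime = (contentType || "").split(";")[0].trim().toLowerCase();
  return MIME_TO_EXT.get(mime) ?? null; // null: type could not be determined
}

// The URL from this report resolves to ".pdf" via the header.
console.log(resolveExtension("https://arxiv.org/pdf/2307.10265", "application/pdf")); // ".pdf"
```

In this sketch a recognized URL extension still wins, so existing behavior for URLs such as .../report.pdf would be unchanged, and the header is consulted only when the path gives no usable answer.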
Are there known steps to reproduce?
No response
I would like to work on this, @timothycarambat.