RFE: Accept file:// for URI for RAGDocument
🚀 Describe the new functionality needed
Hi llama-stack team, I built an application with a portal and a backend which processes PDFs, either via a URL or a file upload to the server.
The URL path works with the RAGDocument type, but when I try to pass it a file:// URI, I get an error:
route='/v1/tool-runtime/rag-tool/insert' method='post': Request URL is missing an 'http://' or 'https://' protocol.
This is a fairly common use case for teams doing document ingestion.
💡 Why is this needed? What if we don't build it?
This is needed because I now have two different methods for PDF ingestion: one provided by llama-stack and one hand-rolled by me, which will result in differences in my vector database. Supporting file:// would also increase the scope and capability of what llama-stack can accomplish.
If you don't build it, people will need a workaround for something you largely have the capabilities for already.
Other thoughts
No response
This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
/assign
This is a legitimate bug that creates confusion for users. The bug exists in two locations:
vector_store.py:144 - The content_from_doc() function uses a regex pattern that matches file://:
pattern = re.compile("^(https?://|file://|data:)")
However, lines 148-152 then call httpx.AsyncClient().get(), which fails for file:// URLs because httpx only handles http:// and https://.
memory.py:63-68 - The raw_data_from_doc() function has the same issue - it handles data: URLs specially but passes file:// URLs directly to httpx.
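One way to address both locations would be to branch on the URI scheme before anything reaches httpx. The following is only a minimal sketch using the standard library; `classify_uri` is a hypothetical helper name, not an existing llama-stack function, and the actual fix may be structured differently.

```python
from urllib.parse import urlparse


def classify_uri(uri: str) -> str:
    """Return which loader a URI should be routed to (hypothetical helper)."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in ("http", "https"):
        return "http"   # safe to hand to an HTTP client such as httpx
    if scheme == "file":
        return "file"   # must be read from local disk, never via HTTP
    if scheme == "data":
        return "data"   # inline payload, decoded in-process
    raise ValueError(f"Unsupported URI scheme: {scheme!r}")
```

With a dispatch like this, content_from_doc() and raw_data_from_doc() could share one routing step, so a file:// URI is either read locally or rejected with a clear error instead of failing inside the HTTP client.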
Security considerations: Enabling file:// support introduces potential security risks that must be addressed:
Path Traversal Attacks: Malicious URIs like file:///etc/passwd or file:///../../../etc/shadow could expose sensitive system files
Symlink Following: Files could be symlinks to sensitive locations
Directory Listing: Requests for directories should be explicitly rejected
Network Attacks: file:// URIs pointing to network shares could be exploited
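The four risks above could be mitigated by resolving every file:// URI against an explicitly configured root directory. This is a sketch under assumptions, not the project's implementation: `resolve_file_uri` and the `allowed_root` parameter are hypothetical names, and it requires Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def resolve_file_uri(uri: str, allowed_root: Path) -> Path:
    """Map a file:// URI to a local path inside allowed_root, or raise ValueError."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError("not a file:// URI")
    # Network attacks: reject hosts such as file://server/share/doc.pdf
    if parsed.netloc not in ("", "localhost"):
        raise ValueError("remote hosts are not allowed in file:// URIs")
    # resolve() collapses ".." segments and follows symlinks, so the
    # containment check below sees the real target of the path.
    root = allowed_root.resolve()
    path = Path(unquote(parsed.path)).resolve()
    # Path traversal and symlink following: the real path must stay under root
    if not path.is_relative_to(root):
        raise ValueError("path escapes the allowed root")
    # Directory listing: only regular files are accepted
    if not path.is_file():
        raise ValueError("not a regular file")
    return path
```

Keeping the allowed root as an explicit argument means file:// support stays opt-in per deployment. Note that there is still a window between this check and the later read in which a file could be swapped, so a hardened version would want to open the file and validate the opened handle.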