[Question]: Why are addresses which resolve to internal IPs considered invalid URLs?

Open suss-mcx opened this issue 7 months ago • 5 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly (非英文标题的提交将会被直接关闭) (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

Describe your problem

Regarding the function is_valid_url, why are addresses which resolve to private IPs considered invalid?

In my case, I'm running RAGFlow in a network with an internal DNS, so I'm unable to crawl webpages hosted on the same network without some awful workarounds (like hardcoding a public IP in the machine's /etc/hosts).

Is there a reason for this limitation, and if not, would you be willing to consider relaxing it?

suss-mcx avatar Jun 12 '25 11:06 suss-mcx

Hi @suss-mcx! I'm Dosu and I’m helping the ragflow team.

Internal IPs are considered invalid by the is_valid_url function because the implementation is designed to prevent security risks like Server-Side Request Forgery (SSRF). The function checks whether a URL resolves to a private IP (using the ipaddress module) and rejects it to avoid unintentionally exposing internal network resources to external requests. This is a common security practice in web applications that accept user-supplied URLs, especially in environments where the application could be exposed to untrusted input. You can see this logic in the code itself, where any URL resolving to a private IP is rejected as an invalid link.

There doesn't appear to be any project-specific documentation or discussion that explains this decision in detail, but the approach aligns with standard security best practices. If you need to allow internal URLs in a trusted environment, you would need to modify or bypass this check, but be aware that doing so could introduce security risks if the system is ever exposed to untrusted input.
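
For illustration only, here is a minimal sketch of what this kind of SSRF guard commonly looks like. The helper names is_valid_url and is_private_ip follow this thread; the actual RAGFlow implementation may differ in its details:

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_private_ip(ip: str) -> bool:
    """Return True if the IP lies in a private range (10.x, 172.16-31.x, 192.168.x, ...)."""
    try:
        return ipaddress.ip_address(ip).is_private
    except ValueError:
        return False


def is_valid_url(url: str) -> bool:
    """Accept only http(s) URLs whose hostname does not resolve to a private IP."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        resolved_ip = socket.gethostbyname(parsed.hostname)
    except socket.gaierror:
        return False
    # Reject anything that resolves into the internal network to limit SSRF.
    return not is_private_ip(resolved_ip)
```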

To reply, just mention @dosu.



dosubot[bot] avatar Jun 12 '25 11:06 dosubot[bot]

A private IP should be considered valid by is_valid_url.

What does the IP look like?

KevinHuSh avatar Jun 13 '25 06:06 KevinHuSh

@KevinHuSh In my case the hostname resolves to 172.22.150.251, and the crawler fails with "Invalid URL". The mentioned is_valid_url function explicitly calls another function from the same file, is_private_ip, and returns False when that check matches.
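
As a quick check (not RAGFlow code, just the standard-library behaviour such a check would rely on), 172.22.150.251 sits inside the RFC 1918 range 172.16.0.0/12, so Python's ipaddress module classifies it as private:

```python
import ipaddress

addr = ipaddress.ip_address("172.22.150.251")
print(addr.is_private)                                # True
print(addr in ipaddress.ip_network("172.16.0.0/12"))  # True
```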

suss-mcx avatar Jun 13 '25 07:06 suss-mcx

It should start with http:// or https://, shouldn't it? Give it a try.

KevinHuSh avatar Jun 13 '25 07:06 KevinHuSh

@KevinHuSh Well yes, it does. I'm using a hostname WITH a scheme (like https://); the fact that it resolves to a private IP is what breaks crawling. As I mentioned, overriding the system's DNS so the hostname resolves to a public IP fixes the problem, so the issue is definitely related to the private IP detection.

suss-mcx avatar Jun 13 '25 07:06 suss-mcx