Wrong default URLs found
I noticed something in cariddi:
by default it always finds the same three URLs (/, /sitemap.xml, /robots.txt),
even if the server is not up.
Try:
echo http://localhost:5465/ | cariddi
echo http://localhost:9999/ | cariddi
echo http://localhost:32231 | cariddi
I'm aware of this behavior. I've applied a major change in cariddi version 1.4.0.
Before 1.4.0
cariddi prints URLs in the event onResponse: whenever there is an HTTP response, it takes the associated HTTP request and prints request.URL.
Pro: printed URLs are valid
Cons: if there is a chain of redirects (even a single one), e.g. Req1 -> Req2 -> Req3 -> 200 OK, request.URL will be the URL of Req3, which may belong to a domain out of scope.
For example, target.com/login redirects to authsso.com/login, which responds with 200 OK. In this case cariddi prints authsso.com/login, even though it's out of scope.
1.4.0
cariddi prints URLs in the event onRequest: whenever there is an HTTP request, it prints the requested URL.
Pro: no more out-of-scope URLs. If a URL is requested, we are interested in it, because it was found by crawling the target(s).
Cons: at startup, for every target domain, cariddi sends requests to the root path, robots.txt and sitemap.xml to check whether they are present (this improves the results a lot). As a consequence, those three URLs are printed even when they are not valid.
@ocervell Let me know your thoughts and if you have an idea to solve this.