Wrong default URLs found
I noticed something in cariddi:
by default it always finds the same three URLs (/, /sitemap.xml, /robots.txt),
even if the server is not up.
Try:
echo http://localhost:5465/ | cariddi
echo http://localhost:9999/ | cariddi
echo http://localhost:32231 | cariddi
I'm aware of this behavior. I've applied a major change in cariddi version 1.4.0.
Before 1.4.0
cariddi prints URLs in the event onResponse: whenever there is an HTTP response, it takes the associated HTTP request and prints request.URL.
Pro: printed URLs are valid
Cons: if there is a chain of redirects (even a single one), e.g. Req1 -> Req2 -> Req3 -> 200 OK, request.URL will be the URL of Req3, which may belong to a domain out of scope.
For example, target.com/login redirects to authsso.com/login, which responds with 200 OK. In this case cariddi prints authsso.com/login, even though it's out of scope.
1.4.0
cariddi prints URLs in the event onRequest: whenever there is an HTTP request, it prints the requested URL.
Pro: no more out-of-scope URLs. If a URL is requested, we are interested in it, because it was found by crawling the target(s).
Cons: at startup, for every target domain, cariddi sends requests to the root path, robots.txt and sitemap.xml to check whether they are present (this improves the results a lot). As a consequence, those three URLs are printed even when they are not valid.
@ocervell Let me know your thoughts and if you have an idea to solve this.