CeWL
Feature request: Follow Subdomains
I noticed that CeWL doesn't follow subdomains.
cewl http://www.domain.com
does not traverse into http://sub.domain.com
cewl http://domain.com
does not work. Neither does
cewl http://*.domain.com
It would be nice to have that as an additional feature.
Thanks! Christian
That is deliberate. CeWL sticks to the domain it has been asked to spider unless you set the flag to let it go off site. This is to stop it going mad and spidering the whole internet.
It may be possible to work out subdomains, but it could get messy and very wide very quickly, so it's not something I'm likely to implement.
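For reference, going off-site is already possible with the -o flag; a minimal example, assuming CeWL's usual -d (depth) and -w (output file) options:

cewl -o -d 2 -w words.txt http://www.domain.com

With -o set the spider follows links wherever they point, so it's worth keeping the depth low.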
I think it would be OK if CeWL followed links that lead to subdomains. There's nothing wrong with spidering www.mydomain.com, summer.mydomain.com and winter.mydomain.com.
It would be something in between staying within the domain and going wild with the -o option. And if you think about it, subdomains are just like subdirectories.
On some sites they are like subdirectories, but on others they are completely different sites.
I'll have a think about it; part of it depends on how easy the spider is to manipulate to get it to understand subdomains.
I've been thinking about this, and trying to work out parentage is probably going to be too hard. Working out where the domain ends and the TLD starts could get messy and result in scans either exploding or being a lot shorter than expected.
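If someone did want to experiment, the usual answer to the where-does-the-TLD-start question is the Public Suffix List. A rough sketch using the public_suffix gem; the helper name is made up and none of this is in CeWL:

require 'public_suffix'
require 'uri'

# Returns true when two URLs share the same registrable domain (eTLD+1).
# summer.mydomain.com and winter.mydomain.com both reduce to mydomain.com,
# while co.uk hosts are handled correctly because the Public Suffix List
# knows co.uk is a suffix, not a registrable domain.
def same_registrable_domain?(url_a, url_b)
  a = PublicSuffix.domain(URI.parse(url_a).host.to_s)
  b = PublicSuffix.domain(URI.parse(url_b).host.to_s)
  !a.nil? && a == b
rescue URI::InvalidURIError
  false
end

puts same_registrable_domain?('http://www.mydomain.com', 'http://summer.mydomain.com') # true
puts same_registrable_domain?('http://one.co.uk', 'http://two.co.uk')                  # false

Even with that, the concern stands: eTLD+1 matching means a scan of www.domain.com fans out to every sibling subdomain it finds links to.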
@digininja, what if the user could enable crawling of subdomains by changing the --allowed option to match its regular expression against the full URL rather than only the path?
diff --git a/cewl.rb b/cewl.rb
index f9dfe02..7fd7f78 100755
--- a/cewl.rb
+++ b/cewl.rb
@@ -811,8 +811,8 @@ catch :ctrl_c do
 				allow = false
 			end
 
-			if allowed_pattern && !a_url_parsed.path.match(allowed_pattern)
-				puts "Excluding path: #{a_url_parsed.path} based on allowed pattern" if verbose
+			if allowed_pattern && !a_url_parsed.to_s.match(allowed_pattern)
+				puts "Excluding URL: #{a_url_parsed.to_s} based on allowed pattern" if verbose
 				allow = false
 			end
 		end
Then the user could set the -o option along with something like "--allowed='(http(s|):\/\/domain.com|.*\.domain.com|^domain.com)($|\/.*)|^\/.*'" to allow crawling of the original URL's subdomains as well as relative paths. It's easy to mess up a regex like this and visit an unintended site, I'll admit, but by enabling the -o option the user is explicitly accepting responsibility for offsite spidering.
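To sanity-check the idea, here is that pattern run against a few candidate URLs in plain Ruby, nothing CeWL-specific. The literal dots are escaped here (the version above would also match, say, domainXcom) and (s|) is rewritten as s?:

# The proposed allow pattern, dots escaped.
allowed_pattern = %r{(https?://domain\.com|.*\.domain\.com|^domain\.com)($|/.*)|^/.*}

[
  'http://domain.com/about',      # allowed: the bare domain
  'https://sub.domain.com/page',  # allowed: a subdomain
  '/relative/path',               # allowed: relative links
  'https://evil.example/',        # excluded: offsite
  'https://evil.example/?u=http://domain.com/' # allowed by accident: the
                                  # unanchored match finds the embedded URL
].each do |url|
  puts "#{url.match(allowed_pattern) ? 'allowed ' : 'excluded'} #{url}"
end

The last case is the kind of slip admitted above: anchoring the host part, e.g. ^https?://([^/]*\.)?domain\.com($|/.*), would tighten it while keeping the |^/.* branch for relative links.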