CeWL
Feature request: Follow Subdomains
I noticed that CeWL doesn't follow subdomains.
cewl http://www.domain.com
does not traverse into http://sub.domain.com
cewl http://domain.com
does not work. Neither does
cewl http://*.domain.com
It would be nice to have that as an additional feature.
Thanks! Christian
That is deliberate. CeWL sticks to the domain it has been asked to spider unless you set the flag to let it go off site. This is to stop it going mad and spidering the whole internet.
It may be possible to work out subdomains, but it could get messy and very wide very quickly, so it's not something I'm likely to implement.
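For reference, going off-site is already possible with the -o flag; a minimal example, assuming CeWL's usual -d (depth) and -w (output file) options:

cewl -o -d 2 -w words.txt http://www.domain.com

With -o set the spider follows links wherever they point, so it's worth keeping the depth low.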
I think it would be OK if CeWL followed links that lead to subdomains. There's nothing wrong with spidering www.mydomain.com, summer.mydomain.com and winter.mydomain.com.
It would be something in between staying within the domain and going wild with the -o option. And if you think about it, subdomains are just like subdirectories.
On some sites they are like subdirectories, but on others they are completely different sites.
I'll have a think about it; part of it depends on how easy the spider is to manipulate to get it to understand subdomains.
I've been thinking about this, and trying to work out parentage is probably going to be too hard. Working out where the domain ends and the TLD starts could get messy and result in scans either exploding or being a lot shorter than expected.
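If someone did want to experiment, the usual answer to the where-does-the-TLD-start question is the Public Suffix List. A rough sketch using the public_suffix gem; the helper name is made up and none of this is in CeWL:

require 'public_suffix'
require 'uri'

# Returns true when two URLs share the same registrable domain (eTLD+1).
# summer.mydomain.com and winter.mydomain.com both reduce to mydomain.com,
# while co.uk hosts are handled correctly because the Public Suffix List
# knows co.uk is a suffix, not a registrable domain.
def same_registrable_domain?(url_a, url_b)
  a = PublicSuffix.domain(URI.parse(url_a).host.to_s)
  b = PublicSuffix.domain(URI.parse(url_b).host.to_s)
  !a.nil? && a == b
rescue URI::InvalidURIError
  false
end

puts same_registrable_domain?('http://www.mydomain.com', 'http://summer.mydomain.com') # true
puts same_registrable_domain?('http://one.co.uk', 'http://two.co.uk')                  # false

Even with that, the concern stands: eTLD+1 matching means a scan of www.domain.com fans out to every sibling subdomain it finds links to.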
@digininja, what if the user could enable crawling of subdomains by changing the --allowed option to match its regular expression against the full URL rather than only the path?
diff --git a/cewl.rb b/cewl.rb
index f9dfe02..7fd7f78 100755
--- a/cewl.rb
+++ b/cewl.rb
@@ -811,8 +811,8 @@ catch :ctrl_c do
 				allow = false
 			end
 
-			if allowed_pattern && !a_url_parsed.path.match(allowed_pattern)
-				puts "Excluding path: #{a_url_parsed.path} based on allowed pattern" if verbose
+			if allowed_pattern && !a_url_parsed.to_s.match(allowed_pattern)
+				puts "Excluding URL: #{a_url_parsed.to_s} based on allowed pattern" if verbose
 				allow = false
 			end
 		end
Then the user could set the -o option along with something like "--allowed='(http(s|):\/\/domain.com|.*\.domain.com|^domain.com)($|\/.*)|^\/.*'" to allow crawling of the original URL's subdomains as well as relative paths. It's easy to mess up a regex like this and visit an unintended site, I'll admit, but by enabling the -o option the user is explicitly accepting responsibility for offsite spidering.
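To sanity-check the idea, here is that pattern run against a few candidate URLs in plain Ruby, nothing CeWL-specific. The literal dots are escaped here (the version above would also match, say, domainXcom) and (s|) is rewritten as s?:

# The proposed allow pattern, dots escaped.
allowed_pattern = %r{(https?://domain\.com|.*\.domain\.com|^domain\.com)($|/.*)|^/.*}

[
  'http://domain.com/about',      # allowed: the bare domain
  'https://sub.domain.com/page',  # allowed: a subdomain
  '/relative/path',               # allowed: relative links
  'https://evil.example/',        # excluded: offsite
  'https://evil.example/?u=http://domain.com/' # allowed by accident: the
                                  # unanchored match finds the embedded URL
].each do |url|
  puts "#{url.match(allowed_pattern) ? 'allowed ' : 'excluded'} #{url}"
end

The last case is the kind of slip admitted above: anchoring the host part, e.g. ^https?://([^/]*\.)?domain\.com($|/.*), would tighten it while keeping the |^/.* branch for relative links.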