
Exclude & Allowed Switches Not Behaving as Expected

Open 03k64serenity opened this issue 3 years ago • 7 comments

https://github.com/digininja/CeWL/blob/280bfe6f8f57a783cf447c47cfb38ad568177d00/cewl.rb#L814

When I provide regex patterns in a file for --exclude, or on the command line for --allowed, CeWL does not exclude or allow offsite URLs according to those rules.

03k64serenity avatar Apr 20 '22 21:04 03k64serenity

Looking at that line of code, it only checks the path, not the domain. Are you expecting it to check the domain as well?

digininja avatar Apr 20 '22 21:04 digininja

Right. I'd like to be able to stop the spider from crawling certain domains, and allow it to crawl others, based on a regex.

03k64serenity avatar Apr 20 '22 21:04 03k64serenity

Not currently possible. You could easily tweak that line to check the domain instead. I don't know the property offhand, but try domain instead of path.
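As a rough illustration of the tweak suggested here, and assuming the URL being checked is parsed with Ruby's standard `URI` library (not confirmed for the linked line), the hostname is exposed as `host` rather than `domain`. The helper names `host_matches?` and `path_matches?` below are hypothetical and do not appear in cewl.rb:

```ruby
require 'uri'

# Hypothetical helpers for illustration only; not taken from cewl.rb.

# True when the URL's host matches the pattern.
# Relative URLs have no host, so they never match.
def host_matches?(url, pattern)
  host = URI.parse(url).host
  !host.nil? && host.match?(pattern)
end

# True when the URL's path matches the pattern (the path-only style of check).
def path_matches?(url, pattern)
  URI.parse(url).path.match?(pattern)
end

url = 'https://blog.example.com/admin/login'
pattern = /example\.com\z/

puts host_matches?(url, pattern) # true: the host ends in example.com
puts path_matches?(url, pattern) # false: the path is /admin/login
```

The same pattern gives different answers depending on which part of the URL it is tested against, which would explain why a domain regex appears to be ignored by a path-only check.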

digininja avatar Apr 20 '22 21:04 digininja

Sounds good. Will do. By the way, in all these years of seeing you on the interwebs I had no idea you were the author of CeWL, so I'm even more impressed and grateful for your contributions to the community.

03k64serenity avatar Apr 20 '22 21:04 03k64serenity

Glad you like it.

If you get stuck, let me know, and I'll have a look for the right property in the morning.

digininja avatar Apr 20 '22 21:04 digininja

https://github.com/spencer-dollahite/CeWL/blob/master/cewl.rb

This is the sort of approach I'd like to see: both an allowed and an exclude pattern switch, for the domain as well as the path. I know the code here isn't perfect, but I think it is close enough for demo purposes. Thoughts?
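For readers who don't want to open the fork, the general idea can be sketched as follows. This is not the code from the linked fork or from CeWL; `UrlFilter`, `permit?`, and the four pattern names are hypothetical, and the rule that excludes win over allows is an assumption made for the sketch:

```ruby
require 'uri'

# Hypothetical filter: each pattern is optional (nil means "no rule").
UrlFilter = Struct.new(:allowed_host, :exclude_host,
                       :allowed_path, :exclude_path,
                       keyword_init: true) do
  # Returns true when the spider may follow the URL.
  def permit?(url)
    uri = URI.parse(url)
    host = uri.host.to_s
    path = uri.path.to_s

    # Exclude rules are checked first, so they override allow rules.
    return false if exclude_host && host.match?(exclude_host)
    return false if exclude_path && path.match?(exclude_path)

    # When an allow rule is set, the URL must match it.
    return false if allowed_host && !host.match?(allowed_host)
    return false if allowed_path && !path.match?(allowed_path)

    true
  rescue URI::InvalidURIError
    false # unparseable URLs are skipped
  end
end

filter = UrlFilter.new(
  allowed_host: /(\A|\.)example\.com\z/,
  exclude_path: %r{\A/logout}
)

puts filter.permit?('https://www.example.com/about')  # true
puts filter.permit?('https://www.example.com/logout') # false
puts filter.permit?('https://other.test/about')       # false
```

Keeping the host and path patterns separate means a path regex such as `\A/logout` cannot accidentally match part of a hostname, and vice versa.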

spencer-dollahite avatar Apr 28 '22 20:04 spencer-dollahite

I'll have a look as soon as I get a chance.

digininja avatar Apr 28 '22 21:04 digininja