Exclude & Allowed Switches Not Behaving as Expected
https://github.com/digininja/CeWL/blob/280bfe6f8f57a783cf447c47cfb38ad568177d00/cewl.rb#L814
When I provide regex patterns in a file for --exclude, or on the command line for --allowed, CeWL does not exclude or allow offsite URLs according to those rules.
Looking at that line of code, it only checks the path and not the domain. Are you expecting it to check the domain as well?
Right. I'd like to be able to limit the spider from crawling certain domains and allow it to crawl others based on a regex.
Not currently possible. You could easily tweak that line to check the domain instead. I don't know the property off hand, but try domain instead of path.
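For anyone following along, here is a rough sketch of the difference between matching a pattern against the path and matching it against the domain. It uses only Ruby's standard `URI` class, not CeWL's spider objects, and the helper names `host_matches?` and `path_matches?` are made up for illustration; the property CeWL would actually need at that line may be named differently.

```ruby
require 'uri'

# Hypothetical helpers, not part of CeWL. They only illustrate why a
# domain pattern never matches when the check looks at the path alone.

# True when the URL's host (domain) matches the pattern.
def host_matches?(url, pattern)
  host = URI.parse(url).host
  !host.nil? && host.match?(pattern)
end

# True when the URL's path matches the pattern.
def path_matches?(url, pattern)
  URI.parse(url).path.match?(pattern)
end

pattern = /example\.org/
url = 'https://www.example.org/about'

puts path_matches?(url, pattern) # path is "/about", prints false
puts host_matches?(url, pattern) # host is "www.example.org", prints true
```

A pattern written with a domain in mind has nothing to match in `/about`, which would explain offsite URLs slipping past a path-only check.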
Sounds good. Will do. By the way, after all these years of seeing you around the interwebs, I had no idea you were the author of CeWL, so I'm even more impressed and grateful for your contributions to the community.
Glad you like it.
If you get stuck, let me know, and I'll have a look for the right property in the morning.
https://github.com/spencer-dollahite/CeWL/blob/master/cewl.rb
This is the sort of approach I'd like to see: separate allowed and exclude pattern switches for both the domain and the path. I know the code here isn't perfect, but I think it is close enough for demo purposes. Thoughts?
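To make the idea concrete without pointing at the fork, a combined filter could look roughly like the sketch below. This is not the code from the linked fork or from CeWL itself; `crawl_url?` and its four keyword arguments are hypothetical names, and letting exclude rules win over allowed rules is just one possible design choice.

```ruby
require 'uri'

# Hypothetical filter, not CeWL's implementation. Each keyword takes an
# optional Regexp; a nil pattern means "no rule of that kind".
def crawl_url?(url, allowed_host: nil, exclude_host: nil,
               allowed_path: nil, exclude_path: nil)
  uri = URI.parse(url)
  host = uri.host.to_s
  path = uri.path.to_s

  # Exclude rules are checked first, so they override allowed rules.
  return false if exclude_host && host.match?(exclude_host)
  return false if exclude_path && path.match?(exclude_path)

  # When an allowed rule is given, the URL must match it.
  return false if allowed_host && !host.match?(allowed_host)
  return false if allowed_path && !path.match?(allowed_path)

  true
rescue URI::InvalidURIError
  # Assumption: URLs that cannot be parsed are simply skipped.
  false
end

rules = { allowed_host: /\.example\.org\z/, exclude_path: %r{\A/private/} }

puts crawl_url?('https://docs.example.org/guide', **rules)     # prints true
puts crawl_url?('https://docs.example.org/private/x', **rules) # prints false
puts crawl_url?('https://other.test/guide', **rules)           # prints false
```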
I'll have a look as soon as I get chance.