malspider
ssl support
I currently don't see any support for crawling sites that have SSL enabled. Is this supported?
Malspider generates "start" urls to crawl and they are all http. I can add https start urls fairly easily, but I don't know if that will fix the problem you are experiencing or not.
With the exception of the above issue, Malspider can crawl https sites. Can you elaborate more on the error message or problem you are experiencing?
I believe @79617261 is having the same issue I am. If I add a domain, www.mydomain.com, and it supports TLS, malspider will default to http:// not https://. If there is a redirect, malspider does not appear to follow it and will simply stop spidering the site.
In short, I want to be able to force https:// and simply not default to http://.
Hi Marcus,
Thank you for following up on this. I fixed a bug that was causing the spider to not follow 301/302 redirects, but I haven't committed the code yet. There is still the issue of needing to supply a list of start URLs to the spider. I currently supply "http://", "http://www.", "https://" and "https://www." as the start URLs to support various cases. I'll see if there is a way I can force https if the site supports it and avoid crawling any http pages... I will get back to you tomorrow.
-James
On Wed, Jul 13, 2016 at 9:20 AM, Marcus LaFerrera wrote:
I believe @79617261 is having the same issue I am. If I add a domain, www.mydomain.com, and it supports TLS, it will default to http:// not https://. If there is a redirect, malspider does not appear to follow it and will simply stop spidering the site.
In short, I want to be able to force https:// and simply not default to http://.
Thank you for your patience.
I pushed the code. Redirects are enabled and https is prioritized in the list of start_urls, but again, this doesn't mean an http page will never be hit.
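For illustration only, the https-first ordering of start URLs described above could be sketched roughly like this. This is not the code that was pushed: the `build_start_urls` helper and the way it handles a leading `www.` are assumptions made for the example.

```python
def build_start_urls(domain):
    """Return candidate start URLs for a domain, https variants first.

    Hypothetical helper for illustration; not taken from malspider itself.
    """
    # Strip a leading "www." so the bare and www hosts are each generated once.
    bare = domain[4:] if domain.lower().startswith("www.") else domain
    hosts = [bare, "www." + bare]
    # https is listed first so the TLS variants are tried before plain http.
    return [scheme + "://" + host for scheme in ("https", "http") for host in hosts]


print(build_start_urls("www.mydomain.com"))
```

Listing the https variants first only changes the order in which the spider starts; the http entries are still present, which is why an http page can still be hit.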
I can think of a few ways to further modify the spider to only support https, the most reasonable is the following:
Update the LxmlLinkExtractor loop at the bottom of malspider/spiders/full_domain_spider.py with a regex that only allows https links. Change:
    for link in LxmlLinkExtractor(unique=True,
            allow_domains=self.allowed_domains).extract_links(response):
to
    for link in LxmlLinkExtractor(allow=r'<your_https_regex>', unique=True,
            allow_domains=self.allowed_domains).extract_links(response):
and then remove any http start URLs from malspider_django/dashboard/management/commands/manage_spiders.py.
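As one possible value for the regex placeholder, a pattern such as `^https://` should do the job, assuming the link extractor applies the `allow` pattern to the absolute URL of each extracted link (worth verifying against your Scrapy version). The snippet below uses only Python's `re` module to show which links such a pattern would keep; the pattern, the `filter_https` helper, and the sample links are illustrative and not part of malspider.

```python
import re

# Assumed pattern: keep only absolute URLs whose scheme is https.
HTTPS_ONLY = r'^https://'


def filter_https(urls, pattern=HTTPS_ONLY):
    """Return the URLs that match the pattern, preserving their order."""
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]


sample_links = [
    "https://www.mydomain.com/login",
    "http://www.mydomain.com/about",
    "https://mydomain.com/",
    # An http URL that merely contains "https://" later in the string.
    "http://mydomain.com/?next=https://mydomain.com/",
]

print(filter_https(sample_links))
```

The leading `^` matters here: without it, an http URL that only mentions `https://` in its query string would also be allowed.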
-James
On Thu, Jul 14, 2016 at 4:23 PM, James Sheppard wrote:
Hi Marcus,
Thank you for following up on this. I fixed a bug that was causing the spider to not follow 301/302 redirects, but I haven't committed the code yet. There is still the issue of needing to supply a list of start URLs to the spider. I currently supply "http://", "http://www.", "https://" and "https://www." as the start URLs to support various cases. I'll see if there is a way I can force https if the site supports it and avoid crawling any http pages... I will get back to you tomorrow.
-James