ImageScraper Css scraping

Css scraping

Open kennedyshead opened this issue 8 years ago • 2 comments

        # Needs at module level: import re, requests; from urllib.parse import urljoin
        # Collect the stylesheets linked from the page.
        css = tree.xpath("//link[@type='text/css']/@href")
        css_images = list()
        for css_file in css:
            # urljoin resolves relative and protocol-relative ('//host/x') hrefs.
            css_file = urljoin(self.url, css_file)
            css_text = requests.get(css_file).content.decode('utf-8')
            image_list = re.findall(r'url\(([^)]+)\)', css_text)

            for image in image_list:
                # Drop optional whitespace and quotes around the url() argument.
                image = image.strip().strip('"\'')
                # url() paths are relative to the stylesheet, not to the page.
                css_images.append(urljoin(css_file, image))
        self.images.extend(self.process_links(css_images))

Feb 16 '17 10:02 kennedyshead

Sorry, I have no time for a PR at the moment.

Feb 16 '17 10:02 kennedyshead

It also needs some more error handling to work with http://, and the strip part is kind of ugly.
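
One way to tidy up the strip part and the http:// handling is to let `urljoin` resolve each reference against the stylesheet's own URL. This is only a rough sketch: `extract_css_image_urls` and `CSS_URL_RE` are made-up names, not part of ImageScraper, and it assumes Python 3's `urllib.parse`.

```python
import re
from urllib.parse import urljoin

# Hypothetical helper for illustration; not part of ImageScraper.
# Matches url(...) and captures the argument without surrounding whitespace.
CSS_URL_RE = re.compile(r'url\(\s*([^)]*?)\s*\)', re.IGNORECASE)


def extract_css_image_urls(css_text, css_url):
    """Return absolute URLs for every url(...) reference in css_text.

    css_url is the URL the stylesheet was fetched from; relative
    references are resolved against it, as browsers do.
    """
    urls = []
    for raw in CSS_URL_RE.findall(css_text):
        # Remove optional single or double quotes around the reference.
        ref = raw.strip('"\'').strip()
        # Skip empty references and inline data URIs (nothing to download).
        if not ref or ref.startswith('data:'):
            continue
        # urljoin handles '../x', 'x', '/x', '//host/x' and absolute URLs,
        # and keeps the stylesheet's scheme (http or https) where needed.
        urls.append(urljoin(css_url, ref))
    return urls
```

Because protocol-relative references inherit the scheme of the stylesheet URL, there is no need to hard-code 'https:'. Anything that is not an image (fonts, for example) would still have to be filtered out afterwards.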

Feb 16 '17 10:02 kennedyshead
