scrapy-html-storage Add configuration for saving HTML when URL meets a certain pattern

Open BurnzZ opened this issue 5 years ago • 1 comments

Hello @povilasb!

As the title describes, I was wondering if we could implement something like this?

My use case: when a spider crawls a site with many different kinds of pages and the crawl logic is spread across different parts of the code, manually setting save_html in the meta of every request gets tedious. I'm proposing that the middleware save the HTML whenever the URL matches one of a set of configured patterns.

For example:

HTML_STORAGE = {
    'save_on_url_patterns': [r'website\.com/page\d+\.html', r'website\.com/section-\w+\.html']
}
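
To make the idea concrete, here is a rough sketch of how the check might work. It is only an illustration, not the middleware's actual code: the `save_on_url_patterns` key is the option proposed here, and `compile_url_patterns` and `should_save_html` are hypothetical helper names. The dots in the patterns are escaped so that they match literally.

```python
import re

# Hypothetical settings dict using the proposed option name.
HTML_STORAGE = {
    'save_on_url_patterns': [
        r'website\.com/page\d+\.html',
        r'website\.com/section-\w+\.html',
    ],
}


def compile_url_patterns(settings):
    """Compile the configured patterns once (hypothetical helper)."""
    return [re.compile(p) for p in settings.get('save_on_url_patterns', [])]


def should_save_html(url, meta, patterns):
    """Decide whether a response's HTML should be saved (hypothetical helper).

    An explicit save_html flag in the request meta still wins; otherwise
    the URL is checked against every configured pattern.
    """
    if meta.get('save_html'):
        return True
    # re.search is used so a pattern may match anywhere in the URL.
    return any(p.search(url) for p in patterns)


patterns = compile_url_patterns(HTML_STORAGE)
print(should_save_html('https://website.com/page12.html', {}, patterns))
print(should_save_html('https://website.com/about.html', {}, patterns))
```

Requests that already set save_html in their meta would keep working as before, so the pattern list would only add a second way to opt in.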

Let me know what you think and I could send a PR.

Cheers!

Nov 30 '19 12:11 BurnzZ
