Georgiy Zatserklianyi
> I'm using SitemapSpider on a sitemapindex consisting of 20-30 sitemaps **each having 50k urls**.
> **Even trying each sitemap alone ends up eating all the memory on a 6gb...
One working option is to chain CSS calls, applying a `*::text` query to the selector that contains the text we aim to scrape. I applied this solution to the example HTML sample from the issue...
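A minimal sketch of that approach, using a stand-in HTML snippet rather than the exact sample from the issue:

```python
from scrapy.selector import Selector

# Stand-in markup; the real sample comes from the issue.
html = """
<div class="quote">
  <span class="text">Some text <em>with markup</em> inside</span>
</div>
"""

sel = Selector(text=html)
# First narrow down to the element that holds the text we want,
# then apply *::text to collect its text nodes and those of all descendants.
texts = sel.css("div.quote span.text").css("*::text").getall()
print(texts)  # ['Some text ', 'with markup', ' inside']
```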
Updated the pull request. As https://github.com/scrapy/scrapy/issues/3585 is still open and we don't have any other mention of the downloader slot component in the docs, at this stage it is not clear how...
> How does this solve #3529?

This pull request doesn't solve https://github.com/scrapy/scrapy/issues/3529 by itself. However, applying the functionality from this PR to SitemapSpider (I think) can. By assigning requests to sitemaps to...
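Roughly, the idea I have in mind looks like this (a sketch, not the PR code; the spider name and sitemap URL are made up): route the requests produced from each sitemap into their own downloader slot via the `download_slot` meta key, keyed here by the sitemap URL.

```python
import scrapy
from scrapy.spiders import SitemapSpider


class SlottedSitemapSpider(SitemapSpider):
    name = "slotted_sitemaps"
    sitemap_urls = ["https://example.com/sitemap_index.xml"]  # hypothetical URL

    def _parse_sitemap(self, response):
        # Wrap the stock sitemap parsing and tag every produced request
        # with a per-sitemap download slot, so each sitemap gets its own
        # concurrency/queue bookkeeping in the downloader.
        for request in super()._parse_sitemap(response):
            request.meta["download_slot"] = response.url
            yield request

    def parse(self, response):
        self.logger.info("Parsed %s from slot %s",
                         response.url, response.meta.get("download_slot"))
```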
I think that, in the scope of this issue, it is worth noting that in the base spider class with `start_urls` defined, requests from `start_urls` are yielded with `dont_filter=True` by `start_requests`...
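For reference, a small sketch of what that behaviour implies and how to opt out of it (spider name and URLs are made up):

```python
import scrapy


class FilteredStartSpider(scrapy.Spider):
    name = "filtered_start"
    start_urls = [
        "https://example.com/page",
        "https://example.com/page",  # duplicate on purpose
    ]

    def start_requests(self):
        # The base Spider.start_requests() yields Request(url, dont_filter=True),
        # so a duplicated start URL bypasses the dupefilter. Overriding it like
        # this lets the dupefilter drop the second request instead.
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=False)

    def parse(self, response):
        self.logger.info("Got %s", response.url)
```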
Here in `scrapy.selector.unified.Selector` (a subclass of the original parsel `Selector`) we use the Scrapy `Response` object to create the Selector object: https://github.com/scrapy/scrapy/blob/4af5a06842bfa0b169348d7a0b54e668ed58baa6/scrapy/selector/unified.py#L67-L82

1. Inside its `__init__` we can call `response.request.callback`, which links to the spider...
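A tiny sketch of the kind of hook point 1 describes (the subclass name is made up; this is not how Scrapy's Selector currently behaves):

```python
from scrapy.selector import Selector


class CallbackAwareSelector(Selector):
    """Illustrative only: remember which spider callback the originating
    request points to while the selector is built from a Response."""

    def __init__(self, response=None, **kwargs):
        super().__init__(response=response, **kwargs)
        self.origin_callback = None
        if response is not None and getattr(response, "request", None) is not None:
            # response.request.callback is the bound spider method, or None
            # when the spider's default parse() is going to handle it.
            self.origin_callback = response.request.callback
```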
I conclude that it is realistic to extend the existing CacheStorage classes to make it possible to read responses from the cache outside the scraping process (separately).

> cache storages want 'spider' and 'request'...
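A rough sketch of reading from the on-disk cache outside a crawl, assuming `FilesystemCacheStorage` and a project whose `HTTPCACHE_*` settings match the ones used while crawling; note that recent Scrapy versions also expect a request fingerprinter from `spider.crawler`, which this bare spider does not provide:

```python
from scrapy import Spider
from scrapy.extensions.httpcache import FilesystemCacheStorage
from scrapy.http import Request
from scrapy.utils.project import get_project_settings


class DummySpider(Spider):
    # The storage uses the spider mostly to locate the per-spider cache
    # directory, so a bare spider carrying the original spider's name may
    # be enough here (an assumption; newer releases pull the request
    # fingerprinter from spider.crawler).
    name = "myspider"


settings = get_project_settings()  # must carry the same HTTPCACHE_* values used during the crawl
storage = FilesystemCacheStorage(settings)
spider = DummySpider()

storage.open_spider(spider)
cached = storage.retrieve_response(spider, Request("https://example.com/some/page"))
if cached is not None:
    print(cached.status, len(cached.body))
storage.close_spider(spider)
```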
List of usages of the `spider.name` attribute in the Scrapy source code:

1. `scrapy/utils/engine.get_engine_status` - used in the memusage extension, the telnet console extension and in at least 1 test ([get_engine_status query](https://github.com/search?q=repo%3Ascrapy%2Fscrapy%20get_engine_status&type=code)) https://github.com/scrapy/scrapy/blob/42b3a3a23b328741f1720953235a62cba120ae7b/scrapy/utils/engine.py#L11-L27
2. `scrapy/extensions/statsmailer.StatsMailer`...
> What if a retry (instead of a redirect) needs to happen, because the server sent e.g. a 503 response. Wouldn’t the retry miss some required cookies?

I think that...
I am not able to reproduce this issue on recent master.

script.py

```python
from scrapy import FormRequest
import scrapy
from scrapy.crawler import CrawlerProcess


class LoginSpider(scrapy.Spider):
    name = 'login'
    # allowed_domains...
```