Georgiy Zatserklianyi
> I'm using SitemapSpider on a sitemapindex consisting of 20-30 sitemaps **each having 50k urls**.
> **Even trying each sitemap alone ends up eating all the memory on a 6gb...
One working option is to chain CSS calls, applying a `*::text` query to the selector that contains the text we aim to scrape. I applied this solution to the example HTML sample from the issue...
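A minimal sketch of that approach, using a stand-in HTML snippet rather than the exact sample from the issue:

```python
from scrapy.selector import Selector

# Stand-in markup; the real sample comes from the issue.
html = """
<div class="quote">
  <span class="text">Some text <em>with markup</em> inside</span>
</div>
"""

sel = Selector(text=html)
# First narrow down to the element that holds the text we want,
# then apply *::text to collect its text nodes and those of all descendants.
texts = sel.css("div.quote span.text").css("*::text").getall()
print(texts)  # ['Some text ', 'with markup', ' inside']
```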
Updated the pull request. As https://github.com/scrapy/scrapy/issues/3585 is still open and we don't have any other mention of the downloader slot component in the docs, at this stage it is not clear how...
> How does this solve #3529?

This pull request doesn't solve https://github.com/scrapy/scrapy/issues/3529 by itself. However, applying the functionality from this PR to SitemapSpider (I think) can. By assigning requests to sitemaps to...
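Roughly, the idea I have in mind looks like this (a sketch, not the PR code; the spider name and sitemap URL are made up): route the requests produced from each sitemap into their own downloader slot via the `download_slot` meta key, keyed here by the sitemap URL.

```python
import scrapy
from scrapy.spiders import SitemapSpider


class SlottedSitemapSpider(SitemapSpider):
    name = "slotted_sitemaps"
    sitemap_urls = ["https://example.com/sitemap_index.xml"]  # hypothetical URL

    def _parse_sitemap(self, response):
        # Wrap the stock sitemap parsing and tag every produced request
        # with a per-sitemap download slot, so each sitemap gets its own
        # concurrency/queue bookkeeping in the downloader.
        for request in super()._parse_sitemap(response):
            request.meta["download_slot"] = response.url
            yield request

    def parse(self, response):
        self.logger.info("Parsed %s from slot %s",
                         response.url, response.meta.get("download_slot"))
```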
I think that, in the scope of this issue, it is worth noting that in the base spider class with `start_urls` defined, requests from `start_urls` are yielded with `dont_filter=True` by `start_requests`...
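For reference, a small sketch of what that behaviour implies and how to opt out of it (spider name and URLs are made up):

```python
import scrapy


class FilteredStartSpider(scrapy.Spider):
    name = "filtered_start"
    start_urls = [
        "https://example.com/page",
        "https://example.com/page",  # duplicate on purpose
    ]

    def start_requests(self):
        # The base Spider.start_requests() yields Request(url, dont_filter=True),
        # so a duplicated start URL bypasses the dupefilter. Overriding it like
        # this lets the dupefilter drop the second request instead.
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=False)

    def parse(self, response):
        self.logger.info("Got %s", response.url)
```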
Here in `scrapy.selector.unified.Selector` (a subclass of the original parsel `Selector`) we use the Scrapy `Response` object to create the Selector object: https://github.com/scrapy/scrapy/blob/4af5a06842bfa0b169348d7a0b54e668ed58baa6/scrapy/selector/unified.py#L67-L82

1. Inside its `__init__` we can call `response.request.callback`, which links to the spider...
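A tiny sketch of the kind of hook point 1 describes (the subclass name is made up; this is not how Scrapy's Selector currently behaves):

```python
from scrapy.selector import Selector


class CallbackAwareSelector(Selector):
    """Illustrative only: remember which spider callback the originating
    request points to while the selector is built from a Response."""

    def __init__(self, response=None, **kwargs):
        super().__init__(response=response, **kwargs)
        self.origin_callback = None
        if response is not None and getattr(response, "request", None) is not None:
            # response.request.callback is the bound spider method, or None
            # when the spider's default parse() is going to handle it.
            self.origin_callback = response.request.callback
```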
I conclude that it is realistic to extend the existing CacheStorage classes to make it possible to read responses from the cache outside the scraping process (separately).

> cache storages want 'spider' and 'request'...
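A rough sketch of reading from the on-disk cache outside a crawl, assuming `FilesystemCacheStorage` and a project whose `HTTPCACHE_*` settings match the ones used while crawling; note that recent Scrapy versions also expect a request fingerprinter from `spider.crawler`, which this bare spider does not provide:

```python
from scrapy import Spider
from scrapy.extensions.httpcache import FilesystemCacheStorage
from scrapy.http import Request
from scrapy.utils.project import get_project_settings


class DummySpider(Spider):
    # The storage uses the spider mostly to locate the per-spider cache
    # directory, so a bare spider carrying the original spider's name may
    # be enough here (an assumption; newer releases pull the request
    # fingerprinter from spider.crawler).
    name = "myspider"


settings = get_project_settings()  # must carry the same HTTPCACHE_* values used during the crawl
storage = FilesystemCacheStorage(settings)
spider = DummySpider()

storage.open_spider(spider)
cached = storage.retrieve_response(spider, Request("https://example.com/some/page"))
if cached is not None:
    print(cached.status, len(cached.body))
storage.close_spider(spider)
```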
List of usages of the `spider.name` attribute in the Scrapy source code:

1. `scrapy/utils/engine.get_engine_status` - used in the memusage extension, the telnet console extension and in at least 1 test ([get_engine_status query](https://github.com/search?q=repo%3Ascrapy%2Fscrapy%20get_engine_status&type=code)) https://github.com/scrapy/scrapy/blob/42b3a3a23b328741f1720953235a62cba120ae7b/scrapy/utils/engine.py#L11-L27
2. `scrapy/extensions/statsmailer.StatsMailer`...
> What if a retry (instead of a redirect) needs to happen, because the server sent e.g. a 503 response. Wouldn’t the retry miss some required cookies?

I think that...
I am not able to reproduce this issue on recent master.

script.py

```python
from scrapy import FormRequest
import scrapy
from scrapy.crawler import CrawlerProcess


class LoginSpider(scrapy.Spider):
    name = 'login'
    # allowed_domains...
```