Create root node memory 210

Open GeorgeA92 opened this issue 3 years ago • 4 comments

Aims to fix #210. The features of https://github.com/scrapy/parsel/pull/213 are implemented as an extension to the default parsel.Selector class, following @Gallaecio's suggestion.

If the text parameter of Selector is a string, as in sel = Selector(text='<ul><li id="1">1</li><li id="2">2</li></ul>'), it is expected to keep working without changes, as it does now:

s1 = Selector(text='<ul><li id="1">1</li><li id="2">2</li></ul>')
print(s1.css('li::text').getall())
# output -> ['1', '2']

If text is bytes (the current version raises TypeError), the parser is expected to interpret the bytes input according to the encoding parameter added in this PR:

s2 = Selector(text=b'<ul><li id="1">1</li><li id="2">2</li></ul>', encoding='ascii')
print(s2.css('li::text').getall())
# output -> ['1', '2']

s3 = Selector(text=b'<ul><li id="1">1\xD0\xA4</li><li id="2">2</li></ul>', encoding='utf8')  # Cyrillic Ф symbol added
print(s3.css('li::text').getall())
# output -> ['1Ф', '2']

If text is bytes and encoding is not specified, the input is interpreted as UTF-8:

s4 = Selector(text=b'<ul><li id="1">1\xD0\xA4</li><li id="2">2</li></ul>')
print(s4.css('li::text').getall())
# output -> ['1Ф', '2']
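
The three cases above (str, bytes with an explicit encoding, bytes with the default) can be summarized in a small stdlib-only sketch. `resolve_body` is a hypothetical helper used only for illustration; it is not part of parsel and not the code in this PR. It assumes that str input keeps being encoded as UTF-8 before parsing, while bytes input is passed through untouched together with the requested encoding.

```python
from typing import Tuple, Union


def resolve_body(text: Union[str, bytes], encoding: str = "utf8") -> Tuple[bytes, str]:
    """Hypothetical helper (not parsel API): choose the bytes and the
    encoding that would be handed to the underlying parser."""
    if isinstance(text, str):
        # Assumption: str input is encoded to UTF-8, as it was before this PR.
        return text.encode("utf8"), "utf8"
    if isinstance(text, bytes):
        # bytes input is not decoded here; the parser interprets it
        # according to `encoding` (UTF-8 when the caller gives none).
        return text, encoding
    raise TypeError(f"text must be str or bytes, got {type(text).__name__}")


body, enc = resolve_body(b'<li id="1">1\xD0\xA4</li>')
print(enc, body.decode(enc))
```

The point of keeping the bytes branch free of any decode step is that no intermediate str copy of the document has to be created.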

Code sample (Scrapy) using the updated `Selector` class:

import scrapy
from scrapy.crawler import CrawlerProcess
from parsel.selector import Selector

class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        "DOWNLOAD_DELAY":1,
        "DOWNLOADER_MIDDLEWARES":
            {
                'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': None,
            }
    }
    def start_requests(self):
        yield scrapy.Request(url='https://quotes.toscrape.com', callback=self.parse)

    def parse(self, response):
        print(f"memory allocation (body) as str made: {bool(response._cached_ubody)}")  # expected: False
        sel = Selector(response.body, encoding=response.encoding)  # expected encoding: UTF-8
        links = sel.css("a::attr(href)").getall()
        print(links)
        print(f"memory allocation (body) as str made: {bool(response._cached_ubody)}")


process = CrawlerProcess()
process.crawl(QuotesToScrapeSpider)
process.start()
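
As a rough illustration of why the sample checks whether a str copy of the body was made: decoding a body produces a second object of comparable size that lives alongside the original bytes. The sketch below uses only the standard library and a synthetic ASCII body, not an actual Scrapy response, so the numbers are indicative only.

```python
import sys

# Synthetic body standing in for a downloaded page (assumption: ASCII-only).
body = b'<ul><li id="1">1</li><li id="2">2</li></ul>' * 10_000

# Decoding allocates a new str object in addition to the bytes object.
text = body.decode("utf8")

print(len(body), sys.getsizeof(text))
```

Passing the bytes straight to the selector avoids holding both objects at once.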

GeorgeA92, May 13 '21 16:05

Trying to trigger tests…

Gallaecio, Jul 05 '21 07:07

@Gallaecio I created new test cases for checking selectors with bytes input.

GeorgeA92, Sep 17 '21 15:09

Codecov Report

Merging #217 (c5597a7) into master (f5f73d3) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #217   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            5         5           
  Lines          290       293    +3     
  Branches        59        60    +1     
=========================================
+ Hits           290       293    +3     

| Impacted Files | Coverage Δ |
|---|---|
| parsel/selector.py | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update f5f73d3...c5597a7.

codecov[bot], Sep 25 '21 10:09

Maybe this needs conflict resolution before the tests can restart?

wRAR, Jan 28 '22 13:01