
Speed regression in ItemLoader since 1.0

Open · juraseg opened this issue on Feb 15, 2018 · 7 comments

Hi

TLDR: ItemLoader does not reuse response.selector when you pass a response to it as an argument. It looks like it did reuse it up to Scrapy==1.0.

Recently we tried to upgrade our setup to a newer version of Scrapy (we are on 1.0.5) and noticed a huge slowdown in a lot of our spiders.

I dug down a bit and found that the bottleneck was ItemLoader, in particular when you create many ItemLoaders (one per item), pass the response to every one of them, and the response body is relatively big.

The slowdown goes back to version 1.1, so the performance degradation is present on Scrapy==1.1 and above.

Test spider code (testfile is just a random 1 MB file):

# -*- coding: utf-8 -*-
import os.path

import scrapy
from scrapy.loader import ItemLoader
from scrapy.item import Item, Field

HERE = os.path.abspath(os.path.dirname(__file__))


class Product(Item):
    url = Field()
    name = Field()


class TestSpeedSpider(scrapy.Spider):
    name = 'test_speed'

    def start_requests(self):
        file_path = os.path.join(HERE, 'testfile')
        file_url = 'file://' + file_path
        yield scrapy.Request(file_url)

    def parse(self, response):
        # xrange targets Python 2; use range() on Python 3
        for i in xrange(300):
            # Passing the response to every loader is what triggers the
            # slowdown: each ItemLoader builds a fresh Selector from it
            loader = ItemLoader(item=Product(), response=response)
            loader.add_value('name', 'item {}'.format(i))
            loader.add_value('url', 'http://site.com/item{}'.format(i))

            yield loader.load_item()

juraseg commented Feb 15 '18 11:02

Hi @juraseg - thanks for raising this! I'm looking back over the history to get familiar with how these objects were instantiated before and now. Lots of ground to cover. But it looks like you could deliberately provide a selector instance to the ItemLoader constructor and sidestep the issue entirely. Is this something you've explored?
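For illustration, a minimal sketch of that workaround as a drop-in replacement for the parse method in the report above (response.selector parses the body once per response and caches the result):

def parse(self, response):
    # Build every loader from one shared selector instead of the raw
    # response, so the 1 MB body is parsed only once
    selector = response.selector
    for i in range(300):
        loader = ItemLoader(item=Product(), selector=selector)
        loader.add_value('name', 'item {}'.format(i))
        loader.add_value('url', 'http://site.com/item{}'.format(i))
        yield loader.load_item()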

ghost commented Feb 16 '18 13:02

As an experiment, I added a line to the ItemLoader constructor that checks whether response._cached_selector is not None and uses it if present. Locally, on Py2.7 and Py3.5, it seems to pass the tests well enough. So this ought to be fixable, but we should probably discuss the most appropriate place to reuse selectors in this way: implemented incautiously, it might silently and unexpectedly clobber custom selector subclasses.
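A sketch of that experimental check, for illustration only (this was not the final fix, and _cached_selector is a private attribute of Scrapy's TextResponse):

def _selector_for(selector, response):
    # Prefer an already-parsed selector when the response has one,
    # to avoid building a second Selector from the same body
    if selector is None and response is not None:
        cached = getattr(response, '_cached_selector', None)
        if cached is not None:
            return cached
    return selector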

ghost commented Feb 16 '18 14:02

I believe the regression happened when the lxml document cache was removed as part of https://github.com/scrapy/scrapy/pull/1409. The idea was to use response.selector explicitly; we fixed it for link extractors (https://github.com/scrapy/scrapy/pull/1409/files#diff-952f1c77a1cfacb08eed58f08daa8870R67), but not for item loaders.

Item loaders are a bit different because they allow customizing the selector class; this is where we should be careful when implementing a fix.

Checking response._cached_selector is not None doesn't look clean as a final solution, since we'd be accessing a private attribute. Even when there is no cached selector yet, it makes sense to use response.selector (which caches transparently): that way, if further code needs a selector, the body won't be parsed one more time.
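A minimal illustration of that caching behaviour, assuming a TextResponse (e.g. an HtmlResponse):

sel1 = response.selector  # parses the body and caches the Selector
sel2 = response.selector  # returns the cached object, no re-parse
assert sel1 is sel2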

kmike commented Feb 16 '18 15:02

Hi guys

@cathalgarvey Yes, I tried providing a selector instead of a response and it does the job. So it really is a matter of reusing the selector from the response.

In case it helps, I did what @kmike suggests to solve the issue for our case: I created a custom loader that checks whether response.selector is available. To work around the issue with custom selectors, this loader checks whether response.selector.__class__ is equal to the loader's default_selector_class.

The class check looks a bit clunky, but I have not found another way to do it (isinstance won't work, as it returns True for subclasses as well, which is not what we need here).
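A hypothetical sketch of the custom loader described above (the class name is made up, and note kmike's caveat below about the cost of the first access to response.selector):

from scrapy.loader import ItemLoader

class SelectorReusingItemLoader(ItemLoader):
    def __init__(self, item=None, selector=None, response=None, **context):
        if selector is None and response is not None:
            # Exact class check: isinstance() would also accept
            # subclasses, which may extract differently
            if response.selector.__class__ is self.default_selector_class:
                selector = response.selector
        super(SelectorReusingItemLoader, self).__init__(
            item=item, selector=selector, response=response, **context)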

juraseg commented Feb 16 '18 16:02

@juraseg I think your approach is fine, but there is an extra tricky part: when you access response.selector.__class__, the response body may be parsed just to fill the "selector" attribute for the first time. If the class is not the same as the ItemLoader's default_selector_class, the response is then parsed again. The second parse is inevitable, but parsing the response with the wrong selector just to get its type looks unnecessary.

kmike commented Feb 16 '18 18:02

@kmike Ah, I see what you mean.

So the only way to avoid the second parsing is to know in advance which class response.selector would be, and compare against that instead of accessing response.selector. For example, the response object could have a read-only selector_class property.
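A hypothetical sketch of that suggestion (selector_class is not a real Scrapy attribute, and the subclass exists only for illustration):

from scrapy.http import TextResponse
from scrapy.selector import Selector

class TextResponseWithSelectorClass(TextResponse):
    @property
    def selector_class(self):
        # Answers "what type would response.selector be?"
        # without parsing the body
        return Selector

A loader could then compare response.selector_class against its default_selector_class before ever touching response.selector.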

juraseg commented Feb 17 '18 06:02

Hey guys, I tried implementing the modification to the ItemLoader constructor: when response.selector is available, we use it instead of creating a new Selector. For my sample code, performance matches that of Scrapy 1.0 again. However, it seems the discussion is still ongoing, so I'm wondering if you can tell me more about the potential outcome of the approach shown in this diff: scrapy/scrapy#3157

malloxpb commented Mar 08 '18 07:03