
Extend ItemLoader processors

Open Matthijsy opened this issue 6 years ago • 6 comments

Currently there are three ways to attach an ItemLoader processor (sketched after the list):

  • The default_input/output_processor on the ItemLoader class
  • The field_name_in/out on the ItemLoader class
  • The input/output_processor on the scrapy.Field
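For reference, a minimal sketch of what those three mechanisms look like (the processor names and the clean_text helper are placeholders, assuming the processors shipped with itemloaders/scrapy):

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst

def clean_text(value):
    return value.strip()

class MyLoader(ItemLoader):
    # 1. Defaults, applied to every field without a more specific processor
    default_input_processor = MapCompose(clean_text)
    default_output_processor = TakeFirst()

    # 2. Per-field attributes on the loader class
    name_in = MapCompose(clean_text, str.title)
    name_out = TakeFirst()

class MyItem(scrapy.Item):
    # 3. Per-field metadata on the item definition
    name = scrapy.Field(
        input_processor=MapCompose(clean_text),
        output_processor=TakeFirst(),
    )

The loader attribute (2) takes precedence over the Field metadata (3), which in turn takes precedence over the defaults (1).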

Personally I use the input/output_processor on scrapy.Field combined with the default_input/output_processor a lot. Often I just want to add one more processor after the default ones. Since input/output_processor on scrapy.Field overrides the defaults, this is quite hard to do. So I would propose another way to add input/output processors: something like add_input/add_output on scrapy.Field, which would append the specified processor to the default processor.

I did implement this on my own ItemLoader class, but I think it would be useful for the Scrapy core. My implementation is as follows (original source: https://github.com/scrapy/scrapy/blob/master/scrapy/loader/__init__.py#L69). Of course the same can be done for get_output_processor.

def get_input_processor(self, field_name):
    # A <field>_in attribute on the loader still takes precedence.
    proc = getattr(self, '%s_in' % field_name, None)
    if not proc:
        override_proc = self._get_item_field_attr(field_name, 'input_processor')
        extend_proc = self._get_item_field_attr(field_name, 'add_input')
        if override_proc and extend_proc:
            raise ValueError(f'Not allowed to define both input_processor and add_input for {field_name}')
        if override_proc:
            return override_proc
        elif extend_proc:
            # Run the default processor first, then the field-specific extension.
            return Compose(self.default_input_processor, extend_proc)
        return self.default_input_processor
    return proc
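For context, here is a minimal sketch (not from the original post; the processors are arbitrary stand-ins, assuming itemloaders.processors) of what the Compose(self.default_input_processor, extend_proc) branch produces:

from itemloaders.processors import Compose, MapCompose

default_input_processor = MapCompose(str.strip)   # stand-in for the loader default
extend_proc = MapCompose(str.title)               # stand-in for a field's add_input

combined = Compose(default_input_processor, extend_proc)
print(combined(["  alice  ", "  bob "]))  # ['Alice', 'Bob']

Because Compose passes the full output of the default processor on to the extension, an add_input processor would typically be a MapCompose (or another callable that accepts a list of values).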

I am not sure if add_input is a good name; extend_input_processor is probably clearer, but that is quite a long name. I would like to hear whether more people want this feature and what you all think the naming should be.

Matthijsy avatar Jan 11 '19 21:01 Matthijsy

I am having a hard time trying to picture what you want done, even after reading the documentation from your pull request.

Could you provide some sample code that shows the (convoluted) way to achieve your goal before your changes, and the (simpler) way to do the same after your suggested changes?

Gallaecio avatar Mar 22 '19 14:03 Gallaecio

We have a custom ItemLoader which specifies some default input processors. Some fields need an additional input processor which is specific to that field. Currently we need to copy the default input processors and add the new one. If we want to add a new default input processor in the future, we have to do this for all of those fields again. I would like to be able to keep the default input processors and only extend them with one new input processor.

Example: our default input processors are a strip and a remove-tags function to clean some data. Now we have a Scrapy item representing a person with two fields:

  • Name -> for this field those input processors are fine
  • Height -> This data is in a format like 1.50m, but we want it in cm as an integer (150). So we want to add a new input processor to do this, but keep the strip and remove_tags input processors (see the sketch after this list)
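A minimal sketch of the Height case as it has to be written today (assuming the defaults are str.strip and w3lib's remove_tags; parse_height is a made-up helper). Note how the default chain has to be repeated in the field definition, because input_processor replaces it entirely:

import scrapy
from itemloaders.processors import MapCompose
from w3lib.html import remove_tags

def parse_height(value):
    # "1.50m" -> 150 (centimetres, as an integer)
    return int(float(value.rstrip("m")) * 100)

class Person(scrapy.Item):
    name = scrapy.Field()
    # The strip/remove_tags defaults must currently be duplicated here:
    height = scrapy.Field(
        input_processor=MapCompose(remove_tags, str.strip, parse_height)
    )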

I hope this makes the use case a bit clearer.

Matthijsy avatar Mar 26 '19 11:03 Matthijsy

So, if I got it right, you are saying that you have:

class MyItemLoader(ItemLoader):
    default_input_processor = some_input_processor
    some_field_in = MapCompose(some_input_processor, another_input_processor)

How would you like that to look instead?

Gallaecio avatar Mar 26 '19 12:03 Gallaecio

No, we don't use the some_field_in approach; we use it like this (current style):

class MyItemLoader(ItemLoader):
    default_input_processor = some_input_processor

class User(scrapy.Item):
    name = scrapy.Field()
    height = scrapy.Field(input_processor=MapCompose(some_input_processor, another_input_processor))

But I would like to do it in this style

class MyItemLoader(ItemLoader):
    default_input_processor = some_input_processor

class User(scrapy.Item):
    name = scrapy.Field()
    height = scrapy.Field(add_input=another_input_processor)

This way I don't have to duplicate some_input_processor.
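With the get_input_processor override sketched earlier in the thread, the loader would then resolve the processors to something equivalent to the following (hypothetical behaviour under the proposal; add_input is not an existing API):

loader = MyItemLoader(item=User())
loader.get_input_processor("name")    # some_input_processor (the default)
loader.get_input_processor("height")  # Compose(some_input_processor, another_input_processor)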

Matthijsy avatar Mar 26 '19 14:03 Matthijsy

@Gallaecio Do you understand the problem now? Or do I need to explain more?

Matthijsy avatar Jun 04 '19 14:06 Matthijsy

I think I understand, although I personally don’t like coupling Item’s Field class and the ItemLoader class that way.

Gallaecio avatar Jun 04 '19 14:06 Gallaecio