Extend ItemLoader processors
Currently there are three ways to add an ItemLoader processor:
- `default_input_processor`/`default_output_processor` on the `ItemLoader` class
- `field_name_in`/`field_name_out` on the `ItemLoader` class
- `input_processor`/`output_processor` on the `scrapy.Field`
Personally, I use the `input_processor`/`output_processor` on `scrapy.Field` combined with the `default_input_processor`/`default_output_processor` a lot. Often I just want to add one more processor after the default processors, but since `input_processor`/`output_processor` on `scrapy.Field` overrides the defaults, this is quite hard to do.
So I would propose another way to add input/output processors: something like `add_input`/`add_output` on `scrapy.Field`, which would append the specified processor to the default processor.
I implemented this on my own ItemLoader class, but I think it would be useful in the Scrapy core. My implementation is as follows (original source: https://github.com/scrapy/scrapy/blob/master/scrapy/loader/__init__.py#L69). Of course, the same can be added to `get_output_processor` in the same way.
```python
def get_input_processor(self, field_name):
    proc = getattr(self, '%s_in' % field_name, None)
    if not proc:
        override_proc = self._get_item_field_attr(field_name, 'input_processor')
        extend_proc = self._get_item_field_attr(field_name, 'add_input')
        if override_proc and extend_proc:
            raise ValueError(f'Not allowed to define input_processor and add_input to {field_name}')
        if override_proc:
            return override_proc
        elif extend_proc:
            return Compose(self.default_input_processor, extend_proc)
        return self.default_input_processor
    return proc
```
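For reference, here is a minimal, self-contained sketch of how the `Compose(self.default_input_processor, extend_proc)` line would resolve. The `Compose` class here is a stand-in for `scrapy.loader.processors.Compose` (which calls each function in turn on the value), and the processor functions are made up for illustration:

```python
# Stand-in for scrapy.loader.processors.Compose: calls each function
# in order, feeding each one the previous function's return value.
class Compose:
    def __init__(self, *functions):
        self.functions = functions

    def __call__(self, value):
        for func in self.functions:
            value = func(value)
        return value

# Hypothetical processors, purely for illustration.
def default_input_processor(value):
    return value.strip()

def extend_proc(value):
    # The extra processor a field would supply via add_input.
    return value.upper()

# add_input semantics: run the default processor first, then the extension.
proc = Compose(default_input_processor, extend_proc)
print(proc('  hello  '))  # HELLO
```

The point is that the field only declares the *extra* step; the default stays defined in one place on the loader.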
I am not sure if `add_input` is a good name; `extend_input_processor` is probably clearer, but it is quite long. I would like to hear whether more people want this feature and what you all think the naming should be.
I am having a hard time trying to picture what you want done, even after reading the documentation from your pull request.
Could you provide some sample code that shows the (convoluted) way to achieve your goal before your changes, and the (simpler) way to do the same after your suggested changes?
We have a custom ItemLoader which specifies some default input processors. Some fields need an additional input processor that is specific to that field. Currently we need to copy the default input processors and add the new one. When we want to add a new default input processor in the future, we have to update all those fields again. I would like to be able to keep the default input processors and only extend them with one new input processor.
Example:
As default input processors we have a `strip` and a `remove_tags` function to clean the data.
Now we have a Scrapy item representing a person with two fields:
- Name -> for this field those input processors are fine
- Height -> this data comes in a format like `1.50m`, but we want it in cm as an integer (`150`). So we want to add a new input processor to do this, but keep the `strip` and `remove_tags` input processors
I hope this makes the use case a bit more clear
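The height use case above can be sketched like this. All processor names (`strip`, `remove_tags`, `meters_to_cm`) are hypothetical stand-ins, not Scrapy built-ins, and `chain` just mimics how chained processors pass a value along:

```python
import re

# Illustrative processors; these names are hypothetical, not Scrapy APIs.
def strip(value):
    return value.strip()

def remove_tags(value):
    # Crude tag removal, good enough for a sketch.
    return re.sub(r'<[^>]+>', '', value)

def meters_to_cm(value):
    # '1.50m' -> 150
    return int(float(value.rstrip('m')) * 100)

def chain(*processors):
    """Stand-in for chaining processors the way Compose does."""
    def process(value):
        for proc in processors:
            value = proc(value)
        return value
    return process

# The Height field wants the two defaults *plus* one extra step.
height_in = chain(strip, remove_tags, meters_to_cm)
print(height_in('  <span>1.50m</span> '))  # 150
```

With the proposed `add_input`, only `meters_to_cm` would need to be declared on the field; `strip` and `remove_tags` would come from the loader's default.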
So, if I got it right, you are saying that you have:
```python
class MyItemLoader(ItemLoader):
    default_input_processor = some_input_processor
    some_field_in = MapCompose(some_input_processor, another_input_processor)
```
How would you like that to look like instead?
No, we don't use the `some_field_in` approach; we use it like this (current style):
```python
class MyItemLoader(ItemLoader):
    default_input_processor = some_input_processor

class User(scrapy.Item):
    name = scrapy.Field()
    height = scrapy.Field(input_processor=MapCompose(some_input_processor, another_input_processor))
```
But I would like to do it in this style:
```python
class MyItemLoader(ItemLoader):
    default_input_processor = some_input_processor

class User(scrapy.Item):
    name = scrapy.Field()
    height = scrapy.Field(add_input=another_input_processor)
```
This way I don't have to duplicate `some_input_processor`.
@Gallaecio Do you understand the problem now? Or do I need to explain more?
I think I understand, although I personally don’t like coupling Item’s Field class and the ItemLoader class that way.