Small improvement to the docs for setting ITEM_PIPELINES
In the docs
https://github.com/scrapy/scrapy/blob/65d631329a1434ec013f24341e4b8520241aec70/scrapy/templates/project/module/pipelines.py.tmpl
It says, in the comments:
Define your item pipelines here
Don't forget to add your pipeline to the ITEM_PIPELINES setting
See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
Please change the instruction to:
Don't forget to add your pipeline to the ITEM_PIPELINES setting in settings.py
I added the setting in my spider's __init__, and it was hard to figure out what was going wrong. Mentioning settings.py would help others who make the same mistake.
The thing is, settings.py is just one of the places where ITEM_PIPELINES can be defined, and mentioning every place a setting can be defined, everywhere the documentation mentions a setting, would make things too verbose.
I understand your frustration, but I’m not sure how we can improve things. Users are expected to have read https://docs.scrapy.org/en/latest/topics/settings.html#populating-the-settings by the time they look up specific settings in the documentation.
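For context, here is a sketch of the two most common places where ITEM_PIPELINES can be set (project names and the pipeline path are placeholders, and the priority value 300 is just a conventional example; lower numbers run earlier):

```python
# Option 1: project-wide, in settings.py (placeholder pipeline path)
ITEM_PIPELINES = {
    "myproject.pipelines.MyPipeline": 300,
}

# Option 2: per-spider, via the custom_settings class attribute
# (a class attribute, not something assigned inside __init__)
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    custom_settings = {
        "ITEM_PIPELINES": {"myproject.pipelines.MyPipeline": 300},
    }
```

Setting it inside the spider's __init__ does not work because Scrapy reads custom_settings from the class before the spider is instantiated.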
Out of curiosity, why does an Item need to be declared in ITEM_PIPELINES in order to be processed?
I'm learning scrapy and this just bit me -- I was yielding an Item subclass with a process_item method, but process_item wasn't called until I added my class to ITEM_PIPELINES.
This was counterintuitive to me as a learner -- is there a reason someone would yield an Item without wanting its process method to be called?
Related issue for the pipeline docs: #2350
I kind of agree with #2350 -- I'm an experienced Python programmer, but it took me a while to figure out item pipelines from the docs. I couldn't find a complete example -- the entire 'item pipelines' docs page, for example, doesn't contain the yield keyword anywhere. A small self-contained example (which includes the ITEM_PIPELINES reminder) would have helped a lot.
Happy to submit a (small) docs PR if helpful, but fair warning I'm not a scrapy expert.
I'm learning scrapy and this just bit me -- I was yielding an Item subclass with a process_item method, but process_item wasn't called until I added my class to ITEM_PIPELINES. This was counterintuitive to me as a learner -- is there a reason someone would yield an Item without wanting its process method to be called?
This is the first time I hear of someone defining a process_item method on an item class itself. Item pipeline classes are intended to be separate from item classes, it is not customary to use an item class also as an item pipeline.
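A sketch of that separation, with hypothetical names (Scrapy also accepts dataclass objects as items, which keeps this example free of Scrapy-specific imports):

```python
from dataclasses import dataclass


@dataclass
class Product:
    # Item class: declares data only, no processing logic.
    name: str
    price: float


class DiscountPipeline:
    # Pipeline class: separate from the item; this is what gets
    # registered in ITEM_PIPELINES, not the item class.
    def process_item(self, item, spider):
        item.price = round(item.price * 0.9, 2)
        return item


product = DiscountPipeline().process_item(Product("widget", 10.0), spider=None)
```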
ahhh that makes sense -- I misunderstood the API here
fwiw it would really help to add an end-to-end example in the 'item pipelines' docs page
one that included yielding from a spider