scrapyrt
scrapyrt copied to clipboard
[WIP] Add max_items
Close #107
Thank you for this contribution @serhii73. I really appreciate it!
The idea behind this feature is to limit number of items in output. With current implementation it doesn't really do the trick. You can test it by running curl
> curl "http://localhost:9080/crawl.json?url=http://quotes.toscrape.com&spider_name=toscrape-css&max_items=1" -v | jq '.items'
I passed max_items=1 but I got 10 items.
This is because limit_items function is not called when item is scraped. To actually call it when item is scraped you need to plug it into signal item_scraped. There is already one handler registered for this signal it is method "get_item". If you move your code counting items there it should do the trick.
:+1: for adding test, test is ok, but it would be also good to add one integration test in test_resource_crawl.TestCrawlResourceIntegration passing max_items parameter to API and checking if it actually works as expected. It is possible that some argument will be work fine in crawl manager but will be filtered earlier in HTTP resource handling. This is why it is recommended to add integration test.
Finally, when we merge it we'll need to also update documentation with this parameter so that people know how to use it.