
custom parser callback sample

Open ziyouchutuwenwu opened this issue 4 years ago • 7 comments

hi, is there any sample showing how to use a custom parser callback instead of the default parse_item? I read the docs here, but I don't know how to use it.

thanks for your help

ziyouchutuwenwu avatar Jul 30 '21 05:07 ziyouchutuwenwu

@Ziinc probably can give more info here.

But could you please describe the use case? Why can't you use parse_item?

oltarasenko avatar Jul 30 '21 09:07 oltarasenko

here is my usage scenario:

For site demo.com, I need to get some info, such as the title and category, from the main page, and collect sub-URLs from some of its links. When I get a sub-URL, I send a request and then parse data from the response; there I need detail info such as author, price, etc.

The parser for the sub pages should be different from the one for the main page, and I don't know how to do that with Crawly.

Many thanks.

ziyouchutuwenwu avatar Jul 30 '21 10:07 ziyouchutuwenwu
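The two-level flow described above (main page → detail pages) can be sketched in a single Crawly spider by branching on the request URL inside parse_item. This is only an illustrative sketch: the module name, URL pattern, and CSS selectors below are made up, and it assumes Floki is available for HTML parsing.

```elixir
# Hypothetical sketch: one spider, two page types, dispatched on the URL.
defmodule DemoSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://demo.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://demo.com/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # Dispatch on the URL: detail pages get their own parse function.
    if String.contains?(response.request_url, "/detail/") do
      parse_detail(response)
    else
      parse_main(response)
    end
  end

  # Main page: extract title/category and follow links to detail pages.
  defp parse_main(response) do
    {:ok, document} = Floki.parse_document(response.body)

    requests =
      document
      |> Floki.find("a.detail-link")
      |> Floki.attribute("href")
      |> Enum.map(&Crawly.Utils.build_absolute_url(&1, response.request_url))
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    item = %{
      title: document |> Floki.find("h1") |> Floki.text(),
      category: document |> Floki.find(".category") |> Floki.text()
    }

    %Crawly.ParsedItem{requests: requests, items: [item]}
  end

  # Detail page: a different item structure (author, price).
  defp parse_detail(response) do
    {:ok, document} = Floki.parse_document(response.body)

    item = %{
      author: document |> Floki.find(".author") |> Floki.text(),
      price: document |> Floki.find(".price") |> Floki.text()
    }

    %Crawly.ParsedItem{requests: [], items: [item]}
  end
end
```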

For the Python part, my demo code looks like the attached image.

ziyouchutuwenwu avatar Jul 30 '21 10:07 ziyouchutuwenwu

So... Do you have different items on different pages? Or the same data, just structured differently?

oltarasenko avatar Jul 30 '21 16:07 oltarasenko

Yes, basically I have different data structures on different pages, but from the sample code I don't know how to write this. I'd appreciate some examples that could help me.

ziyouchutuwenwu avatar Jul 31 '21 00:07 ziyouchutuwenwu

Sorry, I still don't understand which of these two it is:

  1. Same item which can be extracted with other selectors
  2. Two different items

oltarasenko avatar Aug 01 '21 15:08 oltarasenko

Sorry @ziyouchutuwenwu, I only just saw this; I must have missed the ping.

Parsers are meant for commonly used logic that you want to reuse across spiders. A parser is simply a Pipeline module, with the result of each Parser being passed to the next. The opts argument (the third positional argument) lets you pass spider-specific configuration to your parser.

For example, on site 1 you want to extract all links matching an h1 selector, filter them with a site-specific filter function, and build requests from the extracted links:

# spider 1
parsers: [
  {MyCustomRequestParser, [selector: ".h1", filter: &my_filter_function/1]}
]

Then, in spider 2, which crawls site 2, we only want h2 tags, without any filtering:

# spider 2
parsers: [
  {MyCustomRequestParser, [selector: ".h2"]}
]

Then your MyCustomRequestParser.run/3 contains the logic required to select and build the requests.
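Such a parser might be sketched as below. This assumes the Crawly.Pipeline contract described above, with run/3 receiving the response and a state map and returning both; exactly where extracted requests accumulate in the state can differ between Crawly versions, so treat the state handling (and the Floki-based extraction) as an assumption to check against the docs:

```elixir
# Hedged sketch of a reusable request-extracting parser.
defmodule MyCustomRequestParser do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(response, state, opts \\ []) do
    # Spider-specific configuration comes in via opts.
    selector = Keyword.fetch!(opts, :selector)
    filter = Keyword.get(opts, :filter, fn _url -> true end)

    {:ok, document} = Floki.parse_document(response.body)

    requests =
      document
      |> Floki.find(selector)
      |> Floki.attribute("href")
      |> Enum.filter(filter)
      |> Enum.map(&Crawly.Utils.build_absolute_url(&1, response.request_url))
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    # Accumulate the built requests for the next parser in the pipeline.
    {response, Map.update(state, :requests, requests, &(requests ++ &1))}
  end
end
```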

Ziinc avatar Sep 07 '21 17:09 Ziinc