Search Page returns empty through scrapyrt only

Open keyiyek opened this issue 3 years ago • 3 comments

(Sorry, I can't find how to label this.) I hope this is the right place to ask.

I created a spider that can scrape a page in an e-commerce site and gather the data on the different items. The spider works fine with specific pages of the site (www.sitedomain/123-item-category), as well as with the search page (www.sitedomain/searchpage?controller?search=keywords+item+to+be+found).

But when I run it through scrapyrt, the specific page works fine while the search page returns 0 items. No errors, just 0 items. This occurs on 2 different sites with 2 different spiders.

Is there something specific to search pages that has to be taken into account when using scrapyrt?

keyiyek avatar Dec 09 '20 20:12 keyiyek

Can you post your spider code? I don't see a way to reproduce it without the spider code. Try to pinpoint the problem so that there is a small code sample of a spider running in raw ScrapyRT (without any middlewares, pipelines and other stuff from your project interfering). This way we can see whether this is a problem on the ScrapyRT side.

pawelmhm avatar Jan 29 '21 12:01 pawelmhm

Yes, sure.

So, my spider, stripped of all other stuff, looks like this:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "minimal"

        def start_requests(self):
            urls = [
                "https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride",
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Each search result is rendered as an <article> element.
            print("Found ", len(response.css("article")), " items")
            for article in response.css("article"):
                # The item name is taken from the title attribute of its image.
                print("Item: ", [article.css("img::attr(title)").get()])

and I set `ROBOTSTXT_OBEY = False`.

when I do

scrapy crawl minimal

I get 20 items in the response, but if I go

curl "http://localhost:9081/crawl.json?spider_name=minimal&url=https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride"

I get 0 items, no error, just 0 items. I wonder if, in some way, it returns the results before the page is completely loaded?

(sorry couldn't get the markup to work correctly)

keyiyek avatar Jan 29 '21 12:01 keyiyek

It seems that when there is an '&' in the URL, scrapyrt splits it right before the '&'.
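A minimal sketch of why that happens and one way around it, assuming ScrapyRT parses the query string of `crawl.json` the way Python's `urllib.parse` does (the endpoint and target URL are the ones from the curl command above; the variable names are only illustrative). The unencoded '&' ends the `url` parameter early, so the spider is sent to the search page without its `s=...` keywords. Percent-encoding the target URL keeps it in one piece.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Values taken from the curl command in this thread.
ENDPOINT = "http://localhost:9081/crawl.json"
TARGET = "https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride"

# Unencoded, as in the curl command: the inner '&' terminates the `url`
# parameter, and `s=ticket+to+ride` becomes a separate request parameter.
raw = f"{ENDPOINT}?spider_name=minimal&url={TARGET}"
raw_params = parse_qs(urlsplit(raw).query)

# Encoded: urlencode percent-escapes '?', '&', '=' and '+' inside TARGET,
# so the whole search URL survives as the value of `url`.
safe = f"{ENDPOINT}?{urlencode({'spider_name': 'minimal', 'url': TARGET})}"
safe_params = parse_qs(urlsplit(safe).query)

print("unencoded url param:", raw_params["url"][0])
print("encoded url param:  ", safe_params["url"][0])
print("request to send:    ", safe)
```

With curl, something like `curl -G "http://localhost:9081/crawl.json" --data-urlencode "spider_name=minimal" --data-urlencode "url=<search URL>"` should produce the same encoded request, since `-G` together with `--data-urlencode` appends the encoded values to the query string.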

Yansuko avatar Feb 03 '22 08:02 Yansuko