exception during scrapy callback marked as queued

Open RajatGoyal opened this issue 10 years ago • 8 comments

Hi. If an exception is raised while parsing a response in Scrapy, the request remains marked as QUEUED and no error is logged in the frontier.

RajatGoyal avatar Sep 07 '15 07:09 RajatGoyal

Good finding, actually. This could happen because of redirects. When a redirect happens, Frontera gets a response object with the last (already redirected) URL and cannot match it with the record in the database. Therefore, it creates a new record and marks it as CRAWLED, while the old one remains QUEUED.

There is a canonical solver mechanism that should return the canonical URL: https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/canonicalsolvers/basic.py

Depending on that, we could also mark the old record as CRAWLED. That still needs to be coded, though; a PR is welcome as usual.
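
To make the redirect case concrete, here is a rough sketch of the idea, not Frontera code. `ToyStates`, `fingerprint`, and the method names are hypothetical, and hashing the URL with SHA-1 is only an assumption about how records are keyed. In a real crawl the chain would come from the `redirect_urls` meta key that Scrapy's redirect middleware fills in.

```python
import hashlib

QUEUED, CRAWLED = "QUEUED", "CRAWLED"


def fingerprint(url):
    # Stand-in for the backend's record key; SHA-1 of the URL is an assumption.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class ToyStates:
    """In-memory stand-in for the backend's state table (hypothetical)."""

    def __init__(self):
        self.states = {}

    def schedule(self, url):
        self.states[fingerprint(url)] = QUEUED

    def page_crawled(self, final_url, redirect_urls=()):
        # The response carries the final (post-redirect) URL, so it ends up
        # in its own record.
        self.states[fingerprint(final_url)] = CRAWLED
        # Proposed fix: also close every queued URL from the redirect chain.
        for url in redirect_urls:
            fp = fingerprint(url)
            if self.states.get(fp) == QUEUED:
                self.states[fp] = CRAWLED


states = ToyStates()
states.schedule("http://example.com/old")

# Without the chain, the original record stays QUEUED.
states.page_crawled("http://example.com/new")
state_without_chain = states.states[fingerprint("http://example.com/old")]

# With the chain, the original record is closed as well.
states.page_crawled("http://example.com/new",
                    redirect_urls=["http://example.com/old"])
state_with_chain = states.states[fingerprint("http://example.com/old")]
```

Whether the old record should end up as CRAWLED or get a separate "redirected" marker is a design choice this sketch does not settle.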

sibiryakov avatar Sep 07 '15 09:09 sibiryakov

This happens in every case, not just for redirects. To test it, I wrote a simple spider and ran it with the SQLAlchemy backend:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # Fail on every response to reproduce the issue.
        raise Exception("Test Exception")

If I run this, the error caused by the exception is not recorded in the database.

RajatGoyal avatar Sep 07 '15 09:09 RajatGoyal

This is again a good finding, @RajatGoyal! We could solve that by handling exceptions in the spider middleware and propagating them to the backend. If you could make a PR, that would be awesome!
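
A minimal sketch of what that could look like, assuming a backend that accepts error reports. Only `process_spider_exception(response, exception, spider)` is an actual Scrapy spider middleware hook; `RecordingBackend`, its `response_error` method, and `FakeResponse` are made up for illustration and are not part of Frontera.

```python
class RecordingBackend:
    """Hypothetical stand-in for a frontier backend that stores errors."""

    def __init__(self):
        self.errors = []

    def response_error(self, url, error):
        self.errors.append((url, error))


class CallbackErrorMiddleware:
    """Spider middleware sketch that reports callback exceptions."""

    def __init__(self, backend):
        self.backend = backend

    def process_spider_exception(self, response, exception, spider):
        # Scrapy calls this hook when a spider callback raises.
        self.backend.response_error(response.url, str(exception))
        # Returning None lets Scrapy keep handling the exception as usual.
        return None


class FakeResponse:
    """Minimal response object so the sketch runs without Scrapy."""

    def __init__(self, url):
        self.url = url


backend = RecordingBackend()
middleware = CallbackErrorMiddleware(backend)
result = middleware.process_spider_exception(
    FakeResponse("http://www.example.com/"),
    Exception("Test Exception"),
    spider=None,
)
```

How the middleware gets hold of the backend is left open here; the constructor argument is just the simplest way to show the flow.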

sibiryakov avatar Sep 07 '15 10:09 sibiryakov

Any idea how to propagate it to the backend? We can't get the manager from the spider middleware.

RajatGoyal avatar Sep 07 '15 10:09 RajatGoyal

We need to adapt the interfaces in FronteraManagerWrapper, FronteraManager and Backend. I think we need to propagate the error type (an error that happened during response processing) and the response itself, along with the error structure.
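
Roughly, the chain could look like the sketch below. The `Toy*` classes only mirror the three layers named above; the `response_error` method, the `CallbackError` structure, and the "ERROR" state string are all assumptions for illustration, not existing Frontera interfaces.

```python
from dataclasses import dataclass


@dataclass
class CallbackError:
    # Hypothetical error structure: where it failed and what was raised.
    stage: str
    exc_type: str
    message: str


class FakeResponse:
    """Minimal response object so the sketch runs on its own."""

    def __init__(self, url):
        self.url = url


class ToyBackend:
    def __init__(self):
        self.states = {}
        self.errors = {}

    def response_error(self, response, error):
        # Move the record out of QUEUED and keep the error details.
        self.states[response.url] = "ERROR"
        self.errors[response.url] = error


class ToyManager:
    def __init__(self, backend):
        self.backend = backend

    def response_error(self, response, error):
        self.backend.response_error(response, error)


class ToyManagerWrapper:
    def __init__(self, manager):
        self.manager = manager

    def response_error(self, response, exception):
        # Convert the raw exception into the structure passed downstream.
        error = CallbackError(
            stage="response_processing",
            exc_type=type(exception).__name__,
            message=str(exception),
        )
        self.manager.response_error(response, error)


backend = ToyBackend()
wrapper = ToyManagerWrapper(ToyManager(backend))
wrapper.response_error(FakeResponse("http://www.example.com/"),
                       ValueError("Test Exception"))
```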

sibiryakov avatar Sep 07 '15 11:09 sibiryakov

I have fixed this, but I don't have write permission to push a branch right now; the push fails with a 403 response.

RajatGoyal avatar Sep 07 '15 14:09 RajatGoyal

Can you fork Frontera to your own account and push your local branch there?

sibiryakov avatar Sep 07 '15 14:09 sibiryakov

@sibiryakov Take a look at the above branch.

RajatGoyal avatar Sep 07 '15 16:09 RajatGoyal