FronteraScheduler._request_is_redirected looks suspicious

Open kmike opened this issue 8 years ago • 4 comments

See https://github.com/scrapinghub/frontera/blob/d91e05631688815f7255ae29f2bfe095621f9540/frontera/contrib/scrapy/schedulers/frontier.py#L169:

    def _request_is_redirected(self, request):
        return request.meta.get(b'redirect_times', 0) > 0

^^ in Python 2 this is the same as checking for 'redirect_times' key, but in Python 3 b'redirect_times' != 'redirect_times', so this condition won't work with Scrapy's redirect middlewares in Python 3.
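The mismatch can be reproduced without Scrapy. This is a minimal sketch that assumes a plain dict is an adequate stand-in for `request.meta`; the two helper function names are hypothetical and exist only for this illustration.

```python
# A plain dict stands in for scrapy.Request.meta (assumption for this sketch).
# The key is a native str, as Scrapy's redirect middlewares set it.
meta = {'redirect_times': 1}


def is_redirected_bytes_key(meta):
    # Mirrors the current check: looks the value up under a bytes key.
    return meta.get(b'redirect_times', 0) > 0


def is_redirected_native_key(meta):
    # Looks the value up under a native str key instead.
    return meta.get('redirect_times', 0) > 0


# Under Python 3 the bytes key never matches the str key, so the first
# lookup falls back to the default of 0.
bytes_result = is_redirected_bytes_key(meta)
native_result = is_redirected_native_key(meta)
print(bytes_result, native_result)
```

Under Python 2 both lookups would hit the same key, which is why the problem only shows up after porting.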

Tests are not catching that because in scheduler tests scrapy.Request objects are created with bytes keys in meta: https://github.com/scrapinghub/frontera/blob/d91e05631688815f7255ae29f2bfe095621f9540/tests/test_frontera_scheduler.py#L20. Scrapy doesn't create such requests: it always uses native strings (str, i.e. bytes in Python 2 and unicode in Python 3) as meta keys.

kmike avatar Jan 19 '17 20:01 kmike

Also look at https://github.com/scrapinghub/frontera/issues/211

voith avatar Jan 19 '17 21:01 voith

@voith yeah, that's what I was looking at right now :)

kmike avatar Jan 19 '17 21:01 kmike

@kmike I'm not sure the code using that method is executed https://github.com/scrapinghub/frontera/blob/d91e05631688815f7255ae29f2bfe095621f9540/frontera/contrib/scrapy/schedulers/frontier.py#L92

and honestly I don't see much sense in all this. As far as I can see, this check is executed only if

  1. there is consumption of start_requests iterable in scrapy spider,
  2. some downloader middleware throws Request when processing output.

and then there is a check for redirection and, depending on the result, all the manager/backend machinery is bypassed. Looks like an artefact left over from early alpha versions.

sibiryakov avatar Jan 23 '17 12:01 sibiryakov

I got bitten by this bug. When redirect is enabled, the symptom is that Frontera gets the first seed request, receives the redirect, then adds the redirected URL as a seed again, and nothing happens afterwards.

I haven't verified this, but I think the Scrapy manager handles Scrapy's Request objects, and since Scrapy sets redirect_times, the key is a native string, as @kmike pointed out.

I changed the line to request.meta.get('redirect_times') > 0 and it's working fine.
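One caveat with dropping the default, sketched below with a plain dict standing in for `request.meta` (an assumption; `request_is_redirected` is a hypothetical standalone version of the method): in Python 3, `.get('redirect_times')` returns `None` for a request that was never redirected, and comparing `None > 0` raises `TypeError`. Keeping the default of 0 from the original line avoids that.

```python
def request_is_redirected(meta):
    # Native str key plus a default of 0, so requests that were never
    # redirected compare cleanly instead of yielding None.
    return meta.get('redirect_times', 0) > 0


# Without the default, a missing key yields None, and None > 0 is a
# TypeError in Python 3 (Python 2 allowed the comparison).
try:
    {}.get('redirect_times') > 0
    raised = False
except TypeError:
    raised = True

print(raised, request_is_redirected({}), request_is_redirected({'redirect_times': 2}))
```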

rmax avatar Feb 22 '17 13:02 rmax