frontera FronteraScheduler._request_is_redirected looks suspicious

See https://github.com/scrapinghub/frontera/blob/d91e05631688815f7255ae29f2bfe095621f9540/frontera/contrib/scrapy/schedulers/frontier.py#L169:

    def _request_is_redirected(self, request):
        return request.meta.get(b'redirect_times', 0) > 0

^^ in Python 2 this is the same as checking for 'redirect_times' key, but in Python 3 b'redirect_times' != 'redirect_times', so this condition won't work with Scrapy's redirect middlewares in Python 3.

Tests are not catching that because in scheduler tests scrapy.Request objects are created with bytes keys in meta: https://github.com/scrapinghub/frontera/blob/d91e05631688815f7255ae29f2bfe095621f9540/tests/test_frontera_scheduler.py#L20. Scrapy doesn't create such requests: it always uses native strings (str, i.e. bytes in Python 2 and unicode in Python 3) as meta keys.
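
A minimal sketch of the mismatch described above, assuming Python 3. Plain dicts stand in for request.meta, and the helper names (request_is_redirected_bytes, request_is_redirected_str) are hypothetical, not part of Frontera or Scrapy:

```python
# Plain dicts stand in for request.meta, so no Scrapy import is needed.
meta_from_scrapy = {'redirect_times': 1}   # native str key, as a redirect middleware would set it
meta_from_tests = {b'redirect_times': 1}   # bytes key, as in the scheduler tests


def request_is_redirected_bytes(meta):
    # Same lookup as the current _request_is_redirected method.
    return meta.get(b'redirect_times', 0) > 0


def request_is_redirected_str(meta):
    # Hypothetical variant that uses a native string key instead.
    return meta.get('redirect_times', 0) > 0


if __name__ == '__main__':
    # On Python 3 the bytes lookup misses the str key and falls back to 0.
    print(request_is_redirected_bytes(meta_from_scrapy))
    print(request_is_redirected_bytes(meta_from_tests))
    print(request_is_redirected_str(meta_from_scrapy))
```

On Python 3 the bytes lookup only succeeds for the test-style meta, which would explain why the tests pass while real redirected requests are not detected.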

Jan 19 '17 20:01 kmike

also look at https://github.com/scrapinghub/frontera/issues/211

Jan 19 '17 21:01 voith

@voith yeah, that's what I was looking at right now :)

Jan 19 '17 21:01 kmike

@kmike I'm not sure the code using that method is executed https://github.com/scrapinghub/frontera/blob/d91e05631688815f7255ae29f2bfe095621f9540/frontera/contrib/scrapy/schedulers/frontier.py#L92

and honestly I don't see much sense in all this. As far as I can see, this check is executed only if

- there is consumption of the start_requests iterable in the Scrapy spider,
- some downloader middleware throws a Request when processing output.

and then there is a check for redirection and, depending on it, all the manager/backend machinery is bypassed. Looks like an artefact left over from early alpha versions.

Jan 23 '17 12:01 sibiryakov

I got bitten by this bug. When redirect is enabled, the symptom is that Frontera gets the first seed request, receives the redirect, then adds the redirected URL as a seed again, and nothing happens afterwards.

I haven't verified this, but I think the Scrapy manager handles Scrapy's Request objects, and since Scrapy sets redirect_times, the key is a native string, as @kmike pointed out.

I changed the line to request.meta.get('redirect_times') > 0 and it's working fine.
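
One caveat with the line as written above: on Python 3, comparing None with an int raises TypeError, so dropping the default could fail for requests that have no redirect_times key at all. A minimal sketch that keeps the default of 0 from the original method, with a plain dict standing in for request.meta and a hypothetical standalone function name:

```python
def request_is_redirected(meta):
    # Native string key; the default of 0 avoids comparing None with an int
    # on Python 3 when the key is absent.
    return meta.get('redirect_times', 0) > 0


if __name__ == '__main__':
    print(request_is_redirected({}))                     # no redirect recorded
    print(request_is_redirected({'redirect_times': 2}))  # redirected twice
```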

Feb 22 '17 13:02 rmax