FronteraScheduler._request_is_redirected looks suspicious
See https://github.com/scrapinghub/frontera/blob/d91e05631688815f7255ae29f2bfe095621f9540/frontera/contrib/scrapy/schedulers/frontier.py#L169:
def _request_is_redirected(self, request):
    return request.meta.get(b'redirect_times', 0) > 0
^^ In Python 2 this is the same as checking for the 'redirect_times' key, but in Python 3 b'redirect_times' != 'redirect_times', so this condition won't work with Scrapy's redirect middlewares in Python 3.
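A minimal illustration of the mismatch under Python 3 (the meta dict here stands in for what Scrapy's RedirectMiddleware actually sets):

meta = {'redirect_times': 1}  # Scrapy stores the counter under a native-str key
meta.get(b'redirect_times', 0)  # -> 0 in Python 3: the bytes key never matches
meta.get('redirect_times', 0)   # -> 1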
Tests are not catching this because the scheduler tests create scrapy.Request objects with bytes keys in meta: https://github.com/scrapinghub/frontera/blob/d91e05631688815f7255ae29f2bfe095621f9540/tests/test_frontera_scheduler.py#L20. Scrapy doesn't create such requests: it always uses native strings (str, i.e. bytes in Python 2 and unicode in Python 3) as meta keys.
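A sketch of how the test could construct the request instead (the URL is a placeholder; the point is the native-str meta key, matching what Scrapy itself produces):

from scrapy import Request

# native-str meta key, as Scrapy's middlewares actually set it
r = Request('http://example.com', meta={'redirect_times': 1})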
also look at https://github.com/scrapinghub/frontera/issues/211
@voith yeah, that's what I'm looking at right now :)
@kmike I'm not sure the code using that method is actually executed: https://github.com/scrapinghub/frontera/blob/d91e05631688815f7255ae29f2bfe095621f9540/frontera/contrib/scrapy/schedulers/frontier.py#L92
And honestly I don't see much sense in all this. As far as I can see, this check is executed only if:
- the start_requests iterable is being consumed in the Scrapy spider, or
- some downloader middleware returns a Request when processing output.
Then there is a check for redirection and, depending on it, all the manager/backend machinery is bypassed (a paraphrased sketch follows below). Looks like an artefact left over from early alpha versions.
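For context, here is a roughly paraphrased sketch of that call site; the names follow the linked frontier.py, but treat it as a sketch rather than the exact source:

def enqueue_request(self, request):
    if not self._request_is_redirected(request):
        # fresh request: hand it to the frontier backend as a new seed
        self.frontier.add_seeds([request])
        return True
    elif self.redirect_enabled:
        # redirected request: bypass the manager/backend machinery
        # and schedule it directly
        self._add_pending_request(request)
        return True
    return False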
I got bitten by this bug. When redirect is enabled, the symptom is that Frontera gets the first seed request, receives the redirect, then adds the redirected URL as a seed again, and nothing happens afterwards.
I haven't verified, but I think the Scrapy manager handles Scrapy's Request objects, and since Scrapy itself sets redirect_times, the key is a native string, as @kmike pointed out.
I changed the line to request.meta.get('redirect_times', 0) > 0 (keeping the 0 default, otherwise a missing key would give None > 0 and raise a TypeError in Python 3) and it's working fine.
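For completeness, the patched method then reads:

def _request_is_redirected(self, request):
    # native-str key, matching Scrapy's RedirectMiddleware; the 0 default
    # keeps requests without the key from raising a TypeError in Python 3
    return request.meta.get('redirect_times', 0) > 0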