aiohttp icon indicating copy to clipboard operation
aiohttp copied to clipboard

Redirect not been followed

Open delijati opened this issue 8 years ago • 16 comments

Long story short

Redirect not been followed. I always get a 302 back.

Expected behaviour

Get a 200 back with text filled.

Steps to reproduce

import aiohttp
import asyncio


async def fetch(session, url):
    async with session.get(url, allow_redirects=True) as response:
        return await response.text(), response.status


async def fetch_all(session, urls, loop):
    results = await asyncio.gather(
        *[fetch(session, url) for url in urls],
    )

    for ret in results:
        print("status: %s len: %s" % (ret[1], len(ret[0])))
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    urls = ['https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python']
    with aiohttp.ClientSession(loop=loop) as session:
        loop.run_until_complete(fetch_all(session, urls, loop))

delijati avatar Nov 10 '17 11:11 delijati

Try curl -v -L 'https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python' - it produce redirection loop. It looks like not a aiohttp, but server bug.

kxepal avatar Nov 10 '17 11:11 kxepal

with requests

In [1]: import requests
In [2]: res = requests.get("https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python")
In [3]: res.status_code
Out[3]: 200
In [4]: len(res.text)
Out[4]: 171410

delijati avatar Nov 10 '17 11:11 delijati

Not sure what requests does, but even browsers reports about improper redirection caused by this URL. I'm sure we can investigate what the request does to handle it - it's very interesting how -, but I guess it's not our problem if reference HTTP client and browsers tells that there is a redirection issue.

kxepal avatar Nov 10 '17 11:11 kxepal

Narrowed it down to setting cookies. http.cookies.__parse_string aka http.cookies.SimpleCookie.load can't parse this cookie [1] str. requests is using http.client.parse_headers.

Possible solutions:

  • fix http.cookies.__parse_string tested to be broken in 3.5, 3.6
  • replace SimpleCookie with a patched version
  • None of above as the cookie is broken and http.client.parse_headers

[1] JSESSION_ID_PROFILE_APP=A3C47AF767CCC88FA36B8292A6FB94A3; Path=/; HttpOnly, gp_intern=(null);path=/;Secure, gp_intern=(null);path=/;Secure

delijati avatar Nov 10 '17 15:11 delijati

Nice found! So the problem looks like that aiohttp doesn't set / preserve cookies during redirects and that site sets them though thme (horrible solution). But at least now it's clear why happens what we see with curl. Would you like to submit PR?

kxepal avatar Nov 10 '17 17:11 kxepal

Not sure if we should fix it. gp_intern is not standard property for cookies, http.cookies raises an exception for it. Actually the property name is HttpOnly, gp_intern and value is (null). Another property is Secure, gp_intern with (null) value. For me it looks like a garbage. Multiple Path properties are not supported by standard too.

asvetlov avatar Nov 10 '17 18:11 asvetlov

@asvetlov For me I see it as a buggy server which forces to set cookies via redirect to pass you through. Honestly, it's horrible solution, but technically it may be correct.

kxepal avatar Nov 10 '17 18:11 kxepal

The site is protected from usage by correctly implemented client libraries. Well, @delijati could pass allow_redirects=False to session.get() and process redirections on his own.

Honestly I have no idea where and when we will stop If we try process every buggy quirk out of the box.

asvetlov avatar Nov 10 '17 18:11 asvetlov

@asvetlov I'm kind of unsure what would be right solution. Fixing it feels like a ie6 css quirks but any browser I tested sets the the cookie.

Tested with Node:

var request = require('request');

const url = 'https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python';

request(url, function(error, response, body) {
    console.log('error:', error); // Print the error if one occurred
    console.log('statusCode:', response && response.statusCode); // Print the response status code if a response was received
    console.log('body:', body); // Print the HTML for the Google homepage.
});

Error:

(node:22299) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 pipe listeners added. Use emitter.setMaxListeners() to increase limit
error: Error: Exceeded maxRedirects. Probably stuck in a redirect loop https://www.gulp.de/gulp2/projekte/suche?scope=projects&query=python
    at Redirect.onResponse (/opt/develop/privat/mobile/death/node_modules/request/lib/redirect.js:98:27)
    at Request.onRequestResponse (/opt/develop/privat/mobile/death/node_modules/request/request.js:996:22)
    at emitOne (events.js:96:13)
    at ClientRequest.emit (events.js:188:7)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:473:21)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:99:23)
    at TLSSocket.socketOnData (_http_client.js:362:20)
    at emitOne (events.js:96:13)
    at TLSSocket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:176:18)
statusCode: undefined
body: undefined

delijati avatar Nov 10 '17 19:11 delijati

I went a bit deeper the rabbit hole. aiohttp uses SimpleCookie which implements the spec rfc2109. requests uses Cookie the newer rfc2965 witch obsolets spec rfc2109.

Firefox on the other side implements something custom RFC2109/2616

So long story short aiohttp should use the newer Cookie implementation.

delijati avatar Nov 11 '17 15:11 delijati

Thanks for shading a light on the problem.

aiohttp cannot just use http.cookiejar.Cookie -- it has incompatible interface with http.cookies.SimpleCookie. rfc2965 is obsoleted by RFC6265 BTW. The RFC explicitly enumerates all allowed cookie properties (and header name is Set-Cookie, Set-Cookie2 is obsolete). The list of properties is fixed. For SameSite where is a new RFC draft but it is not accepted yet.

asvetlov avatar Nov 13 '17 17:11 asvetlov

Sorry, the RFC specifies extension-av rule as extension-av = <any CHAR except CTLs or ";">.

Ideas?

asvetlov avatar Nov 13 '17 17:11 asvetlov

SimpleCookie.__parsestring is just too strict: it also fails to handle expiration dates incorrectly (parses them based on the sane-cookie-date only), see https://bugs.python.org/issue35824#msg334392. (this means that aiohttp is affected by this (https://github.com/aio-libs/aiohttp/blob/9504fe2affaaff673fa4f3754c1c44221f8ba47d/aiohttp/cookiejar.py#L175), i.e. it cannot handle the following cookie from GitHub: Set-Cookie: has_recent_activity=1; path=/; expires=Mon, 22 Apr 2019 23:27:18 -0000 (due to "-0000" instead of "GMT" at the end).

The interface for

aiohttp cannot just use http.cookiejar.Cookie -- it has incompatible interface with http.cookies.SimpleCookie.

But couldn't it be changed to be based on http.cookies.CookieJar (similar to how requests does it)?

Another/simpler solution might be to pass an adjusted patt to __parsestring (using it directly).

Currently aiohttp fails to handle 3 out of 4 cookies sent by github.com:

   5     async def main():
   6         cjar = aiohttp.CookieJar()
   7         async with aiohttp.ClientSession(cookie_jar=cjar) as session:
   8             async with session.get('https://github.com') as response:
   9                 pass
  10  ->     __import__('pdb').set_trace()
 return None
(Pdb++) pp response.headers.getall("set-cookie")
['has_recent_activity=1; path=/; expires=Fri, 03 May 2019 00:10:35 -0000',
 '_octo=GH1.1.393119700.1556838635; domain=.github.com; path=/; expires=Sun, 02 May 2021 23:10:35 -0000',
 'logged_in=no; domain=.github.com; path=/; expires=Mon, 02 May 2039 23:10:35 -0000; secure; HttpOnly',
 '_gh_sess=NXXXf; path=/; secure; HttpOnly']
(Pdb++) pp len(response.headers.getall("set-cookie"))
4
(Pdb++) pp list(cjar)
[{'comment': '',
  'domain': 'github.com',
  'expires': '',
  'httponly': True,
  'max-age': '',
  'path': '/',
  'samesite': '',
  'secure': True,
  'version': ''}]
(Pdb++) pp len(list(cjar))
1

blueyed avatar May 02 '19 23:05 blueyed

@asvetlov For me I see it as a buggy server which forces to set cookies via redirect to pass you through. Honestly, it's horrible solution, but technically it may be correct.

From Section 3 in RFC6265:

[...] User agents MAY ignore Set-Cookie headers contained in responses with 100-level status codes but MUST process Set-Cookie headers contained in other responses [...]

I'm not sure if that server is/was buggy, but the client implementation is broken w.r.t. that RFC when it ignores cookies set in HTTP 3xx responses. On the other hand, the aiohttp documentation states:

Response cookies contain only values, that were in Set-Cookie headers of the last request in redirection chain.

I found it surprising that cookies set on redirection (HTTP 302 where I've tried) are discarded by aiohttp, though I'm always using a ClientSession instance. I still don't know the reason but now the allow_redirects need to be always False for a scraper I'm working in. For now I'm applying the consecutive requests one by one as a workaround (I'm using aiohttp 3.5.4, Python 3.7.3).

danilobellini avatar May 21 '19 21:05 danilobellini

If somebody wants to propose a pull request with the fix -- I'll be happy to review it.

asvetlov avatar May 22 '19 07:05 asvetlov

http.cookiejar.parse_ns_headers might be a good solution here as it works correctly

https://github.com/aio-libs/aiohttp/pull/11106#issuecomment-2927246152

bdraco avatar Jun 01 '25 14:06 bdraco