
aiohttp client throws HTTP errors for the following redirect

wumpus opened this issue 7 years ago • 22 comments

Long story short

I have been fetching the front pages of millions of websites using aiohttp, and collected a large number of cases where aiohttp client's http parser throws errors for stuff that browsers appear to think is fine. Some of these are real bugs in aiohttp's parser, others might be places where browsers do not obey the standard, and aiohttp might want to be more forgiving.

Here's an initial bug to see if you'd like me to do more triage on these.

Here is a 302 redirect that seems to work fine in curl and Firefox but aiohttp's http parser pukes on it.

Expected behaviour

$ curl http://lund.se/robots.txt -D /dev/tty
HTTP/1.1 302 Object Moved
Date: Tue, 26 Dec 2017 05:47:50 GMT
Connection: Keep-Alive
Content-Length: 0
Location: https://lund.se/robots.txt

Note Content-Length: 0.

If I tell curl to follow the redirect:

$ curl -L http://lund.se/robots.txt -D /dev/tty

that works and I see the actual https robots.txt file. My browsers also follow this redirect.

Actual behaviour

bug.py throws:

aiohttp.client_exceptions.ClientResponseError: 400, message='invalid constant string'

Steps to reproduce

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://lund.se/robots.txt')
        print(html)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

$ python bug.py

More examples

Since I'm crawling a lot of terrible websites, I have an easy ability to find more examples.

Other messages

OK in FireFox -- 301 to their frontpage. curl also follows this redir:
ClientPayloadError("400, message='Can not decode content-encoding: gzip'",) http://www.fusioncashsurveys.com/robots.txt

OK in FireFox -- 200 and a 5 line robots.txt:
(notice also that the message appears truncated?)
ClientResponseError("400, message='deflate'",) http://www.raundaz.com/robots.txt

301 OK in FireFox and curl: (at least the message isn't truncated this time)
ClientPayloadError("400, message='Can not decode content-encoding: deflate'",) http://www.labfortraining.it/

OK in FireFox, indeed it has a huge Content-Security-Policy header (see the sketch after this list):
ClientResponseError("400, message='Got more than 8190 bytes when reading Header value is too long.'",) http://www.dakotabox.es/robots.txt

OK in FireFox, I suppose 'N/A:' is an invalid response header name
ClientResponseError("400, message='invalid character in header'",) http://www.charteroak.edu/robots.txt

Bad in FireFox, too, not a bug
ClientResponseError("400, message='invalid HTTP version'",) http://www.dgchangan.com/robots.txt

OK in FireFox, I see 2 Content-Length headers:
ClientResponseError("400, message='unexpected content-length header'",) http://www.ao30free.com/robots.txt

200, OK in FireFox, content-length looks OK to me
ClientResponseError("400, message='unexpected content-length header'",) http://www.bookfeeder.com/

Your environment

aiohttp 2.3.6 CLIENT Python 3.6.4 Linux (CentOS 7.4.1708)

wumpus avatar Dec 26 '17 07:12 wumpus

Thanks for report

asvetlov avatar Dec 30 '17 14:12 asvetlov

I am also getting this error: "aiohttp.client_exceptions.ClientResponseError: 400, message='unexpected content-length header'". A snippet of the code:

import aiohttp
import asyncio
import async_timeout

async def fetch(session, url):
    with async_timeout.timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def main():
    url = 'http://example.com'  # placeholder; the original URL was not shared
    headers = {}
    headers['Authorization'] = 'Basic xxxxxxx==='
    headers['Content-Type'] = 'application/x-www-form-urlencoded'
    headers['header1'] = 'somevalue'
    async with aiohttp.ClientSession(headers=headers) as session:
        for i in range(100):
            html = await fetch(session, url)
            print("\n", html)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

koulVipin avatar Jan 24 '18 10:01 koulVipin

Most likely the server responds with at least two Content-Length headers, so the response is invalid.
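
One way to confirm is to bypass HTTP libraries entirely and dump the raw header block. A minimal sketch with plain sockets (the host below is a placeholder for the misbehaving server):

import socket

HOST = 'www.example.com'  # placeholder host

# Issue a bare HTTP/1.1 request and read until the server closes.
with socket.create_connection((HOST, 80), timeout=10) as sock:
    sock.sendall(b'GET / HTTP/1.1\r\nHost: ' + HOST.encode('ascii')
                 + b'\r\nConnection: close\r\n\r\n')
    raw = b''
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        raw += chunk

# Everything before the first blank line is the header block.
headers = raw.split(b'\r\n\r\n', 1)[0].decode('latin-1')
print(headers)
print('Content-Length headers:', headers.lower().count('\ncontent-length:'))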

asvetlov avatar Jan 24 '18 10:01 asvetlov

Thanks to Postel's Law, many webservers emit invalid HTTP. This is similar to many webpages being invalid HTML, yet browsers display those pages anyway. The HTML5 standard now specifies how everyone is supposed to treat broken HTML; no such standard exists for broken HTTP.

I'd like to know (1) are you interested in fixing this to work like browsers or (2) will you take patches that fix it to work like browsers or (3) aiohttp is a thing of beauty which perfectly implements the standard :-)

For (1) I can provide a large number of test cases, and help triage them. For (2) I can write patches for the things which are most common in my web crawls. For (3) I will admire your idealism.

wumpus avatar Jan 28 '18 04:01 wumpus

I definitely prefer option (2), but let's discuss fixes case by case. Sorry, I'm not very motivated to fix weird cases myself (at least while they don't hurt me at my job, for example), but I'm open to reviewing and accepting patches to improve the situation.

asvetlov avatar Jan 28 '18 08:01 asvetlov

ClientResponseError("400, message='invalid character in header'",) http://www.charteroak.edu/robots.txt```
Do you have workaround to this error? I do not need to read header only body

iho avatar Feb 12 '18 12:02 iho

@iho just install aiohttp 3.0

asvetlov avatar Feb 12 '18 13:02 asvetlov

pip install aiohttp==3.0
Requirement already satisfied: aiohttp==3.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages
Requirement already satisfied: idna-ssl>=1.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from aiohttp==3.0)
Requirement already satisfied: multidict<5.0,>=4.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from aiohttp==3.0)
Requirement already satisfied: chardet<4.0,>=2.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from aiohttp==3.0)
Requirement already satisfied: async-timeout<2.0,>=1.2 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from aiohttp==3.0)
Requirement already satisfied: attrs>=17.4.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from aiohttp==3.0)
Requirement already satisfied: yarl<2.0,>=1.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from aiohttp==3.0)
Requirement already satisfied: idna>=2.0 in /home/user/.pyenv/versions/3.6.4/envs/flyp/lib/python3.6/site-packages (from idna-ssl>=1.0->aiohttp==3.0)

Example of code

import aiohttp
import asyncio

async def main():
    url = 'https://flyp.me/api/v1/order/create'

    data = {
      "order": {
      "from_currency": "LTC",
      "to_currency": "ZEC",
      "ordered_amount": "0.01",
      "destination": "t1SBTywpsDMKndjogkXhZZSKdVbhadt3rVt"
      }
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=data) as response:
            print(await response.text())

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

Traceback

Traceback (most recent call last):
  File "/home/user/.pyenv/versions/flyp/lib/python3.6/site-packages/aiohttp/client_reqrep.py", line 678, in start
    (message, payload) = await self._protocol.read()
  File "/home/user/.pyenv/versions/flyp/lib/python3.6/site-packages/aiohttp/streams.py", line 533, in read
    await self._waiter
  File "/home/user/.pyenv/versions/flyp/lib/python3.6/site-packages/aiohttp/client_proto.py", line 161, in data_received
    messages, upgraded, tail = self._parser.feed_data(data)
  File "aiohttp\_http_parser.pyx", line 295, in aiohttp._http_parser.HttpParser.feed_data
aiohttp.http_exceptions.BadHttpMessage: 400, message='invalid character in header'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "example.py", line 20, in <module>
    loop.run_until_complete(main())
  File "/home/user/.pyenv/versions/3.6.4/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
    return future.result()
  File "example.py", line 16, in main
    async with session.post(url, json=data) as response:
  File "/home/user/.pyenv/versions/flyp/lib/python3.6/site-packages/aiohttp/client.py", line 779, in __aenter__
    self._resp = await self._coro
  File "/home/user/.pyenv/versions/flyp/lib/python3.6/site-packages/aiohttp/client.py", line 331, in _request
    await resp.start(conn, read_until_eof)
  File "/home/user/.pyenv/versions/flyp/lib/python3.6/site-packages/aiohttp/client_reqrep.py", line 683, in start
    message=exc.message, headers=exc.headers) from exc
aiohttp.client_exceptions.ClientResponseError: 400, message='invalid character in header'

iho avatar Feb 12 '18 14:02 iho

The problem is in the upstream Node.js HTTP parser used to parse the response. Setting the AIOHTTP_NO_EXTENSIONS environment variable disables the fast C parser; the pure Python fallback processes the response correctly.
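
For reference, the variable has to be set before aiohttp is imported, because the parser implementation is selected at import time. Either export it in the shell (AIOHTTP_NO_EXTENSIONS=1 python bug.py) or set it at the top of the script:

import os

# Must happen before the aiohttp import: the choice between the C
# extension parser and the pure Python one is made at load time.
os.environ['AIOHTTP_NO_EXTENSIONS'] = '1'

import aiohttp  # now uses the pure Python HTTP parser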

asvetlov avatar Feb 12 '18 15:02 asvetlov

Worth upgrading the vendored lib.

webknjaz avatar Feb 12 '18 15:02 webknjaz

@asvetlov thank you!

iho avatar Feb 12 '18 15:02 iho

@webknjaz upstream didn't fix the problem; it only added support for the SOURCE HTTP verb.

asvetlov avatar Feb 12 '18 16:02 asvetlov

Another example of getting ClientResponseError: 400, message='invalid constant string' from a well-behaving (I think) web service, using aiohttp==3.1.0. Code to reproduce:

import asyncio
import aiohttp

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.delete('http://proxy.crawlera.com:8010/session/foo') as response:
            print(repr(response))

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

gives:

Traceback (most recent call last):
  File "venv/lib/python3.6/site-packages/aiohttp/client_reqrep.py", line 695, in start
    (message, payload) = await self._protocol.read()
  File "venv/lib/python3.6/site-packages/aiohttp/streams.py", line 533, in read
    await self._waiter
  File "venv/lib/python3.6/site-packages/aiohttp/client_proto.py", line 161, in data_received
    messages, upgraded, tail = self._parser.feed_data(data)
  File "aiohttp/_http_parser.pyx", line 297, in aiohttp._http_parser.HttpParser.feed_data
aiohttp.http_exceptions.BadHttpMessage: 400, message='invalid constant string'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "t.py", line 12, in <module>
    loop.run_until_complete(main())
  File "/Users/kostia/.pyenv/versions/3.6.4/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
    return future.result()
  File "t.py", line 7, in main
    async with session.delete('http://proxy.crawlera.com:8010/session/foo') as response:
  File "venv/lib/python3.6/site-packages/aiohttp/client.py", line 783, in __aenter__
    self._resp = await self._coro
  File "venv/lib/python3.6/site-packages/aiohttp/client.py", line 333, in _request
    await resp.start(conn, read_until_eof)
  File "venv/lib/python3.6/site-packages/aiohttp/client_reqrep.py", line 700, in start
    message=exc.message, headers=exc.headers) from exc
aiohttp.client_exceptions.ClientResponseError: 400, message='invalid constant string'

The response that gives the error (repr(data) in client_proto.py, line 161) is

b'HTTP/1.1 401 Unauthorized\r\nConnection: close\r\nDate: Mon, 26 Mar 2018 11:15:21 GMT\r\nProxy-Connection: close\r\nTransfer-Encoding: chunked\r\nWWW-Authenticate: Basic realm="Crawlera"\r\nX-Crawlera-Error: bad_auth\r\n\r\n0\r\n\r\n0\r\n\r\n'

and the actual response that I'd like to parse (I can't provide a public repro for it, but it gives the same error) is

b'HTTP/1.1 204 No Content\r\nConnection: close\r\nDate: Mon, 26 Mar 2018 11:08:45 GMT\r\nProxy-Connection: close\r\nTransfer-Encoding: chunked\r\n\r\n0\r\n\r\n0\r\n\r\n'

AIOHTTP_NO_EXTENSIONS=1 also does not help, although the error is different:

Traceback (most recent call last):
  File "venv/lib/python3.6/site-packages/aiohttp/client_reqrep.py", line 695, in start
    (message, payload) = await self._protocol.read()
  File "venv/lib/python3.6/site-packages/aiohttp/streams.py", line 533, in read
    await self._waiter
  File "venv/lib/python3.6/site-packages/aiohttp/client_proto.py", line 162, in data_received
    messages, upgraded, tail = self._parser.feed_data(data)
  File "venv/lib/python3.6/site-packages/aiohttp/http_parser.py", line 142, in feed_data
    msg = self.parse_message(self._lines)
  File "venv/lib/python3.6/site-packages/aiohttp/http_parser.py", line 408, in parse_message
    raise BadStatusLine(line) from None
aiohttp.http_exceptions.BadStatusLine: 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "t.py", line 12, in <module>
    loop.run_until_complete(main())
  File "/Users/kostia/.pyenv/versions/3.6.4/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
    return future.result()
  File "t.py", line 7, in main
    async with session.delete('http://proxy.crawlera.com:8010/session/foo') as response:
  File "venv/lib/python3.6/site-packages/aiohttp/client.py", line 783, in __aenter__
    self._resp = await self._coro
  File "venv/lib/python3.6/site-packages/aiohttp/client.py", line 333, in _request
    await resp.start(conn, read_until_eof)
  File "venv/lib/python3.6/site-packages/aiohttp/client_reqrep.py", line 700, in start
    message=exc.message, headers=exc.headers) from exc
aiohttp.client_exceptions.ClientResponseError: 400, message='Bad Request'

lopuhin avatar Mar 26 '18 11:03 lopuhin

I found that aiohttp behaved much better with Crawlera (at least with some sites) if I avoided the proxy_auth argument and explicitly entered my API key in the URL. For example:

import asyncio
import aiohttp
from yarl import URL

urllist = ['https://google.com', 'https://bing.com']
proxy_api = "54fuj567a43see7uedhd9498c45_APIstringfromCrawlera"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"

proxy = "http://{}@{}:{}/".format(proxy_api, proxy_host, proxy_port)

async def main():
    async with aiohttp.ClientSession() as session:
        jobs = [asyncio.ensure_future(session.get(URL(url), ssl=False, proxy=proxy))
                for url in urllist]
        done_jobs = await asyncio.gather(*jobs)
        for response in done_jobs:
            print(response.status, "status code for", response.url)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

pl77 avatar Oct 24 '18 15:10 pl77

Hi all, have any of you been able to fix this issue? I'm still getting invalid character in header when requesting an endpoint using GET. When I set the AIOHTTP_NO_EXTENSIONS=1 var, I get Invalid HTTP Header: X-XSS-Protection=1;.

I'm using python 3.7, on aiohttp-3.4.4.

Cheers

unl1k3ly avatar Nov 29 '18 16:11 unl1k3ly

@unl1k3ly but the X-XSS-Protection=1; header really is invalid, isn't it?

asvetlov avatar Nov 29 '18 16:11 asvetlov

@asvetlov thanks for the prompt reply, mate. I'm not sure what you mean. The request works with curl and the Python requests module; with aiohttp I get that error as output. In fact, my endpoint returns that HTTP header... Would there be a way to bypass this exception and finally print its content?

Cheers

unl1k3ly avatar Nov 29 '18 16:11 unl1k3ly

All I'm getting now is aiohttp.client_exceptions.ClientResponseError: 400, message='invalid character in header'. I'm running aiohttp-3.4.4.

Cheers

unl1k3ly avatar Nov 30 '18 04:11 unl1k3ly

So, more updates on this... I've just tested with requests-futures and grequests, and both seem to return the right content rather than raising an exception on a bad response header. Is there a way to bypass this exception so aiohttp finishes the request, @asvetlov?

Thank you for all support.

unl1k3ly avatar Nov 30 '18 13:11 unl1k3ly

If you want to modify the parser code to recover after an invalid header string -- a PR is welcome. I have no time or motivation to work on handling malformed headers myself, but I will review any suggested improvement.

asvetlov avatar Nov 30 '18 14:11 asvetlov

So this makes it impossible for an aiohttp server to process requests with HTTP signatures.

autogestion avatar Feb 11 '19 11:02 autogestion

I'm stuck on aiohttp 3.6.3 because with aiohttp 3.7 and 3.8 I get an invalid character in header exception. The AIOHTTP_NO_EXTENSIONS workaround did not solve the issue for me. I would really appreciate a way to recover from the error and still receive the response body.

ctg3 avatar Mar 01 '22 19:03 ctg3

The response that gives the error (repr(data) in client_proto.py, line 161) is

b'HTTP/1.1 401 Unauthorized\r\nConnection: close\r\nDate: Mon, 26 Mar 2018 11:15:21 GMT\r\nProxy-Connection: close\r\nTransfer-Encoding: chunked\r\nWWW-Authenticate: Basic realm="Crawlera"\r\nX-Crawlera-Error: bad_auth\r\n\r\n0\r\n\r\n0\r\n\r\n'

This is invalid, because a response should be finished after receiving a zero-length chunk: https://www.rfc-editor.org/rfc/rfc9112.html#section-7.1-3

i.e. \r\n0\r\n\r\n should be removed from the end of that response.
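
For illustration, the difference comes down to the trailing bytes: a valid chunked body ends with exactly one zero-length chunk terminator.

# Valid: the chunked body ends after a single zero-length chunk.
valid = (b'HTTP/1.1 204 No Content\r\n'
         b'Transfer-Encoding: chunked\r\n'
         b'\r\n'
         b'0\r\n\r\n')

# Invalid: the server appends a second terminator -- the extra
# b'0\r\n\r\n' is what trips the parser.
invalid = valid + b'0\r\n\r\n'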

The sample in the original issue seems to be fine now.

Dreamsorcerer avatar Aug 05 '23 21:08 Dreamsorcerer