
socket.timeout: The write operation timed out

Open aniljava opened this issue 7 years ago • 26 comments

Copy of cherrypy #1575 https://github.com/cherrypy/cherrypy/issues/1575

  • I'm submitting a ...
    [x] bug report
    [ ] feature request
    [ ] question about the decisions made in the repository

  • Do you want to request a feature or report a bug? bug

  • What is the current behavior? Several socket.timeout stack traces on error logs. No documentation to address this issue.

  • If the current behavior is a bug, please provide the steps to reproduce and, if possible, screenshots and logs of the problem.

Lots of the following trace in the error logs.

[10/Mar/2017:11:13:44] ENGINE socket.error 'The write operation timed out'
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/cheroot/server.py", line 960, in communicate
    req.respond()
  File "/usr/local/lib/python3.5/dist-packages/cheroot/server.py", line 786, in respond
    self.server.gateway(self).respond()
  File "/usr/local/lib/python3.5/dist-packages/cheroot/wsgi.py", line 101, in respond
    self.write(chunk)
  File "/usr/local/lib/python3.5/dist-packages/cheroot/wsgi.py", line 183, in write
    self.req.write(chunk)
  File "/usr/local/lib/python3.5/dist-packages/cheroot/server.py", line 839, in write
    self.conn.wfile.write(chunk)
  File "/usr/local/lib/python3.5/dist-packages/cheroot/makefile.py", line 26, in write
    self._flush_unlocked()
  File "/usr/local/lib/python3.5/dist-packages/cheroot/makefile.py", line 35, in _flush_unlocked
    n = self.raw.write(bytes(self._write_buf))
  File "/usr/lib/python3.5/socket.py", line 593, in write
    return self._sock.send(b)
  File "/usr/lib/python3.5/ssl.py", line 861, in send
    return self._sslobj.write(data)
  File "/usr/lib/python3.5/ssl.py", line 586, in write
    return self._sslobj.write(data)
socket.timeout: The write operation timed out
  • What is the expected behavior?

If the error is fatal or of concern, a guide on how to address it; if not, a way to silence it.

  • What is the motivation / use case for changing the behavior?

A guide that Google can pick up for anyone else having similar issues.

  • Please tell us about your environment:
  • Python version: Python 3.5.2
  • OS: Linux n2 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • Browser: all

aniljava avatar Mar 12 '17 17:03 aniljava

Under what conditions does this occur? Could it be that the remote end has just stopped receiving data? Maybe someone's cell phone went into a train tunnel. Maybe the user closed their browser right after issuing the request. I don't think it's possible from the server end to provide much more detail without knowing more about the application and the environment in which it's running.

Much better would be if you could devise a scenario where we can replicate the environment that triggers the error. Can you develop a minimal example that does just that? I understand that often it's difficult or impossible to devise such an example. In that case, your best bet is going to be to use a tool like wireshark or similar to investigate and characterize the activity that leads to this error.

If someone can suggest a concrete way to improve the error message or otherwise provide better context under this condition, I'm open to suggestions, but as stated, this ticket doesn't feel actionable to me.

jaraco avatar Mar 25 '17 14:03 jaraco

Since I'm seeing the same error report, I thought I would add a comment. I don't expect a solution, because I suspect the problem lies in the code calling the cheroot server in the first place, but I can add at least one observation.

Well, it's not because the end user went into a tunnel and lost their connection. :-) Unfortunately, my setup is way too complicated to prepare a simple small reproducible example, though it IS reproducible in my complex setup.

Again, I don't expect a solution here. I have a rather complicated data server running within a docker container serving up images controlled by a javascript widget in the browser. The server authors did not originally set up their CherryPy to run in HTTPS mode out of the box. We modified the configuration so that it would and it's worked until now - when we try to view images. I suspect that in their testing, the authors never tested this particular image viewer plugin against an HTTPS setup. That's their (and my) problem, not cheroot's.

Perhaps the original poster is operating in a similar "designed for http but hacked to run under https" environment.

It does occur to me, though, that I could track down the cause of the problem if server.py passed the error along to the calling code. Maybe it does, for all I know, and the authors of my server neglected to print that information in the log files they create, which I'm reading to see the error. Bad on them!

HDVUCAIR avatar Jul 21 '17 22:07 HDVUCAIR

@HDVUCAIR thanks for the info. If you ever manage to troubleshoot this, feel free to submit a PR or post a trace.

webknjaz avatar Jul 22 '17 13:07 webknjaz

When using proxy_buffering in nginx (on by default), this can happen. proxy_max_temp_file_size defaults to 1 GB, so if the client is slow and the server is fast, nginx will buffer to disk. If the buffer exceeds 1 GB, the socket is closed (I don't know exactly why). With proxy_buffering off, the proxying is synchronous and you won't see this. It would be nice if nginx could swap to synchronous mode once the max temp file size is reached, but that does not seem to be the current behaviour.

http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_max_temp_file_size
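For anyone seeing this behind nginx, a sketch of the proxy configuration in question (the upstream address is an assumption; adjust it to your deployment):

```nginx
location / {
    proxy_pass http://127.0.0.1:8080;  # assumed cheroot/CherryPy upstream

    # Stream responses to the client synchronously instead of spooling
    # them to disk (up to proxy_max_temp_file_size, 1 GB by default):
    proxy_buffering off;

    # Alternative: keep in-memory buffering but never spill to temp
    # files, which also makes the transfer effectively synchronous:
    # proxy_max_temp_file_size 0;
}
```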

grove avatar Mar 28 '18 20:03 grove

@grove any chance to get a Dockerfile with reproducible example?

webknjaz avatar Apr 04 '18 08:04 webknjaz

Hi, I am running my JMeter tests using a Taurus YAML file and am getting the following errors in the console:

22:28:10 WARNING: Failed to send data, will retry in 30.0 sec...
22:29:08 ERROR: Fatal error sending data:
Traceback (most recent call last):
  File "C:\Program Files\Python35\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "C:\Program Files\Python35\lib\site-packages\urllib3\connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "C:\Program Files\Python35\lib\http\client.py", line 1107, in request
    self._send_request(method, url, body, headers)
  File "C:\Program Files\Python35\lib\http\client.py", line 1152, in _send_request
    self.endheaders(body)
  File "C:\Program Files\Python35\lib\http\client.py", line 1103, in endheaders
    self._send_output(message_body)
  File "C:\Program Files\Python35\lib\http\client.py", line 936, in _send_output
    self.send(message_body)
  File "C:\Program Files\Python35\lib\http\client.py", line 908, in send
    self.sock.sendall(data)
  File "C:\Program Files\Python35\lib\ssl.py", line 899, in sendall
    v = self.send(data[count:])
  File "C:\Program Files\Python35\lib\ssl.py", line 869, in send
    return self._sslobj.write(data)
  File "C:\Program Files\Python35\lib\ssl.py", line 594, in write
    return self._sslobj.write(data)
socket.timeout: The write operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\Python35\lib\site-packages\requests\adapters.py", line 445, in send
    timeout=timeout
  File "C:\Program Files\Python35\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "C:\Program Files\Python35\lib\site-packages\urllib3\util\retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "C:\Program Files\Python35\lib\site-packages\urllib3\packages\six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "C:\Program Files\Python35\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "C:\Program Files\Python35\lib\site-packages\urllib3\connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "C:\Program Files\Python35\lib\http\client.py", line 1107, in request
    self._send_request(method, url, body, headers)
  File "C:\Program Files\Python35\lib\http\client.py", line 1152, in _send_request
    self.endheaders(body)
  File "C:\Program Files\Python35\lib\http\client.py", line 1103, in endheaders
    self._send_output(message_body)
  File "C:\Program Files\Python35\lib\http\client.py", line 936, in _send_output
    self.send(message_body)
  File "C:\Program Files\Python35\lib\http\client.py", line 908, in send
    self.sock.sendall(data)
  File "C:\Program Files\Python35\lib\ssl.py", line 899, in sendall
    v = self.send(data[count:])
  File "C:\Program Files\Python35\lib\ssl.py", line 869, in send
    return self._sslobj.write(data)
  File "C:\Program Files\Python35\lib\ssl.py", line 594, in write
    return self._sslobj.write(data)
urllib3.exceptions.ProtocolError: ('Connection aborted.', timeout('The write operation timed out',))

Can someone help out with resolving this issue?

Vinayak123Kalyani avatar Aug 16 '18 07:08 Vinayak123Kalyani

It's irrelevant to this issue and to Cheroot/CherryPy.

webknjaz avatar Aug 16 '18 07:08 webknjaz

I'll keep this open for now so that someone has a chance to reproduce it with the nginx scenario.

webknjaz avatar Aug 16 '18 08:08 webknjaz

My team has seen this with respect to file downloads > 1GB, nginx reverse proxy, nginx proxy_buffering, and the Flask framework instructing the hosting webserver (cherrypy) to stream the file.

For the present, we decided to turn off nginx proxy_buffering. In our case, the physical server is relatively big: 40 CPU cores, 384 GB RAM, and lots of storage. CherryPy is being leveraged to host the Flask app that serves the APIs. Since downloads and API calls are not the main workload of the server, it was less effort to disable nginx proxy_buffering than to debug the issue.

jeffvanvoorst avatar Sep 03 '18 20:09 jeffvanvoorst

@jeffvanvoorst any chance to get a minimum reproducible example in a container?

webknjaz avatar Sep 03 '18 20:09 webknjaz

This is the smallest (hopefully reproducible) case I have come up with (Win10, Python 3.8.5, CherryPy 18.5.0):

cherrypy_timeout_error.zip

It's a basic HTTPS server serving a static file from the subdirectory static. If you try to download the file with rate-limited curl, then after 10 seconds the error appears and the transfer stops.

The error appears to happen only over HTTPS and only when transferring larger files (tens of MB or more).

#! python
import os.path
import cherrypy

class Root:
	@cherrypy.expose
	def index(self):
		return """
			curl -k --limit-rate 2k https://127.0.0.1:8080/static/50.bin > 50.bin		
		"""

if __name__ == '__main__':
	if not os.path.exists('static/50.bin'):
		with open('static/50.bin', 'wb') as bin_file:
			bin_file.write(b"0" * 50 * 1024 * 1024)

	current_dir = os.path.dirname(os.path.abspath(__file__))
	# Set up site-wide config first so we get a log if errors occur.
	cherrypy.config.update({
		'environment': 'production',
		'log.error_file': 'site.log',
		'log.screen': True,

		'server.ssl_module': 'builtin',
		'server.ssl_certificate': os.path.join(current_dir, 'localhost.cert'),
		'server.ssl_private_key': os.path.join(current_dir, 'localhost.key'),
	})

	conf = {
		'/static': {
			'tools.gzip.on': False,
			'tools.staticdir.on': True,
			'tools.staticdir.dir': os.path.join(current_dir, 'static'),
		}
	}
	cherrypy.quickstart(Root(), '/', config=conf)

And the download command is:

$ curl -k --limit-rate 2k https://127.0.0.1:8080/static/50.bin > 50.bin

altermark avatar Sep 18 '20 17:09 altermark

If you try to download the file with rate limited curl, then after 10 seconds the error appears and transfer stops.

Okay, I can see this traceback appearing after 10 s, but the transfer does not stop. curl keeps writing the downloaded chunks of data, even after I shut down the process with CherryPy. I suppose curl does some caching on its own...

webknjaz avatar Sep 20 '20 15:09 webknjaz

I see. curl reads 200 KB in 4 seconds, then reads nothing from the connection until the average speed drops to 2048 B/s, and then continues reading from the server.

When large transfers from our production server were interrupted after ten seconds, this traceback was the only thing that appeared in the logs. I increased the socket timeout and the "time to interruption" increased as well. Since the connection does not break with the reproducer, there must be something else at play in our production as well, but I have not been able to reproduce the disconnect, only the error message. I should not have mentioned the disconnect in relation to the reproducer. My mistake.

altermark avatar Sep 20 '20 19:09 altermark

Since the connection does not break with the reproducer, there must be some other thing at play

Well, I wouldn't be so sure. It may be some curl bug: it kept running after the server was dead (there couldn't have been any connection live after that).

webknjaz avatar Sep 20 '20 19:09 webknjaz

@altermark it'd be great if you could eliminate CherryPy from the reproducer and make one that only relies on Cheroot...

Also, could you post your traceback? According to the topic starter, the socket.timeout exception is supposed to pop up at https://github.com/cherrypy/cheroot/blob/7268068/cheroot/server.py#L1283. But it's already wrapped with a try/except that catches socket.error, and socket.timeout is a subclass of socket.error, so it should get caught too. I'm confused because there's no re-raising in the code...

webknjaz avatar Sep 20 '20 19:09 webknjaz

Oh, it goes ahead and attempts to send a response if there was a timeout by calling _conditional_error() and that is unprotected.

FWIW changing if errnum in timeout_errs: to if 'write' not in str(ex) and errnum in timeout_errs: should be a quick patch. But we need a pytest-based regression test for this anyway.

webknjaz avatar Sep 20 '20 20:09 webknjaz

I guess doing something like https://stackoverflow.com/questions/1833563/simple-way-to-simulate-a-slow-network-in-python should be a good start...
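A minimal sketch of such a throttled client (my own illustration, not cheroot code): a reader that drains a socket at a capped average rate. Pointed at a fast server, this is enough to make the peer's send buffer fill up so that its write eventually blocks and times out.

```python
import socket
import time

def slow_reader(sock, rate_bytes_per_sec, chunk=1024):
    """Drain ``sock`` until EOF at roughly ``rate_bytes_per_sec``."""
    received = b''
    start = time.monotonic()
    while True:
        data = sock.recv(chunk)
        if not data:  # peer closed its end
            break
        received += data
        # Sleep just long enough to keep the average rate at the cap.
        expected = len(received) / rate_bytes_per_sec
        elapsed = time.monotonic() - start
        if expected > elapsed:
            time.sleep(expected - elapsed)
    return received

if __name__ == '__main__':
    # Self-contained demo over a socketpair instead of a real server.
    a, b = socket.socketpair()
    a.sendall(b'x' * 4096)
    a.shutdown(socket.SHUT_WR)
    print(len(slow_reader(b, rate_bytes_per_sec=8192)))
```

To exercise the server in this issue, the reader would be pointed at the HTTPS endpoint instead of a socketpair, much like the `curl --limit-rate` command above.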

webknjaz avatar Sep 20 '20 20:09 webknjaz

To reproduce the error message, the following code should be enough:

import requests
import threading
import time

from cheroot.wsgi import Server
from cheroot.ssl.builtin import BuiltinSSLAdapter

def counter():
	i = 0
	while True:
		i += 1
		row = b"%i " % i
		yield row

def app(environ, start_fn):
	start_fn('200 OK', [('Content-Type', 'application/octet-stream')])
	return counter()

def test_small():
	server = Server(bind_addr=("127.0.0.1", 8080), wsgi_app=app)
	server.ssl_adapter = BuiltinSSLAdapter(
		certificate='localhost.cert',
		private_key='localhost.key',
	)
	server.timeout = 1

	server_thread = threading.Thread(target=server.safe_start)
	server_thread.start()
	while not server.ready:
		time.sleep(0.1)

	r = requests.get('https://127.0.0.1:8080/', stream=True, verify=False)

	# sleep so that socket times out
	time.sleep(2)

	server.stop()
	server_thread.join()

if __name__ == "__main__":
	test_small()

You don't even need to read anything. When I tried to read data, the connection error appeared only after I had read all the data from the buffer, which seems to be (on my PC) at least 490 KB.

	# sleep so that socket times out
	# if you comment this sleep out, it reads all data without error
	time.sleep(3)

	try:
		data = r.raw.read(2_000_000)
		print(">", len(data))
	except Exception as err:
		print("Error reading data:", err)
	finally:
		server.stop()
		server_thread.join()

And the error was:

Error reading data: ('Connection broken: IncompleteRead(490656 bytes read)', IncompleteRead(490656 bytes read))

But because I have no control of the speed the buffer fills, sometimes I had to read more.

altermark avatar Sep 20 '20 23:09 altermark

You can even stop the server after sleep and then read data from the buffer:

	time.sleep(3)

	server.stop()
	server_thread.join()

	try:
		data = r.raw.read(2_000_000)
		print(">", len(data))
	except Exception as err:
		print("Error reading data:", err)

altermark avatar Sep 20 '20 23:09 altermark

Thanks for the investigation! The next step would be to create a PR with tests.

webknjaz avatar Sep 21 '20 13:09 webknjaz

I have a similar (same?) issue with gzip encoding. Steps to repro:

  1. Serve a large file
  2. Content-Length header is determined before actually sending out the data in _cprequest
  3. Step 2 actually forces the entire response to be created in memory
  4. I suspect that when the SSL timeout is issued, it is trying to write the entire response with one write/send call, and this times out after the 10-second timeout.
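If the whole body really is being materialized so Content-Length can be computed, one possible mitigation (an assumption on my part, not a confirmed fix for this report) is to enable CherryPy's response streaming for the affected handler, so no Content-Length is computed and chunks go out as the handler produces them:

```python
# Hypothetical path config for a large-download handler; with
# response.stream enabled, CherryPy skips computing Content-Length
# and sends the body as the handler generates it.
conf = {
    '/download': {
        'response.stream': True,
        'tools.gzip.on': False,  # gzip would re-buffer/encode the body
    }
}
```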

ryanrasti avatar Dec 06 '20 22:12 ryanrasti

I am struggling to understand this issue. Is it fixed or not? Does anyone know how to fix it?

OneTouchOfME avatar Dec 12 '22 14:12 OneTouchOfME

Oh, it goes ahead and attempts to send a response if there was a timeout by calling _conditional_error() and that is unprotected.

FWIW changing if errnum in timeout_errs: to if 'write' not in str(ex) and errnum in timeout_errs: should be a quick patch. But we need a pytest-based regression test for this anyway.

@webknjaz Your one-line fix doesn't actually fix it, because then the elif also does a _conditional_error()

        except socket.error as ex:
            errnum = ex.args[0]
            # sadly SSL sockets return a different (longer) time out string
            timeout_errs = 'timed out', 'The read operation timed out'
            if errnum in timeout_errs:
                # Don't error if we're between requests; only error
                # if 1) no request has been started at all, or 2) we're
                # in the middle of a request.
                # See https://github.com/cherrypy/cherrypy/issues/853
                if (not request_seen) or (req and req.started_request):
                    self._conditional_error(req, '408 Request Timeout')
            elif errnum not in errors.socket_errors_to_ignore:
                self.server.error_log(
                    'socket.error %s' % repr(errnum),
                    level=logging.WARNING, traceback=True,
                )
                self._conditional_error(req, '500 Internal Server Error')

The question is what should happen when the error is 'The write operation timed out'. Should we skip the 408 error case and add this error to errors.socket_errors_to_ignore so it skips the 500 error case?

Who needs to be made aware that there is a real problem with the cheroot server crashing, and that it's easily reproducible?

gfwellman avatar Jul 21 '23 20:07 gfwellman

Should we skip the 408 error case and add this error to errors.socket_errors_to_ignore so it skips the 500 error case?

I guess not, since that results in the application dying without a stack trace instead of with one.

gfwellman avatar Jul 21 '23 20:07 gfwellman

I guess not, since that results in the application dying without a stack trace instead of with one.

That was a mistake in my interpretation of events. The application was dying without a stack trace for an almost unrelated reason (it ran out of memory due to the large responses I was creating to trigger this problem)

It now appears that adding 'The write operation timed out' to timeout_errs (and thus responding 408) will fix the problem. That should probably also be added to errors.socket_errors_to_ignore.

(Further, this means that the unprotected calls to _conditional_error were never the problem. It was crashing if it went into the elif to return a 500, but doesn't crash if it goes into the case before that to return a 408. So I guess it was the unprotected call to self.server.error_log? I'll have to double-check that when I'm at work next week.)
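That handling can be sketched as a standalone helper (hypothetical names mirroring the except clause quoted above, not the actual cheroot patch), with 'The write operation timed out' added to the timeout strings:

```python
# Stand-ins for the values cheroot uses in its except socket.error clause.
TIMEOUT_ERRS = (
    'timed out',
    'The read operation timed out',
    'The write operation timed out',  # the proposed addition
)
SOCKET_ERRORS_TO_IGNORE = frozenset()  # stand-in for errors.socket_errors_to_ignore

def classify_socket_error(errnum, request_seen, request_started):
    """Return the action the connection loop would take for ``errnum``."""
    if errnum in TIMEOUT_ERRS:
        # Only report if no request was seen at all, or one is mid-flight.
        if (not request_seen) or request_started:
            return '408 Request Timeout'
        return 'ignore'
    if errnum not in SOCKET_ERRORS_TO_IGNORE:
        return '500 Internal Server Error'
    return 'ignore'
```

With the extra string, a write timeout lands in the 408 branch instead of falling through to the 500 branch that was crashing.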

gfwellman avatar Jul 22 '23 20:07 gfwellman

One more comment in case anyone runs into this error for the same reason I did. Yes, my proposed fix above will prevent Cheroot from crashing, but if 'The write operation timed out' is happening for a good reason, ignoring it will result in returning HTTP 200 with a truncated response. If you're experiencing this with CherryPy when returning large responses, you probably just need to increase the server.socket_timeout value in the configuration file that you give to CherryPy on startup.
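For reference, the entry would look like this in a CherryPy config file (60 is an arbitrary example value; the default discussed throughout this thread is 10 seconds):

```ini
[global]
# Seconds a blocking socket send/recv may take before socket.timeout
# is raised; raise this if large responses to slow clients get cut off.
server.socket_timeout: 60
```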

gfwellman avatar Jul 29 '23 06:07 gfwellman