use os.sendfile() when available
The StaticFileHandler should use os.sendfile() if available. While static files would commonly be sent by an external webserver (nginx etc.) in a production environment, sometimes files are generated on the fly and need to be written out through the IOLoop. sendfile() would reduce the overhead here.
I'd also like a general high-level RequestHandler.send_file(file_path, content_type=None) method for this case, which the StaticFileHandler would call but which other user code could conveniently use as well. If sendfile() is not available, it would simply fall back to forwarding chunks, but user code would no longer have to care how exactly it works on the current system.
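To make the proposal concrete, the fallback path of such a method might look roughly like this (a hypothetical sketch; send_file is the proposed name, everything else is illustrative, and a full version would try os.sendfile() first where available):

```python
import mimetypes
import os

from tornado import web

class FileServingHandler(web.RequestHandler):
    # Hypothetical handler showing only the chunk-forwarding fallback.
    async def send_file(self, file_path, content_type=None):
        if content_type is None:
            content_type = (mimetypes.guess_type(file_path)[0]
                            or "application/octet-stream")
        self.set_header("Content-Type", content_type)
        self.set_header("Content-Length", str(os.path.getsize(file_path)))
        with open(file_path, "rb") as f:
            while True:
                chunk = f.read(64 * 1024)  # forward one chunk at a time
                if not chunk:
                    break
                self.write(chunk)
                await self.flush()  # wait for the IOLoop to drain the buffer
```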
How much time does your profiling data say you're spending in this part of the code?
It wastes both time and memory. Sending files is something that can be done very efficiently with sendfile (and in fact directly in the IO loop, driven by "socket free to send" events), whereas application-level code that does this efficiently is quite tricky to write. The normal approach is to read the entire file into memory in chunks, call write() for each chunk, and then let the IO loop send everything out afterwards, so I'm really more concerned about the memory impact than the time impact. Implementing something that reads the next chunk from the file only when the socket signals that it can send is low-level and involved, and not something an application should have to deal with. Plus, it could still block while reading the file.
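To illustrate the memory problem, the naive pattern described above looks roughly like this (a hypothetical sketch):

```python
def write_whole_file(handler, file_path, chunk_size=64 * 1024):
    # Naive pattern: every chunk is appended to the handler's in-memory
    # output buffer, so the whole file is resident before anything is sent.
    with open(file_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            handler.write(chunk)  # buffered only; nothing hits the socket yet
    # the IO loop drains the accumulated buffer only after the handler returns
```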
With sendfile on a non-blocking socket, the IO loop can tell the socket to send out the next bytes whenever it's ready to send, and the socket will tell the IO loop when it's done. That means the file is sent almost entirely without impacting the rest of the Python process, neither CPU-wise nor memory-wise.
Here is an example where sendfile() is used on a non-blocking socket: https://code.google.com/p/pyftpdlib/source/browse/tags/release-0.7.0/pyftpdlib/ftpserver.py#1035
When sendfile can be used, it is undoubtedly better than the alternatives in several dimensions. But it cannot be used in all cases, so the non-sendfile version must be kept as well. So the question is a cost/benefit tradeoff: is the benefit of sendfile large enough to justify the cost of a second implementation (and the associated difficulties in testing and ensuring that both implementations are equivalent)?
The loop in StaticFileHandler is already fairly simple and memory-efficient (note that content is an iterator, not a list): it only reads one chunk more than what will fit in the socket buffer.
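For reference, that pattern is roughly the following (a simplified sketch along the lines of Tornado's chunked reading, not its exact code):

```python
def get_content(abspath, chunk_size=64 * 1024):
    # Generator: only one chunk is in memory at a time; the handler
    # flushes each chunk to the socket before pulling the next one.
    with open(abspath, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk
```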
Just wanted to mention this, though I'm sure you're already aware: in Python 3.5 there's now socket.socket.sendfile() (which uses os.sendfile()).
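For example (a minimal sketch with a blocking socket; host and filename are made up):

```python
import socket

# socket.sendfile() uses os.sendfile() when the platform supports it
# and transparently falls back to a send() loop otherwise.
with socket.create_connection(("example.com", 80)) as sock:
    with open("payload.bin", "rb") as f:
        sock.sendfile(f)
```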
I ended up here by accident. Just wanted to point out that using sendfile() makes a real difference. In pyftpdlib (I'm the author), transfers on localhost using sendfile() are nearly 3 times as fast as with plain send(), so I think such an addition to Tornado should be seriously considered. Well-known software such as proftpd, vsftpd, nginx and samba all implement sendfile support.
- On Python 2.7 you can use my backport of the 3.X functionality: https://github.com/giampaolo/pysendfile
- On Python 3.3+ you have os.sendfile() as already mentioned.
- socket.socket.sendfile() (I contributed that patch) is useless in this case because it cannot be used with non-blocking sockets.
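Putting those options together, the version-dependent import might look like this (a sketch):

```python
try:
    from os import sendfile  # Python 3.3+
except ImportError:
    try:
        from sendfile import sendfile  # pysendfile backport (Python 2.x)
    except ImportError:
        sendfile = None  # no sendfile available: keep the plain send() path
```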
As for how to integrate this into Tornado IO loop you can take a look at how this is done in pyftpdlib: https://github.com/giampaolo/pyftpdlib/blob/6165a40fd4f7c7cd2d549dbb816b3208091ffdff/pyftpdlib/handlers.py#L658 Roughly:
- every time the socket is ready to be written, you call sendfile() instead of socket.send() (you'll never do file.read())
- sendfile needs 2 fds (the socket fd and the file fd), the file offset and the max number of bytes to send on each call; here is the function signature
- sendfile returns how many bytes were written; you have to keep track of that so that you can pass an updated "offset" parameter on each call
- sendfile() returns 0 when EOF is reached (meaning you're done)
- in case the file is not a "regular" file, sendfile() will fail with EINVAL; you have to take that into account when you call it the first time: if it fails with EINVAL you can fall back on using the plain send() method / logic
- sendfile will raise exceptions similar to send() (EAGAIN, ECONNRESET, etc.), see here
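A rough Python sketch of that loop against Tornado's IOLoop (hypothetical helper; send_file_nonblocking and on_done are made-up names):

```python
import errno
import os

from tornado.ioloop import IOLoop

def send_file_nonblocking(sock, fileobj, on_done):
    # Each time the non-blocking socket is writable, push the next
    # slice of the file with os.sendfile().
    offset = 0
    in_fd = fileobj.fileno()
    loop = IOLoop.current()

    def on_writable(fd, events):
        nonlocal offset
        try:
            # sendfile() takes both fds, the current offset and a max
            # byte count; it returns how many bytes were actually sent.
            sent = os.sendfile(sock.fileno(), in_fd, offset, 64 * 1024)
        except OSError as e:
            if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                return  # socket buffer full; wait for the next WRITE event
            loop.remove_handler(sock.fileno())
            if e.errno == errno.EINVAL and offset == 0:
                on_done(fallback=True)  # not a regular file: use send()
                return
            raise
        if sent == 0:
            # EOF reached: the whole file has been transmitted
            loop.remove_handler(sock.fileno())
            on_done(fallback=False)
        else:
            offset += sent  # resume from the updated offset next time

    loop.add_handler(sock.fileno(), on_writable, IOLoop.WRITE)
```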
Last: I am thinking about rewriting pyftpdlib by using Tornado because it provides an HTTP client, coroutines and support for easily integrating threads/processes to avoid blocking calls. As such, I may be able to contribute a sendfile() patch for Tornado, but I'm not sure when as it's a lot of work (pyftpdlib side).
Any News?
Any recent news on this? Is there any way we can use os.sendfile with an HTTP server in Python?
I think this is unlikely to happen so I'm going to go ahead and close this issue as "unplanned".
Now that we're unconditionally wrapping the asyncio event loop, the functionality is available at the lowest level via that interface. But at the HTTP level, supporting sendfile would involve breaking a lot of abstractions and would only work in limited circumstances (no TLS, etc). I don't think this is a good fit for Tornado.
That leaves IOStream, where I think there is some chance that a sendfile interface might make sense. I might be willing to accept a PR to add this. But overall my recommendation would be that if you care about sendfile you're probably better off working with asyncio protocols and transports rather than IOStream (which is designed to do a lot of buffering).
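For reference, the asyncio route mentioned above is available directly since Python 3.7; a minimal sketch (sock must be a non-blocking socket):

```python
import asyncio

async def serve_file(sock, path):
    # loop.sock_sendfile() uses os.sendfile() where the platform
    # supports it and falls back to a read/send loop otherwise.
    loop = asyncio.get_running_loop()
    with open(path, "rb") as f:
        await loop.sock_sendfile(sock, f)
```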