`Lwt_unix.read` on Windows with non-blocking fd seems to block Lwt
As a workaround to https://github.com/ocsigen/lwt/issues/569, I recently switched to using non-blocking fds. This worked on Linux and OSX but broke my Windows build. While I'm not sure of the cause, here's what I know and how I worked around it. I have not attempted to put together a small repro case yet, because my Windows computer is a potato.
- Compiler: 4.03.0+mingw64c
- Lwt: 3.3.0
- Lwt_ppx: 1.1.0
What my code does:
I have a server, written with Lwt.
- It reads an fd using `Lwt_unix.read`.
- After the read, it pushes a message to an `Lwt_stream`.
- Another Lwt thread waits on `Lwt_stream.next`.
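For readers who want to picture the setup, here is a minimal sketch of the shape described above. It is not the actual server code: the names `read_loop` and `wait_for_message` and the 4096-byte buffer size are assumptions made up for illustration.

```ocaml
open Lwt.Infix

(* Hypothetical reader: read a chunk from [fd] and push it to the stream.
   [push] is the push function returned by Lwt_stream.create. *)
let rec read_loop (fd : Lwt_unix.file_descr) (push : string option -> unit) =
  let buf = Bytes.create 4096 in
  Lwt_unix.read fd buf 0 (Bytes.length buf) >>= fun n ->
  if n = 0 then begin
    (* End of file: pushing None closes the stream. *)
    push None;
    Lwt.return_unit
  end else begin
    push (Some (Bytes.sub_string buf 0 n));
    read_loop fd push
  end

(* Hypothetical consumer: the Lwt thread that waits on the stream. *)
let wait_for_message (stream : string Lwt_stream.t) : string Lwt.t =
  Lwt_stream.next stream
```

The two sides would be wired together with `let stream, push = Lwt_stream.create ()`, running `read_loop fd push` and `wait_for_message stream` concurrently.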
Behavior with blocking fd
Server reads from fd and pushes to stream. The thread waiting on the stream wakes up and reads from the stream.
Behavior with non-blocking fd
Server reads from fd and pushes to stream.
The thread waiting on the stream never wakes up.
If I use `Lwt_engine.set`, I can confirm that the `select` method is never being called, even though there are other fds being read.
My fix
If I run `Lwt_unix.wait_read fd` before the `Lwt_unix.read`, it fixes my problem.
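Roughly, the workaround looks like the sketch below. The helper name `read_when_ready` is made up for illustration; only `Lwt_unix.wait_read` and `Lwt_unix.read` come from Lwt.

```ocaml
(* Sketch of the workaround: wait until [fd] is readable before reading.
   Requires the lwt.unix library. *)
let read_when_ready (fd : Lwt_unix.file_descr) (buf : Bytes.t) : int Lwt.t =
  let open Lwt.Infix in
  (* wait_read returns a promise that resolves once fd is readable,
     so other Lwt threads can run in the meantime. *)
  Lwt_unix.wait_read fd >>= fun () ->
  Lwt_unix.read fd buf 0 (Bytes.length buf)
```

This only helps if the fd is one that Lwt can actually wait on; it does not explain why the plain `Lwt_unix.read` stalls.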
My best guess
`Lwt_unix.readable fd` is often false before the `Lwt_unix.read` call. So my guess is that `Lwt_unix.read` is effectively a blocking call when using a non-blocking fd, and blocks all of Lwt until the fd is readable and the `read` call completes.
This may have been fixed by https://github.com/ocsigen/lwt/commit/86a6baff880f60faffccad49425478deb06c277b, part of #569, which isn't part of a release yet.
The underlying problem was also spotted by @gabelevi, so thanks :)
> This may have been fixed by 86a6baf, part of #569, which isn't part of a release yet.

Unfortunately, that fix is for blocking fds. This issue is for non-blocking fds.
@gabelevi, on Windows, only sockets can be non-blocking (in the Unix sense). Among other problems, you may have a fd that Lwt thinks is non-blocking, but is actually blocking. What kind of fd do you have, and how are you creating the `Lwt_unix.file_descr` for it?
@aantron, I think we're hitting a similar issue in `esy` in this code path:
let f ic = Lwt_io.read ic in
Lwt_io.with_file ~mode:Lwt_io.Input path f
https://github.com/esy/esy/blob/571f1c28d15752ec960550abfd25eaa7b4ec8b80/esy-lib/Fs.ml#L14
The issue seems to be that `Lwt_io.read` eventually calls `Lwt_unix.openfile` with `O_NONBLOCK`. `Lwt_unix.openfile` calls `Unix.openfile` on Windows, which ignores the `O_NONBLOCK` setting (as you mentioned, files can't be non-blocking on Windows).
The `Unix.openfile` function is implemented here for Windows: https://github.com/ocaml/ocaml/blob/db9671f67b47a582b078f1a92e42edfd9f6a0fd7/otherlibs/win32unix/open.c#L44
...which is where `O_NONBLOCK` is ignored. But I think this is a problematic case for us.
cc @andreypopp
@bryphe I don't think that the `O_NONBLOCK` is the problem:
- `Lwt_io.with_file` eventually does call `Lwt_unix.openfile` with `O_NONBLOCK`, and that flag is indeed meaningless (on any system for a regular file).
- However, for making the `Lwt_unix.file_descr`, Lwt's "view" of the fd, it calls `Lwt_unix.of_unix_file_descr`:
  https://github.com/ocsigen/lwt/blob/596058030a25202a4220fa650278db88043573c1/src/unix/lwt_unix.cppo.ml#L593-L595
  which is an alias
  https://github.com/ocsigen/lwt/blob/596058030a25202a4220fa650278db88043573c1/src/unix/lwt_unix.cppo.ml#L426
  for `mk_ch`:
  https://github.com/ocsigen/lwt/blob/596058030a25202a4220fa650278db88043573c1/src/unix/lwt_unix.cppo.ml#L335-L344
  which calls `is_blocking` to guess the blocking mode:
  https://github.com/ocsigen/lwt/blob/596058030a25202a4220fa650278db88043573c1/src/unix/lwt_unix.cppo.ml#L290-L312
  but since this is Win32 and the file is not a socket (presumably, it is a `HANDLE` underneath, not `SOCKET`), and the caller did not supply `~blocking`, `is_blocking` correctly decides that the file is opened in blocking mode, which should cause Lwt to run I/O requests on it in worker threads (to avoid blocking the main thread).
So, something else must be the problem.
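One low-effort way to check what Lwt decided for a given fd is to ask it directly with `Lwt_unix.blocking`. The sketch below is only an illustration: the helper names `guessed_blocking_mode` and `wrap_as_blocking` are made up, and it assumes you have the raw `Unix.file_descr` at hand.

```ocaml
(* Wrap a raw Unix fd without passing ~blocking, so that Lwt guesses
   the mode itself, then report what it decided. *)
let guessed_blocking_mode (ufd : Unix.file_descr) : bool Lwt.t =
  let fd = Lwt_unix.of_unix_file_descr ufd in
  (* Resolves to true if Lwt treats the fd as blocking. *)
  Lwt_unix.blocking fd

(* Alternatively, state the mode explicitly instead of relying on the guess. *)
let wrap_as_blocking (ufd : Unix.file_descr) : Lwt_unix.file_descr =
  Lwt_unix.of_unix_file_descr ~blocking:true ufd
```

If `guessed_blocking_mode` resolves to `false` for something that is not a socket on Windows, that would point at the mismatch described earlier in the thread.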
The way I would debug this is by inserting prints into both the "user" code and Lwt (as needed), to see what is actually being triggered and what the various values are. I presume you are doing something like this on your end right now. I can also help out, but to resolve this kind of issue, I need to be able to reproduce it, whether in a small case, or if you can provide instructions on what full projects I should build, in what environment, etc.
Ah ya, thanks for the details @aantron - you're right, given that, the issue we faced with esy/esy#539 does not seem to be the same issue here. It sounds like in this case, too, the entire thread is blocked, which is not what we're experiencing here. We still haven't been able to show conclusively that `Lwt` is to blame at the moment either - I think what would be useful for us to see in our repro is whether the `close` method is being called successfully, and at that point, whether there are still any open handles on it.
You may be able to get some information by attaching a finalizer to the `Lwt_io` channel (i.e., the argument `Lwt_io.with_file` passes to your function) using `Gc.finalise`, and triggering a garbage collection with `Gc.full_major` when the `with_file` promise resolves. See https://caml.inria.fr/pub/docs/manual-ocaml/libref/Gc.html.
If the finaliser is called, at least we know nothing is keeping a reference to the Lwt file descriptor, which is what we would expect.
That doesn't mean the underlying OS fd is closed, or the file isn't open through another fd elsewhere, but it does give some info for relatively low effort.
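A minimal sketch of that experiment, assuming a code path like the `Lwt_io.with_file` one above. The function name `read_with_finaliser` and the `on_collect` callback are made up for illustration; in practice the callback could just print a message.

```ocaml
open Lwt.Infix

(* Read a whole file, attaching a finaliser to the input channel so we
   can observe whether the channel becomes unreachable afterwards. *)
let read_with_finaliser (path : string) (on_collect : unit -> unit)
  : string Lwt.t =
  Lwt_io.with_file ~mode:Lwt_io.Input path (fun ic ->
    (* Runs on_collect when the GC finds the channel unreachable. *)
    Gc.finalise (fun _ -> on_collect ()) ic;
    Lwt_io.read ic)
  >>= fun contents ->
  (* Force a full major collection once with_file has resolved. *)
  Gc.full_major ();
  Lwt.return contents
```

Whether the finaliser fires on the first collection is not guaranteed, so treat a missing call as a hint to investigate rather than as proof of a leak.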