
`Lwt_unix.read` on Windows with non-blocking fd seems to block Lwt

Open gabelevi opened this issue 6 years ago • 7 comments

As a workaround to https://github.com/ocsigen/lwt/issues/569, I recently switched to using non-blocking fds. This worked on Linux and OSX but broke my Windows build. While I'm not sure of the cause, here's what I know and how I worked around it. I have not attempted to put together a small repro case yet, because my Windows computer is a potato.

  • Compiler: 4.03.0+mingw64c
  • Lwt: 3.3.0
  • Lwt_ppx: 1.1.0

What my code does:

I have a server, written with Lwt.

  1. It reads an fd using Lwt_unix.read
  2. After the read, it pushes a message to an Lwt_stream.
  3. Another Lwt thread waits on Lwt_stream.next
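The three steps above can be sketched roughly as follows. This is a minimal illustration, not the reporter's code: it assumes a pipe stands in for the server's fd, the helper names `reader`, `consumer` and `run_once` are hypothetical, and it was written with a Unix-like system in mind, so it shows the shape of the program rather than reproducing the Windows hang.

```ocaml
open Lwt.Infix

(* Steps 1 and 2: read once from [fd] and push what was read onto the stream. *)
let reader fd push =
  let buf = Bytes.create 1024 in
  Lwt_unix.read fd buf 0 (Bytes.length buf) >|= fun n ->
  push (Some (Bytes.sub_string buf 0 n))

(* Step 3: another thread waits for the next stream element. *)
let consumer stream = Lwt_stream.next stream

let run_once () =
  let stream, push = Lwt_stream.create () in
  (* Assumption: a pipe stands in for the server's fd. *)
  let r, w = Lwt_unix.pipe () in
  (* Start the waiting thread before anything has been read. *)
  let waiter = consumer stream in
  let msg = Bytes.of_string "hello" in
  Lwt_unix.write w msg 0 (Bytes.length msg) >>= fun _ ->
  reader r push >>= fun () ->
  waiter >>= fun received ->
  Lwt_unix.close r >>= fun () ->
  Lwt_unix.close w >|= fun () ->
  received

let () = print_endline (Lwt_main.run (run_once ()))
```

In the failing case described below, the equivalent of `waiter` never resolves.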

Behavior with blocking fd

Server reads from fd and pushes to stream. The thread waiting on the stream wakes up and reads from the stream.

Behavior with non-blocking fd

Server reads from fd and pushes to stream.

The thread waiting on the stream never wakes up.

If I use Lwt_engine.set to install a custom engine, I can confirm that its select method is never called, even though other fds are being read.
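An instrumented engine of that kind can be sketched by subclassing `Lwt_engine.select_based`, which, as far as I know, is also how Lwt's own `select` engine is built. This is a sketch under that assumption; the class name `logging_select` and the counter `select_calls` are hypothetical. Each time the event loop polls for I/O, it logs and counts the call before delegating to `Unix.select`:

```ocaml
(* Number of times the event loop has polled through this engine. *)
let select_calls = ref 0

class logging_select = object
  inherit Lwt_engine.select_based

  (* Called by the engine's iter with the fds Lwt is currently watching. *)
  method private select fds_r fds_w timeout =
    incr select_calls;
    Printf.eprintf "select #%d: %d fds to read, %d to write, timeout %.3f\n%!"
      !select_calls (List.length fds_r) (List.length fds_w) timeout;
    let readable, writable, _ = Unix.select fds_r fds_w [] timeout in
    (readable, writable)
end

(* Replace the current engine with the instrumented one. *)
let () = Lwt_engine.set (new logging_select)
```

If `select_calls` stops increasing while promises are still pending, the loop is not getting back to the point where it polls, which is consistent with some call blocking the main thread.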

My fix

If I run Lwt_unix.wait_read fd before the Lwt_unix.read, it fixes my problem.
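The workaround amounts to something like the following minimal sketch (`read_when_ready` is a hypothetical name):

```ocaml
open Lwt.Infix

(* Wait until Lwt reports [fd] readable, and only then issue the read. *)
let read_when_ready fd buf off len =
  Lwt_unix.wait_read fd >>= fun () ->
  Lwt_unix.read fd buf off len
```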

My best guess

Lwt_unix.readable fd is often false before the Lwt_unix.read call. So my guess is that Lwt_unix.read is effectively a blocking call when using a non-blocking fd, and blocks all of Lwt until the fd is readable and the read call completes.

gabelevi, Apr 12 '18

This may have been fixed by https://github.com/ocsigen/lwt/commit/86a6baff880f60faffccad49425478deb06c277b, part of #569, which isn't part of a release yet.

The underlying problem was also spotted by @gabelevi, so thanks :)

aantron, Apr 13 '18

> This may have been fixed by 86a6baf, part of #569, which isn't part of a release yet.

Unfortunately, that fix is for blocking fds. This issue is for non-blocking fds.

gabelevi, Apr 16 '18

@gabelevi, on Windows, only sockets can be non-blocking (in the Unix sense). Among other problems, you may have a fd that Lwt thinks is non-blocking, but is actually blocking. What kind of fd do you have, and how are you creating the Lwt_unix.file_descr for it?
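One way to check this from inside a program is to ask Lwt which mode it has recorded for a descriptor, using `Lwt_unix.blocking`. The sketch below assumes a readable file at `path` and uses the hypothetical helpers `report` and `check_modes`; it wraps the same file twice with `Lwt_unix.of_unix_file_descr`, once letting Lwt guess the mode and once passing `~blocking:true` explicitly.

```ocaml
open Lwt.Infix

(* Print and return Lwt's view of [ch]: true means Lwt treats it as blocking
   (I/O goes to worker threads), false means non-blocking. *)
let report label ch =
  Lwt_unix.blocking ch >|= fun b ->
  Printf.printf "%s: Lwt treats the fd as %s\n" label
    (if b then "blocking" else "non-blocking");
  b

(* Wrap the same file twice: once with Lwt guessing, once stated explicitly. *)
let check_modes path =
  let guessed =
    Lwt_unix.of_unix_file_descr (Unix.openfile path [Unix.O_RDONLY] 0) in
  let explicit =
    Lwt_unix.of_unix_file_descr ~blocking:true
      (Unix.openfile path [Unix.O_RDONLY] 0) in
  report "guessed" guessed >>= fun g ->
  report "explicit ~blocking:true" explicit >>= fun e ->
  Lwt_unix.close guessed >>= fun () ->
  Lwt_unix.close explicit >|= fun () ->
  (g, e)
```

Passing `~blocking` overrides Lwt's guess, so it is only worth doing when you know the real mode of the underlying handle.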

aantron, Apr 16 '18

@aantron, I think we're hitting a similar issue in esy in this code path:

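    (* Reads the whole file at [path]. The wrapper name read_file is
       illustrative; the body is the two-line snippet from esy's Fs.ml. *)
    let read_file path =
      let f ic = Lwt_io.read ic in
      Lwt_io.with_file ~mode:Lwt_io.Input path f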

https://github.com/esy/esy/blob/571f1c28d15752ec960550abfd25eaa7b4ec8b80/esy-lib/Fs.ml#L14

The issue seems to be that Lwt_io.with_file eventually calls Lwt_unix.openfile with O_NONBLOCK. On Windows, Lwt_unix.openfile calls Unix.openfile, which ignores the O_NONBLOCK flag (as you mentioned, files can't be non-blocking on Windows).

The Unix.openfile function is implemented for Windows here: https://github.com/ocaml/ocaml/blob/db9671f67b47a582b078f1a92e42edfd9f6a0fd7/otherlibs/win32unix/open.c#L44

...which is where O_NONBLOCK is ignored. But I think this is a problematic case for us.

cc @andreypopp

bryphe, Nov 05 '18

@bryphe I don't think O_NONBLOCK is the problem:

  1. Lwt_io.with_file eventually does call Lwt_unix.openfile with O_NONBLOCK, and that flag is indeed meaningless (on any system for a regular file).

  2. However, for making the Lwt_unix.file_descr, Lwt's "view" of the fd, it calls Lwt_unix.of_unix_file_descr:

    https://github.com/ocsigen/lwt/blob/596058030a25202a4220fa650278db88043573c1/src/unix/lwt_unix.cppo.ml#L593-L595

    which is an alias

    https://github.com/ocsigen/lwt/blob/596058030a25202a4220fa650278db88043573c1/src/unix/lwt_unix.cppo.ml#L426

    for mk_ch:

    https://github.com/ocsigen/lwt/blob/596058030a25202a4220fa650278db88043573c1/src/unix/lwt_unix.cppo.ml#L335-L344

    which calls is_blocking to guess the blocking mode:

    https://github.com/ocsigen/lwt/blob/596058030a25202a4220fa650278db88043573c1/src/unix/lwt_unix.cppo.ml#L290-L312

    but since this is Win32 and the file is not a socket (presumably, it is a HANDLE underneath, not SOCKET), and the caller did not supply ~blocking, is_blocking correctly decides that the file is opened in blocking mode, which should cause Lwt to run I/O requests on it in worker threads (to avoid blocking the main thread).
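The decision described in point 2 can be summarised as a small pure function. This is a simplified model based on this thread, not Lwt's implementation: `kind`, `Socket`, `Handle` and `win32_treated_as_blocking` are hypothetical names, and the rule that an explicit `~blocking` is taken at face value is an assumption, consistent with the earlier remark that Lwt may think a fd is non-blocking when it is actually blocking.

```ocaml
(* What the OS object behind the fd is, in this simplified model. *)
type kind = Socket | Handle

(* Model of the Win32 choice: returns true when Lwt would treat the fd
   as blocking and run its I/O in worker threads. *)
let win32_treated_as_blocking ~kind ~blocking =
  match blocking, kind with
  | Some b, _ -> b          (* assumption: the caller's ~blocking is trusted *)
  | None, Socket -> false   (* only sockets can be non-blocking on Windows *)
  | None, Handle -> true    (* e.g. a regular file: blocking, use workers *)
```

The last case is the one that applies to `Lwt_io.with_file` on a regular file.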

So, something else must be the problem.

The way I would debug this is by inserting prints into both the "user" code and Lwt (as needed), to see what is actually being triggered and what the various values are. I presume you are doing something like this on your end right now. I can also help out, but to resolve this kind of issue, I need to be able to reproduce it, whether in a small case, or if you can provide instructions on what full projects I should build, in what environment, etc.
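As a starting point for that kind of tracing, here is a hypothetical wrapper (`traced_read` and `heartbeat` are illustrative names, not Lwt functions) that logs Lwt's recorded mode and the elapsed time around each read, plus a periodic heartbeat. If the heartbeat stops printing while a read is pending, the whole Lwt loop is blocked, which is the symptom in the original report.

```ocaml
open Lwt.Infix

(* Log Lwt's view of [fd] before the read and the timing after it. *)
let traced_read label fd buf off len =
  Lwt_unix.blocking fd >>= fun blocking ->
  let start = Unix.gettimeofday () in
  Printf.eprintf "[%s] read starting (Lwt mode: blocking=%b)\n%!" label blocking;
  Lwt_unix.read fd buf off len >|= fun n ->
  Printf.eprintf "[%s] read returned %d bytes after %.3fs\n%!"
    label n (Unix.gettimeofday () -. start);
  n

(* Prints every 0.5 s for as long as the Lwt loop keeps running. *)
let rec heartbeat () =
  Lwt_unix.sleep 0.5 >>= fun () ->
  prerr_endline "[heartbeat] Lwt loop is running";
  heartbeat ()
```

Start the heartbeat once with `Lwt.async heartbeat` before the code under test runs.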

aantron, Nov 06 '18

Ah ya, thanks for the details @aantron, you're right: given that, the issue we faced in esy/esy#539 does not seem to be the same as this one. In this issue the entire thread is blocked, which is not what we're experiencing. We also haven't been able to show conclusively that Lwt is to blame yet. I think what would be useful for us to see in our repro is whether the close method is being called successfully and, at that point, whether there are still any open handles on the file.

bryphe, Nov 07 '18

You may be able to get some information by attaching a finalizer to the Lwt_io channel (i.e., the argument Lwt_io.with_file passes to your function) using Gc.finalise, and triggering a garbage collection with Gc.full_major when the with_file promise resolves. See https://caml.inria.fr/pub/docs/manual-ocaml/libref/Gc.html.
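A minimal sketch of that suggestion, applied to the esy-style read shown earlier in the thread (`read_file_traced` and `channel_finalised` are hypothetical names). The finalizer only runs once the channel is unreachable, and any reference still held elsewhere keeps it alive, so treat a `false` result as a hint rather than proof.

```ocaml
open Lwt.Infix

(* Set to true if the finalizer attached to the channel runs. *)
let channel_finalised = ref false

let read_file_traced path =
  Lwt_io.with_file ~mode:Lwt_io.Input path (fun ic ->
      (* Attach a finalizer to the channel that with_file passes in. *)
      Gc.finalise (fun _ -> channel_finalised := true) ic;
      Lwt_io.read ic)
  >|= fun contents ->
  (* The with_file promise has resolved; force a full major collection. *)
  Gc.full_major ();
  Printf.printf "channel finalised: %b\n" !channel_finalised;
  contents
```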

If the finaliser is called, at least we know nothing is keeping a reference to the Lwt file descriptor, which is what we would expect.

That doesn't mean the underlying OS fd is closed, or the file isn't open through another fd elsewhere, but it does give some info for relatively low effort.

aantron, Nov 07 '18