[Python][FlightRPC] Feature request: Add some sort of fork protection to PyArrow Flight
Describe the enhancement requested
GRPC (and by extension FlightRPC) isn't fork-safe:
gRPC Python wraps gRPC core, which uses multithreading for performance, and hence doesn't support fork() [Source]
While it may be generally expected for library users to consider thread/fork safety while using such mechanisms, Python's multiprocessing package makes it very easy for users who are unfamiliar with these concepts to run into trouble. In the case of FlightRPC, users can get intro trouble when they fork after instantiating a FlightClient in a failed attempt to increase performance when they should fork before importing and using pyarrow.flight or probably consider another approach. When operators of FlightRPC servers encounter users making this type of mistake, debugging can be hard because the errors seen on the server may make little sense since the clients are essentially broken.
To improve the experience for Flight adopters, I suggest considering either or both of:
- Improved documentation about the use of Flight and
multiprocessingor similar libraries - Build some sort of explicit fork prevention into Flight
There are a couple of different concerns here:
- Where to put such a prevention: I think it makes more sense to put this into PyArrow Flight rather than C++ so that not all implementations need to pay any cost associated with it. Plus, it's a somewhat common pattern in the Python ecosystem to protect users from common mistakes whereas it's not so much in C++.
- How it could be done: Python has a mechanism to register fork handlers but I don't think fork is the only way to run into trouble so I think a PID-check of sorts similar to how fsspec does it could be built into FlightClient and would let us offer the highest level of prevention.
I'm curious if others think this is a good idea or not and if anyone has thoughts on the approach.
Component(s)
FlightRPC, Python
I'm pinging a few folks I think it would be good to get feedback from: @pitrou, @jorisvandenbossche, @zeroshade, @lidavidm. Anyone else is also encouraged to chime in too.
It would be good to find out exactly what doesn't work:
- creating a Flight client, forking, and using the client from the parent process?
- creating a Flight client, forking, and using the client from the child process?
- something else
Thanks @pitrou, I can work that up and report back.
I think a PID check or similar in the Python bindings would be sufficient.
In the case of FlightRPC, users can get intro trouble when they fork after instantiating a FlightClient...
Does the same logic hold for a FlightServer instance too? Seems like it would do.
Yes, it should also hold for FlightServer, @nph. Do you have a use case where it'd be useful to have a similar check in FlightServer?
Thanks @amoeba. I do have a use case for which a similar check in FlightServer would be useful. I have a FlightServer that serves batches of features to the client for ML training/inference tasks. The data sampling (from Arrow datasets) and feature computation on the server can take advantage of multiprocessing. Happy to discuss more.
So what would you say the check would help you with in this case?
The idea of adding fork prevention in FlightClient is that the user would get a warning or error and would then they would know to rewrite their code. This is relevant for adopters of Flight because they might not control the code that gets written on the client side and this check would help save them time debugging and educating their users.
Is there a parallel for the server side? If you have some general questions about building scalable services on top of Arrow and Flight, we can discuss that elsewhere.
That all makes sense - thanks for clarifying @amoeba. I can't think of an obvious parallel on the server side but if the check is relatively lightweight then it seems like it might be worthwhile adding it to FlightServer? Or perhaps just improving the docs as per your original suggestion above. Cheers.
I think maybe for the scope of this ticket, adding a PID-check just to FlightClient and improving the docs would be good start. If we find others are interested in having the same check on FlightServer, it would be pretty easy to re-use the implementation here. I'll let others chime in too.
If you have any questions about scaling Flight or find the project is lacking documentation you wish it had, feel free to open another issue or reach out on the mailing list (See https://arrow.apache.org/community/).
Sounds good 👍
A PID check won't tell you anything in the parent, so if that's what you're looking for, you can probably register atfork handlers.
Hi @pitrou, I did some testing so we're clear what actually breaks and how. I've pushed four separate reproducers to a branch on my fork at https://github.com/amoeba/arrow/tree/gh-38617/gh-38617 (see README in that folder).
I tested using os.fork directly and also multiprocessing.Pool (the latter was how this issue was originally encountered) and the summary of what I found is:
- Forking then using a FlightClient from the parent process seems to work fine
- Other combinations do not:
- Forking then using a FlightClient from the child process: The RPC never appears to run, the server never appears to see it.
- Forking then using the FlightClient from both the parent and the child: Different failures happen including segfault, sigbus, hangs, cryptic GRPC errors
- Forking using
multiprocesessing.Pool, giving the workers access to the FlightClient via global (since we can't pickle): No child calls to the RPC ever appear to start, they just hang.
Let me know if other tests would be useful and if the ones I've done make sense.
Ok, so any situation where the Flight component is used from a process which is not the process used to load it initially should be flagged, right? Something like "remember the PID at DLL load time or module import time, then return an error if trying to do something from another PID"?
Yeah, I think that's the situation we want to cover here.
I was originally thinking this could just be built into FlightClient and not the entire DLL/module: Save the PID when we create a FlightClient, do a PID-check either on every RPC or in a background thread like it appears fsspec does, and raise an Exception when the PID doesn't match. My reasoning for putting the check just in FlightClient was to limit the impact to just where we expect the check to be useful since we the two methods have overhead. Would remembering the PID at DLL load or module import work better?
When you say "Forking then using a FlightClient from the child process" above, was the FlightClient created in the parent before forking or in the child?
Also, what happens if you create a first Flight client in the parent before forking, then create another one from child?
When you say "Forking then using a FlightClient from the child process" above, was the FlightClient created in the parent before forking or in the child?
Yeah, I should've made this clear. In these tests the client was created in the parent prior to forking.
Also, what happens if you create a first Flight client in the parent before forking, then create another one from child?
I ran this a bunch of times and see different behaviors at random, similar to what I saw above using a single client from both the parent and child:
- Client hangs, server doesn't see any RPCs, eventually fails with
pyarrow._flight.FlightUnavailableError: Flight returned unavailable error, with message: failed to connect to all addresses; last error: UNKNOWN: ipv4:0.0.0.0:8815: tcp handshaker shutdown - Client hangs after server sees one RPC, I eventually interrupt it
- Client runs and exits cleanly, only one RPC (from the parent) gets called
- Client process segfaults after the server sees one RPC. I think this only happens if I don't restart the server between test runs.
Right, so it seems we should store a global PID at the DLL level, since even a client created in a forked child isn't safe to use.