modal-client
modal-client copied to clipboard
Verify state of TCP sockets when creating memory snapshots
A program with open TCP connections will emit a clear error and indicate the connections open. Open connections prevent memory snapshots from succeeding. Adding this check allows for us to turn on networking during memory snapshots.
[Thread-1 (thread_inner)] 2024-07-23T21:09:55+0000 Remote Address: 94.140.14.14:80, Status: ESTABLISHED
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/pkg/modal/_container_entrypoint.py", line 862, in <module>
main(container_args, client)
File "/usr/local/lib/python3.11/contextlib.py", line 222, in __aexit__
await self.gen.athrow(typ, value, traceback)
File "/pkg/modal/_container_io_manager.py", line 210, in heartbeats
yield
File "/pkg/modal/_container_entrypoint.py", line 788, in main
container_io_manager.memory_snapshot()
File "/pkg/synchronicity/synchronizer.py", line 531, in proxy_method
return wrapped_method(instance, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pkg/synchronicity/combined_types.py", line 28, in __call__
raise uc_exc.exc from None
File "/pkg/modal/_container_io_manager.py", line 646, in memory_snapshot
raise ConnectionError(
ConnectionError: Cannot checkpoint container with open network connections. Are you closing connections before checkpointing?
However, even with this check the following program will fail. This is because closed sockets will wait for a TIME_WAIT period (default of 60 seconds) before closing (details here).
import socket
import requests
import modal
app = modal.App(image=modal.Image.debian_slim().pip_install("requests"))
@app.cls(enable_memory_snapshot=True, max_inputs=1, secrets=[modal.Secret.from_dict({"MODAL_LOGLEVEL": "DEBUG"})])
class SnapshotClass:
def __init__(self):
self._socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
@modal.enter(snap=True)
def lingering_open_sockets(self):
remote_ip = "94.140.14.14" # AdGuard DNS
self._socket.connect((remote_ip, 80))
self._socket.close()
_ = requests.get("https://modal.com")
@modal.method()
def successful_restore(self):
print("connection closed")
@app.local_entrypoint()
def main():
s = SnapshotClass()
s.successful_restore.remote()
I am patching socket to fix that. All socket.close() connections use SO_LINGER while snapshotting, forcing sockets to close immediately. Negative implications of setting this here are minimal. We remove the patch when the container is restored.
Previous work https://github.com/modal-labs/modal-client/pull/1347
@luiscape think you need a @skip somewhere because of a windows test incompatibility
=================================== ERRORS ====================================
__________________ ERROR collecting modal/_vendor/psutil.py ___________________
modal\_vendor\psutil.py:386: in <module>
_connections = Connections()
modal\_vendor\psutil.py:191: in __init__
unix = ("unix", socket.AF_UNIX, None)
E AttributeError: module 'socket' has no attribute 'AF_UNIX'
=========================== short test summary info ===========================
ERROR modal/_vendor/psutil.py - AttributeError: module 'socket' has no attribute 'AF_UNIX'
!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
============================== 1 error in 0.53s ===============================
It seems a bit janky to patch sockets from inside the client – is there a way we can do it from the outside (from the worker)?
It seems a bit janky to patch sockets from inside the client – is there a way we can do it from the outside (from the worker)?
We need to patch the socket-closing behavior only and only during snapshotting. This can also be done by changing kernel configs tcp_fin_timeout, tcp_tw_recycle, and tcp_tw_reuse but we can't change those inside a container without CAP_NET_ADMIN (we won't do this). Afaik this can't be changed from the worker because a socket inside a container isn't visible from outside because gVisor virtualizes it internally.
This is the least janky, safest solution I could find.
Could we make an upstream change to handle this in the runtime, or perhaps just raise an issue describing our problem?
Could we make an upstream change to handle this in the runtime, or perhaps just raise an issue describing our problem?
We can but they are implementing the correct TIME_WAIT behavior. Not sure what the "issue" is for them. Thoughts?
idk if there's some hacky way we can do it from the outside like this maybe? https://superuser.com/questions/127863/manually-closing-a-port-from-commandline
idk if there's some hacky way we can do it from the outside like this maybe?
Let me PoC this in the worker. I think it's possible to just set the right options to open sockets (aka setsockopt) without calling close from the outside. This might be enough.
It turns out that we have a way to configure gvisor to ignore these lingering sockets. We'll go with that solution (see https://github.com/modal-labs/modal/pull/14923) and remove the socket patching.
@luiscape means this can be closed right?
@thundergolfer no. I'll remove the socket patch and leave the error indicating that connections are open (will prevent snapshotting failures with gVisor stack traces).
I am closing this because after our internal change connections are now forcefully closed but sockets remain open. In the worst-case-scenario programs will fail a timeout because of a request without a response but that's very unlikely in the case of snapshotting. Programs will not crash upon restore because socket file descriptors will be restored.
I tested this with the following program:
import socket
import modal
app = modal.App("socket-snapshot-test", image=modal.Image.debian_slim())
@app.cls(enable_memory_snapshot=True, max_inputs=1)
class SnapshotClass:
def __init__(self):
self._socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
@modal.enter(snap=True)
def lingering_open_sockets(self):
remote_ip = "94.140.14.14" # AdGuard DNS
self._socket.connect((remote_ip, 80))
@modal.method()
def successful_restore(self):
if self._socket.fileno() == -1:
print("Socket is closed")
else:
print("Socket is open")
if __name__ == "__main__":
snapshot_class = modal.Cls.lookup("socket-snapshot-test", "SnapshotClass")
snapshot_class.successful_restore.remote()
Snapshot / restore work as expected:
Starting function memory snapshotting. Runner fingerprint 5ab4be0
Snapshot created. Restoring Function from memory snapshot.
Restoring memory snapshot for function. Runner fingerprint: 5ab4be0
Socket is open