modal-client icon indicating copy to clipboard operation
modal-client copied to clipboard

Verify state of TCP sockets when creating memory snapshots

Open luiscape opened this issue 1 year ago • 7 comments

A program with open TCP connections will emit a clear error and indicate the connections open. Open connections prevent memory snapshots from succeeding. Adding this check allows for us to turn on networking during memory snapshots.

[Thread-1 (thread_inner)] 2024-07-23T21:09:55+0000 Remote Address: 94.140.14.14:80, Status: ESTABLISHED
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/pkg/modal/_container_entrypoint.py", line 862, in <module>
    main(container_args, client)
  File "/usr/local/lib/python3.11/contextlib.py", line 222, in __aexit__
    await self.gen.athrow(typ, value, traceback)
  File "/pkg/modal/_container_io_manager.py", line 210, in heartbeats
    yield
  File "/pkg/modal/_container_entrypoint.py", line 788, in main
    container_io_manager.memory_snapshot()
  File "/pkg/synchronicity/synchronizer.py", line 531, in proxy_method
    return wrapped_method(instance, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pkg/synchronicity/combined_types.py", line 28, in __call__
    raise uc_exc.exc from None
  File "/pkg/modal/_container_io_manager.py", line 646, in memory_snapshot
    raise ConnectionError(
ConnectionError: Cannot checkpoint container with open network connections. Are you closing connections before checkpointing?

However, even with this check the following program will fail. This is because closed sockets will wait for a TIME_WAIT period (default of 60 seconds) before closing (details here).

import socket
import requests
import modal

app = modal.App(image=modal.Image.debian_slim().pip_install("requests"))


@app.cls(enable_memory_snapshot=True, max_inputs=1, secrets=[modal.Secret.from_dict({"MODAL_LOGLEVEL": "DEBUG"})])
class SnapshotClass:
    def __init__(self):
        self._socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    @modal.enter(snap=True)
    def lingering_open_sockets(self):
        remote_ip = "94.140.14.14"  # AdGuard DNS
        self._socket.connect((remote_ip, 80))
        self._socket.close()

        _ = requests.get("https://modal.com")

    @modal.method()
    def successful_restore(self):
        print("connection closed")


@app.local_entrypoint()
def main():
    s = SnapshotClass()
    s.successful_restore.remote()

I am patching socket to fix that. All socket.close() connections use SO_LINGER while snapshotting, forcing sockets to close immediately. Negative implications of setting this here are minimal. We remove the patch when the container is restored.

Previous work https://github.com/modal-labs/modal-client/pull/1347

luiscape avatar Jul 23 '24 21:07 luiscape

@luiscape think you need a @skip somewhere because of a windows test incompatibility

=================================== ERRORS ====================================
__________________ ERROR collecting modal/_vendor/psutil.py ___________________
modal\_vendor\psutil.py:386: in <module>
    _connections = Connections()
modal\_vendor\psutil.py:191: in __init__
    unix = ("unix", socket.AF_UNIX, None)
E   AttributeError: module 'socket' has no attribute 'AF_UNIX'
=========================== short test summary info ===========================
ERROR modal/_vendor/psutil.py - AttributeError: module 'socket' has no attribute 'AF_UNIX'
!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
============================== 1 error in 0.53s ===============================

thundergolfer avatar Jul 24 '24 15:07 thundergolfer

It seems a bit janky to patch sockets from inside the client – is there a way we can do it from the outside (from the worker)?

erikbern avatar Jul 24 '24 18:07 erikbern

It seems a bit janky to patch sockets from inside the client – is there a way we can do it from the outside (from the worker)?

We need to patch the socket-closing behavior only and only during snapshotting. This can also be done by changing kernel configs tcp_fin_timeout, tcp_tw_recycle, and tcp_tw_reuse but we can't change those inside a container without CAP_NET_ADMIN (we won't do this). Afaik this can't be changed from the worker because a socket inside a container isn't visible from outside because gVisor virtualizes it internally.

This is the least janky, safest solution I could find.

luiscape avatar Jul 24 '24 19:07 luiscape

Could we make an upstream change to handle this in the runtime, or perhaps just raise an issue describing our problem?

thundergolfer avatar Jul 24 '24 20:07 thundergolfer

Could we make an upstream change to handle this in the runtime, or perhaps just raise an issue describing our problem?

We can but they are implementing the correct TIME_WAIT behavior. Not sure what the "issue" is for them. Thoughts?

luiscape avatar Jul 26 '24 20:07 luiscape

idk if there's some hacky way we can do it from the outside like this maybe? https://superuser.com/questions/127863/manually-closing-a-port-from-commandline

erikbern avatar Jul 27 '24 02:07 erikbern

idk if there's some hacky way we can do it from the outside like this maybe?

Let me PoC this in the worker. I think it's possible to just set the right options to open sockets (aka setsockopt) without calling close from the outside. This might be enough.

luiscape avatar Jul 31 '24 13:07 luiscape

It turns out that we have a way to configure gvisor to ignore these lingering sockets. We'll go with that solution (see https://github.com/modal-labs/modal/pull/14923) and remove the socket patching.

luiscape avatar Aug 16 '24 20:08 luiscape

@luiscape means this can be closed right?

thundergolfer avatar Aug 16 '24 23:08 thundergolfer

@thundergolfer no. I'll remove the socket patch and leave the error indicating that connections are open (will prevent snapshotting failures with gVisor stack traces).

luiscape avatar Aug 17 '24 16:08 luiscape

I am closing this because after our internal change connections are now forcefully closed but sockets remain open. In the worst-case-scenario programs will fail a timeout because of a request without a response but that's very unlikely in the case of snapshotting. Programs will not crash upon restore because socket file descriptors will be restored.

I tested this with the following program:

import socket

import modal

app = modal.App("socket-snapshot-test", image=modal.Image.debian_slim())


@app.cls(enable_memory_snapshot=True, max_inputs=1)
class SnapshotClass:
    def __init__(self):
        self._socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    @modal.enter(snap=True)
    def lingering_open_sockets(self):
        remote_ip = "94.140.14.14"  # AdGuard DNS
        self._socket.connect((remote_ip, 80))

    @modal.method()
    def successful_restore(self):
        if self._socket.fileno() == -1:
            print("Socket is closed")
        else:
            print("Socket is open")


if __name__ == "__main__":
    snapshot_class = modal.Cls.lookup("socket-snapshot-test", "SnapshotClass")
    snapshot_class.successful_restore.remote()

Snapshot / restore work as expected:

Starting function memory snapshotting. Runner fingerprint 5ab4be0
Snapshot created. Restoring Function from memory snapshot.
Restoring memory snapshot for function. Runner fingerprint: 5ab4be0
Socket is open

luiscape avatar Aug 19 '24 18:08 luiscape