
How should the server cleanup after the client?

e-kayrakli opened this issue 2 months ago • 9 comments

Right now, the __del__ method is how pdarrays are cleaned up from the server's symbol table. This relies on the Python interpreter calling __del__. However, when __del__ is called is not well-defined: the interpreter can choose to destroy objects in any order, for example. More concerning, __del__ is not guaranteed to be called at all for objects that are still being held by the interpreter at the end of a script.

Because of this, we currently leave memory allocated on the server at the end of a script's execution. By design, the server should be able to keep running after a batch script finishes, but because of this leak we run into OOM issues when multiple scripts are run in batch against a single server session. That's how our nightly performance infrastructure works today.
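
To make the failure mode concrete, here is a minimal, untested sketch (ServerHandle is a hypothetical stand-in for a pdarray-like handle, not actual Arkouda code):

class ServerHandle:
  def __del__(self):
    # Would send a "delete" message to the server here.
    print("freeing server-side allocation")

# Held by a module-level global until interpreter shutdown; CPython does
# not guarantee that handle.__del__ ever runs in this case, so the
# server-side allocation can be leaked.
handle = ServerHandle()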

I have several proposals to mitigate the situation. They get progressively bigger in terms of interface/behavior.

  1. Add an ak.cleanup method of sorts that removes all allocations held by the calling client. It doesn't result in a disconnect or shutdown, and no other behavior changes, so this is a 100% additive change.

  2. Make ak.disconnect call cleanup. One side of me thinks that this is too big a hammer -- I might disconnect now only to connect back later; why should that cause my data to be deleted? However, Arkouda keeps its data strictly in memory, which is scarce. So, by default, a client disconnecting should wipe their data.

  3. Add pdarray.persist to keep a pdarray in the symbol table even after disconnect. cleanup still removes this data. Or maybe that behavior is controlled with a flag.

  4. Register cleanup with atexit when connect is called and returns successfully. This can guarantee that the server's allocations will be freed when the client exits even if there was no call to disconnect.

  5. Add some kind of context manager so users can write with ak.server(...): and have connect/cleanup/disconnect handled under the hood (see the sketch after this list). This is also a purely additive interface that can be added regardless of any other items here.
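
As a rough illustration of item 5, an untested sketch of such a context manager (ak.cleanup is the hypothetical method from item 1; ak.connect/ak.disconnect are the existing calls):

import contextlib
import arkouda as ak

@contextlib.contextmanager
def server(*args, **kwargs):
  ak.connect(*args, **kwargs)
  try:
    yield
  finally:
    ak.cleanup()     # hypothetical: free this client's server-side allocations
    ak.disconnect()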

e-kayrakli avatar Nov 05 '25 20:11 e-kayrakli

weakref.finalize could be an alternative to atexit for item 4.

e-kayrakli avatar Nov 05 '25 21:11 e-kayrakli

A little more context for weakref.finalize, just because I proposed it:

  1. Oftentimes __del__ in the data model is incorrectly used as a "destructor" to run some code when the reference count of an object hits zero. Python does not make this guarantee, and __del__ comes with some caveats: it may not be called at all (consider an object whose reference count is still >= 1 when the interpreter exits), it may be called more than once (currently only possible on non-CPython implementations), it automatically suppresses exceptions, etc.

  2. If the __del__ methods are meant to act as "destructors" and the problem is specifically that __del__ "is not guaranteed to be called for objects that are still being held by the interpreter at the end of a script", the data model docs for __del__ recommend transitioning to weakref.finalize.

  3. weakref.finalize is implemented in such a way that finalizers are guaranteed to be called before the interpreter shuts down AND before globals have been garbage collected.

  4. weakref.finalize, like __del__, will still suppress exceptions if the finalizer is called as part of garbage collection (but not if it is invoked manually).

Ultimately, if what we are looking for is a "deinit" method to run when a reference count hits zero, I would strongly recommend weakref.finalize, but if we are looking for something more akin to a "clean up ALL the memory for this entire connection" type of solution, an ak.cleanup/atexit.register solution would be more appropriate.
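
For concreteness, an untested sketch of the finalize pattern (delete_from_symbol_table is a hypothetical stand-in for whatever delete request the client sends):

import weakref

def delete_from_symbol_table(name):
  # Hypothetical: send a "delete <name>" request to the server.
  ...

class pdarray:
  def __init__(self, name):
    self.name = name
    # The callback must not reference self, or the object would never
    # become unreachable; pass only the symbol table name.
    self._finalizer = weakref.finalize(self, delete_from_symbol_table, name)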

MattToast avatar Nov 05 '25 22:11 MattToast

Thanks, @MattToast, that's really helpful.

Ultimately, if what we are looking for is a "deinit" method to run when a reference count hits zero, I would strongly recommend weakref.finalize, but if we are looking for something more akin to a "clean up ALL the memory for this entire connection" type of solution, an ak.cleanup/atexit.register solution would be more appropriate.

My reflex here is "why not both?"

While my main motivator for opening this issue is the latter -- I noticed some large allocations kept by the server after the client interpreter exited -- refcount-based garbage collection also makes sense, at least for pdarrays. Here's an untested snippet where I think weakref.finalize can help, but atexit is irrelevant:

import arkouda as ak

if some_flag:
  arr = ak.zeros(1_000_000_000)  # allocates 8 GB of memory on the server
  do_something(arr)

# I will not use arr again, but technically I still can in Python.
# Is arr freed here or not? Without weakref, I think there is no clear
# answer; it depends on what the interpreter/Python runtime wants at
# that moment.

if some_flag:
  brr = ak.zeros(1_000_000_000)  # allocates another 8 GB of memory on the server
  do_something(brr)

I expect Python interpreters to be smart enough to know when to reclaim memory. However, they can only see the client's memory consumption and make decisions based solely on that. In Arkouda, the client's memory footprint is mostly irrelevant. So, tighter memory reclamation for the server via something like weakref.finalize definitely seems worthwhile.

As for atexit, I think in general the client should always inform the server that it is about to drop the connection. Currently, the user is not forced to do that themselves (and I don't think they should be). But I think there should be safeguards that inform the server for scripts that end relatively gracefully but without an explicit disconnect. This is more about better session management, though it can have memory implications if we decide that disconnect implies cleanup, which is proposed in (2) above.
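
Something like this untested sketch, registered as part of a successful connect (ak.cleanup is still the hypothetical method from proposal (1), and the connected check is an assumption about the client's state tracking):

import atexit
import arkouda as ak

def _teardown():
  # Safeguard for scripts that end without an explicit disconnect.
  if ak.client.connected:  # assumption: some "am I connected?" state exists
    ak.cleanup()           # hypothetical, from proposal (1)
    ak.disconnect()

atexit.register(_teardown)  # would be called inside ak.connect on success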

e-kayrakli avatar Nov 06 '25 14:11 e-kayrakli

I’d like to propose we combine two ideas:

  • weakref.finalize on the Python side
  • a simple session-ID mechanism on the server to make Arkouda’s memory cleanup reliable.

The server would assign each client a session ID and tag all allocations with it, so the server can always answer “which arrays belong to this client?” and safely bulk-delete them when appropriate. On the Python side, we replace __del__ with weakref.finalize, which lets us send delete requests promptly when objects become unreachable during normal execution. That gives us incremental cleanup while the script runs, and the session teardown gives us a hard guarantee that anything left over (due to reference cycles, missed finalizers, or interpreter shutdown) gets reclaimed. Cleanup could be made optional: disconnect can call it by default to avoid leaks, but users who intentionally plan to reconnect and preserve state could choose disconnect(skip_cleanup=True).
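
As a toy model of the server-side bookkeeping this implies (in Python for brevity; the real symbol table lives in the Chapel server, and these names are made up):

class SymbolTable:
  def __init__(self):
    self.entries = {}  # name -> (session_id, payload)

  def add(self, session_id, name, payload):
    self.entries[name] = (session_id, payload)

  def delete_session(self, session_id):
    # Bulk-delete everything the disconnecting client left behind.
    doomed = [n for n, (sid, _) in self.entries.items() if sid == session_id]
    for name in doomed:
      del self.entries[name]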

ajpotts avatar Dec 09 '25 18:12 ajpotts

a simple session-ID mechanism on the server to make Arkouda’s memory cleanup reliable.

Should this involve some kind of heartbeat from the server to the client to check if the client is still there?

e-kayrakli avatar Dec 09 '25 21:12 e-kayrakli

I thought it could be triggered from the client side and wouldn't need that in the normal situation. If we want it to be resilient to improper disconnects, then adding that might be a good idea eventually.

ajpotts avatar Dec 10 '25 00:12 ajpotts

Oh, I see. Also, re-reading your comment: essentially, disconnect would trigger that ID-based cleanup on the server side on a graceful exit.

How do you feel about using a context manager instead of the uncoupled connect/disconnect calls we have today? I can imagine users forgoing ak.disconnect; it is definitely easy to do.

A more "sanctioned" way of using the server from the client could look like:

def main():
  # do your thing here
  ...

with ak.server():
  # do your thing here, or:
  main()

e-kayrakli avatar Dec 10 '25 16:12 e-kayrakli

I'm not sure that would work well with the data science community, especially with how jupyter notebooks work.

ajpotts avatar Dec 16 '25 14:12 ajpotts

I'm not sure that would work well with the data science community, especially with how jupyter notebooks work.

Oh, I think I see. You mean that introducing big scopes to put all your code in is not very amenable to a style of programming where you have small snippets intermingled with non-code, etc. Is that right? Or is there more to it?

e-kayrakli avatar Dec 17 '25 15:12 e-kayrakli