Fix async service cleanup in LocalBackend.close()

Open prashant182 opened this issue 7 months ago • 3 comments

Summary

  • Fixed a critical resource cleanup bug in LocalBackend.close(): async service close methods were called but never awaited, so they did not run
  • Made LocalBackend.close() async to properly handle async service cleanup and match the base Backend class interface
  • Updated __exit__ method to handle both sync and async contexts appropriately

Problem

The close() method was synchronous but attempted to call close() on services that may have async close methods. This meant:

  • Async close methods returned coroutine objects instead of executing
  • Memory leaks and zombie processes accumulated over time
  • Port conflicts occurred when restarting services
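
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual LocalBackend code: `AsyncService` and `sync_close_all` are hypothetical names used only for illustration. It shows that calling an `async def close()` from synchronous code yields a coroutine object and leaves the cleanup body unexecuted.

```python
import asyncio


class AsyncService:
    """Hypothetical stand-in for a service with an async close method."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


def sync_close_all(services):
    """Buggy pattern: calls close() but never awaits the result."""
    return [service.close() for service in services]


service = AsyncService()
pending = sync_close_all([service])

# The call produced a coroutine object; the body of close() never ran.
returned_coroutine = asyncio.iscoroutine(pending[0])
still_open = not service.closed

# Discard the unstarted coroutines explicitly so Python does not emit
# a "coroutine ... was never awaited" RuntimeWarning.
for coro in pending:
    coro.close()
```

Here `returned_coroutine` and `still_open` are both true, which matches the first bullet: the service's resources stay allocated even though `close()` appeared to be called.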

Solution

  • Made LocalBackend.close() async and added proper async/sync detection
  • Service close methods are now awaited when they return a coroutine
  • Maintained backward compatibility for synchronous close methods
  • Updated __exit__ method to handle event loop contexts correctly
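
One way the approach above could look is sketched below. This is an illustration under assumptions, not the code in this PR: `SketchBackend`, `SyncService`, `AsyncService`, and the `_services` dict are hypothetical, and the `__exit__` strategy (run the cleanup with `asyncio.run` when no loop is running, otherwise schedule it as a task) is only one possible reading of "handle event loop contexts correctly".

```python
import asyncio
import inspect


class SyncService:
    """Hypothetical service with a synchronous close method."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class AsyncService:
    """Hypothetical service with an async close method."""

    def __init__(self):
        self.closed = False

    async def close(self):
        await asyncio.sleep(0)  # stands in for real async teardown
        self.closed = True


class SketchBackend:
    """Illustrative backend; not the real LocalBackend."""

    def __init__(self, services):
        self._services = dict(services)
        self._close_task = None

    async def close(self):
        for service in list(self._services.values()):
            close = getattr(service, "close", None)
            if close is None:
                continue
            result = close()
            # Sync close methods have already run at this point;
            # async ones return an awaitable that must be awaited.
            if inspect.isawaitable(result):
                await result
        self._services.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            loop = asyncio.get_running_loop()
        except RuntimeError:
            # No running loop: drive the async cleanup to completion.
            asyncio.run(self.close())
        else:
            # Inside a running loop we cannot block on it, so schedule
            # the cleanup and keep a reference to the task (assumption).
            self._close_task = loop.create_task(self.close())
        return False


# Usage from plain synchronous code.
sync_svc, async_svc = SyncService(), AsyncService()
with SketchBackend({"sync": sync_svc, "async": async_svc}):
    pass

# Usage from async code: await close() directly.
sync_svc2, async_svc2 = SyncService(), AsyncService()
asyncio.run(SketchBackend({"sync": sync_svc2, "async": async_svc2}).close())
```

Checking the return value with `inspect.isawaitable` rather than inspecting the method itself keeps synchronous close methods working unchanged, which is the backward-compatibility point in the list above. Note that in the running-loop branch the cleanup is only scheduled, so callers in async code would be better served by awaiting `close()` directly.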

Impact

This fix prevents:

  • GPU memory leaks from improperly closed vLLM engines
  • Zombie training processes
  • Port conflicts on service restart

These are critical for production stability and cost management in ML training environments.

Test Plan

  • [x] All existing tests pass
  • [x] Code formatting and linting checks pass
  • [x] Manual verification that both sync and async service close methods work correctly

prashant182 · Aug 06 '25 22:08