DataFed icon indicating copy to clipboard operation
DataFed copied to clipboard

[DAPS-1395] Address Memory Leaks in Core Server

Open JoshuaSBrown opened this issue 8 months ago • 5 comments

PR Description

This PR addresses several memory bugs discovered in the common library and the core server. Though the source code coverage is not exhaustive, it should significantly reduce the need to restart the core server. The main source of error and of bugs is related first to the absence of virtual destructors in derived classes that contain unique_ptrs. By not including them, the base destructor was being called which was failing to cleanup the memory. In addition, cleanup of the ZeroMQ context was addressed. A special note, the ZeroMQ context should be the same when using INPROC and otherwise it is better if it is different for TCP connections to avoid thread contention.

Tasks

  • [x] - A description of the PR has been provided, and a diagram included if it is a new feature.
  • [x] - Formatter has been run
  • [ ] - CHANGELOG comment has been added
  • [x] - Labels have been assigned to the pr
  • [x] - A reviwer has been added
  • [x] - A user has been assigned to work on the pr
  • [ ] - If new feature a unit test has been added

Summary by Sourcery

Address memory leaks in the core server and common library by enforcing virtual destructors and scoped ZeroMQ context management, while enhancing the test suite with Protobuf teardown and enabling valgrind-based memory tests via new CMake options.

New Features:

  • Add ENABLE_MEMORY_TESTS option and integrate valgrind-based memory tests in CMake

Bug Fixes:

  • Fix memory leaks by adding virtual destructors to key interfaces and classes
  • Ensure proper cleanup of ZeroMQ contexts to eliminate context leaks

Enhancements:

  • Introduce InprocContext for shared INPROC ZeroMQ context management with reference counting
  • Update ZeroMQCommunicator (and secure variant) and ProxyBasicZMQ to use InprocContext for context lifecycle

Build:

  • Add CMake flags and conditional memory test targets for valgrind leak detection

Tests:

  • Add global BOOST_GLOBAL_FIXTURE for ProtobufLibrary shutdown in multiple unit tests

JoshuaSBrown avatar May 13 '25 18:05 JoshuaSBrown

Reviewer's Guide

This PR refactors ZeroMQ context management by introducing an InprocContext for scheme-based creation and cleanup, adds missing virtual destructors across core interfaces and classes to ensure proper resource release, integrates global Protobuf shutdown fixtures in unit tests to prevent library leaks, updates CMake test configurations to support optional valgrind-based memory tests, and enhances error handling during ZeroMQ socket initialization.

Sequence Diagram: ZeroMQ INPROC Context Lifecycle

sequenceDiagram
    title "Sequence Diagram: ZeroMQ INPROC Context Lifecycle"
    participant Comm as ZeroMQCommunicator
    participant InprocCtx as InprocContext

    Comm->>InprocCtx: getContext()
    activate InprocCtx
    Note over InprocCtx: Returns shared context (creates if null)
    InprocCtx-->>Comm: context_ptr
    deactivate InprocCtx

    Comm->>InprocCtx: increment()
    activate InprocCtx
    Note over InprocCtx: Increments usage counter
    InprocCtx-->>Comm: 
    deactivate InprocCtx

    Note over Comm: Communicator operates...

    Comm->>InprocCtx: decrement()
    activate InprocCtx
    Note over InprocCtx: Decrements usage counter
    InprocCtx-->>Comm: 
    deactivate InprocCtx

    Comm->>InprocCtx: get() (to check counter)
    activate InprocCtx
    InprocCtx-->>Comm: current_counter_value
    deactivate InprocCtx

    alt if counter == 0
        Comm->>InprocCtx: resetContext()
        activate InprocCtx
        Note over InprocCtx: Terminates and nullifies shared context
        InprocCtx-->>Comm: status_code
        deactivate InprocCtx
    end

Sequence Diagram: ZeroMQ Non-INPROC Context Lifecycle

sequenceDiagram
    title "Sequence Diagram: ZeroMQ Non-INPROC Context Lifecycle"
    participant Comm as ZeroMQCommunicator
    participant ZMQ_Lib as ZeroMQ Library

    Note over Comm: Constructor for non-INPROC socket
    Comm->>ZMQ_Lib: zmq_ctx_new()
    activate ZMQ_Lib
    ZMQ_Lib-->>Comm: new_context_ptr
    deactivate ZMQ_Lib

    Note over Comm: Communicator operates with its own context...

    Note over Comm: Destructor
    Comm->>ZMQ_Lib: zmq_ctx_term(context_ptr)
    activate ZMQ_Lib
    ZMQ_Lib-->>Comm: status_code
    deactivate ZMQ_Lib

File-Level Changes

Change Details Files
Scheme-based ZeroMQ context lifecycle management
  • Introduce InprocContext with atomic counter and context creation/termination methods
  • Use InprocContext for INPROC sockets and zmq_ctx_new() for other schemes in communicators
  • Replace getContext() calls in ProxyBasicZMQ with InprocContext::getContext()
  • Handle context termination in communicator destructors based on counter or direct term()
common/source/support/zeromq/Context.hpp
common/source/communicators/ZeroMQCommunicator.cpp
common/source/communicators/ZeroMQCommunicatorSecure.cpp
common/source/servers/ProxyBasicZMQ.cpp
Add missing virtual destructors to polymorphic classes
  • Declare virtual ~...() in ISocket, ICredentials, IMessage, IOperator, IServer
  • Add destructors to concrete classes: ZeroMQSocket, GoogleProtoMessage, ZeroMQSocketCredentials
  • Extend destructors to operators and servers: AuthenticationOperator, RouterBookKeepingOperator, ZeroMQCommunicatorSecure, Proxy, ProxyBasicZMQ
common/include/common/ISocket.hpp
common/source/sockets/ZeroMQSocket.hpp
common/include/common/ICredentials.hpp
common/include/common/IMessage.hpp
common/include/common/IOperator.hpp
common/include/common/IServer.hpp
common/source/credentials/ZeroMQSocketCredentials.hpp
common/source/messages/GoogleProtoMessage.hpp
common/source/operators/AuthenticationOperator.hpp
common/source/operators/RouterBookKeepingOperator.hpp
common/source/communicators/ZeroMQCommunicatorSecure.hpp
common/source/servers/Proxy.hpp
common/source/servers/ProxyBasicZMQ.hpp
Global Protobuf shutdown fixture in unit tests
  • Define GlobalProtobufTeardown struct calling ShutdownProtobufLibrary
  • Register fixture via BOOST_GLOBAL_FIXTURE in each test file
common/tests/unit/test_Buffer.cpp
common/tests/unit/test_CommunicatorFactory.cpp
common/tests/unit/test_Frame.cpp
common/tests/unit/test_MessageFactory.cpp
common/tests/unit/test_OperatorFactory.cpp
common/tests/unit/test_ProtoBufFactory.cpp
common/tests/unit/test_ProtoBufMap.cpp
common/tests/unit/test_ProxyBasicZMQ.cpp
common/tests/unit/test_Proxy.cpp
core/server/tests/unit/test_AuthMap.cpp
core/server/tests/unit/test_AuthenticationManager.cpp
repository/gridftp/globus5/authz/tests/unit/test_AuthzWorker.cpp
repository/gridftp/globus5/authz/tests/unit/test_getVersion.cpp
Conditional memory testing in CMake
  • Wrap unit and memory test registrations under ENABLE_UNIT_TESTS and ENABLE_MEMORY_TESTS
  • Add valgrind-based add_test commands guarded by ENABLE_MEMORY_TESTS
common/tests/unit/CMakeLists.txt
core/server/tests/unit/CMakeLists.txt
common/tests/CMakeLists.txt
core/server/tests/CMakeLists.txt
CMakeLists.txt
Enhanced error handling for ZeroMQ socket creation
  • Check m_zmq_socket for nullptr after zmq_socket call
  • Retrieve zmq_errno() and error string, then throw EXCEPT_PARAM on failure
common/source/communicators/ZeroMQCommunicatorSecure.cpp

Possibly linked issues

  • #1395: PR addresses memory leaks by adding virtual destructors and ZeroMQ context cleanup.

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an issue from a review comment by replying to it. You can also reply to a review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull request title to generate a title at any time. You can also comment @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in the pull request body to generate a PR summary at any time exactly where you want it. You can also comment @sourcery-ai summary on the pull request to (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the pull request to resolve all Sourcery comments. Useful if you've already addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull request to dismiss all existing Sourcery reviews. Especially useful if you want to start fresh with a new review - don't forget to comment @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

  • Contact our support team for questions or feedback.
  • Visit our documentation for detailed guides and information.
  • Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai[bot] avatar May 13 '25 18:05 sourcery-ai[bot]

@par-hermes format

JoshuaSBrown avatar May 13 '25 18:05 JoshuaSBrown

Observing test failure when same test is run in succession. First time passes every time after that it fails.

 LD_LIBRARY_PATH=/opt/datafed/dependencies/lib /opt/datafed/dependencies/bin/ctest -R mock
Test project /home/brownjs/software/open/datafed/DataFed/build-memory-2
    Start 33: start_mock
1/3 Test #33: start_mock .......................   Passed    2.01 sec
    Start 38: mock_liveness_test_getVersion
2/3 Test #38: mock_liveness_test_getVersion ....   Passed    0.22 sec
    Start 34: stop_mock
3/3 Test #34: stop_mock ........................   Passed    0.00 sec

100% tests passed, 0 tests failed out of 3

Total Test time (real) =   2.24 sec
brownjs@LAP142548:~/software/open/datafed/DataFed/build-memory-2$ LD_LIBRARY_PATH=/opt/datafed/dependencies/lib /opt/datafed/dependencies/bin/ctest -R mock
Test project /home/brownjs/software/open/datafed/DataFed/build-memory-2
    Start 33: start_mock
1/3 Test #33: start_mock .......................   Passed    2.01 sec
    Start 38: mock_liveness_test_getVersion
2/3 Test #38: mock_liveness_test_getVersion ....***Failed    2.02 sec
    Start 34: stop_mock
3/3 Test #34: stop_mock ........................   Passed    0.00 sec

JoshuaSBrown avatar May 15 '25 10:05 JoshuaSBrown

Looks like a timing issue. Running with verbose leads to the tests passing, but running without leads to failure.

/opt/datafed/dependencies/bin/ctest --verbose -R mock # passes
/opt/datafed/dependencies/bin/ctest -R mock # fails

JoshuaSBrown avatar May 15 '25 11:05 JoshuaSBrown

Actually, now I'm seeing it pass even without verbose. This is making testing of this application brittle. Out of 10 test runs 4 are failing. In another run 6/20 are failing with a timeout.

Adding send and receive retries with the same communicators (zeromq sockets) on test don't help seeing 7/20 failures.

Recreating sockets and zeromq sockets on retries seems to improve things seeing 2/20 failure, and 3/20 failure.

Rerunning tests with pause between bringing down server leads to 5/20 failures.

Removing pauses between server start and client send does not change anything 5/20, 3/20/ and 6/20 failures.

Increasing pauses after creation and after shutdown to 5 seconds does not help same ratio of failures.

JoshuaSBrown avatar May 15 '25 11:05 JoshuaSBrown

ZeroMQ errors and message drops were addressed in a separate issue and PR https://github.com/ORNL/DataFed/issues/1437.

They need to be merged in before the CI will pass. ZeroMQ issues have lead to instability in the integration tests.

JoshuaSBrown avatar May 21 '25 18:05 JoshuaSBrown

@par-hermes format

JoshuaSBrown avatar May 22 '25 18:05 JoshuaSBrown