ch-tools icon indicating copy to clipboard operation
ch-tools copied to clipboard

Actual listing table for S3 orphaned objects monitoring

Open NikitaUnisikhin opened this issue 7 months ago • 1 comments

Summary by Sourcery

Implement persistent listing table management for S3 orphaned objects cleanup using Zookeeper and ClickHouse remote table functions.

New Features:

  • Track listing table ownership in Zookeeper under object-storage-cleanup-state/listing-table-info
  • Query the listing table across replicas via ClickHouse remote() or remoteSecure() engine

Enhancements:

  • Propagate ClickHouse client credentials through cleanup and insert functions
  • Wrap create, drop, and insert table queries in Query objects with sensitive_args to handle passwords securely
  • Automatically clean up Zookeeper nodes when dropping the listing table

Chores:

  • Introduce has_zk helper in Zookeeper utils based on replica count
  • Add a version.txt placeholder file

NikitaUnisikhin avatar May 29 '25 09:05 NikitaUnisikhin

Reviewer's Guide

This PR implements a persistent ClickHouse MergeTree table for S3 object listing with secure credential propagation, coordinates table ownership across replicas using ZooKeeper, and dynamically selects remote engine based on port configuration.

Sequence Diagram for Conditional Final Drop of Listing Table with ZooKeeper

sequenceDiagram
    actor UserScript as "User/Script"
    participant CurrentCH as "Current CH Node"
    participant ZooKeeper

    Note over UserScript, CurrentCH: In 'finally' block of clean() function
    opt Not keep_paths (Cleanup requested)
        CurrentCH->>CurrentCH: effective_listing_table // This is the table name string (local or remote) defined earlier in clean()
        CurrentCH->>CurrentCH: Call _drop_table(effective_listing_table, user_password) // May target a remote owner

        alt has_zk() is true AND check_zk_node(ZK_PATH_LISTING_TABLE_INFO) is true
            CurrentCH->>ZooKeeper: delete_zk_node(ZK_PATH_LISTING_TABLE_INFO)
        end
    end

Entity Relationship Diagram for S3 Object Listing Storage

erDiagram
    listing_table {
        String obj_path PK "S3 Object Path"
        UInt64 obj_size "Object Size (bytes)"
        String storage_policy "ClickHouse storage policy for this table's data"
    }
    note "ClickHouse MergeTree table storing S3 object paths and sizes. Ordered by obj_path." for listing_table

    zk_listing_table_info {
        String replica_owner PK "Hostname of CH replica owning the listing_table"
    }
    note "Data in ZooKeeper node at ZK_PATH_LISTING_TABLE_INFO. Coordinates ownership of the 'listing_table' in a multi-replica setup." for zk_listing_table_info

Class Diagram of Changes in clean_object_storage.py and Related Components

classDiagram
    class clean_object_storage_module {
      <<Python Module>>
      +ZK_PATH_LISTING_TABLE_INFO : string
      +clean(ctx, use_saved_list, keep_paths, ...)
      -_clean_object_storage(ctx, ch_client, user_name, user_password, ...)
      -_traverse_object_storage(ctx, listing_table, user_password, ...)
      -_insert_listing_batch(ctx, obj_paths_batch, listing_table, user_password)
      -_drop_table(ctx, table_name, user_password)
    }
    note for clean_object_storage_module "Functions updated to use Query objects for secure credential handling and ZooKeeper for multi-replica coordination."

    class Query {
      <<Value Object>>
      +statement : string
      +sensitive_args : map
      +Query(statement: string, sensitive_args: map)
    }
    note for Query "New abstraction for SQL queries, separating sensitive arguments like passwords."

    class ch_tools.common.clickhouse.client {
      <<Utility Module>>
      +execute_query(ctx, query_or_string, sensitive_args, format_): any
    }
    note for ch_tools.common.clickhouse.client "execute_query is used to run SQL, now handles Query objects."

    class ch_tools.chadmin.internal.zookeeper {
      <<Utility Module>>
      +has_zk(ctx) : bool
      +check_zk_node(ctx, path) : bool
      +create_zk_nodes(ctx, paths, make_parents)
      +update_zk_nodes(ctx, path, data_bytes)
      +get_zk_node(ctx, path) : bytes
      +delete_zk_node(ctx, path)
    }
    note for ch_tools.chadmin.internal.zookeeper "Provides ZooKeeper interaction helpers, including new has_zk function."

    class ch_tools.common.clickhouse.client.ClickhouseClient {
        +user: string
        +password: string
        +host: string
        +check_port(ClickhousePort): bool
    }
    note for ch_tools.common.clickhouse.client.ClickhouseClient "Used to fetch credentials, host, and check port for remoteSecure selection."

    clean_object_storage_module ..> Query : creates >
    clean_object_storage_module ..> ch_tools.common.clickhouse.client : uses execute_query >
    clean_object_storage_module ..> ch_tools.chadmin.internal.zookeeper : uses ZK functions >
    clean_object_storage_module ..> ch_tools.common.clickhouse.client.ClickhouseClient : uses >

File-Level Changes

Change Details Files
Securely manage listing table lifecycle using Query abstraction
  • Wrap CREATE, INSERT, and DROP statements in Query with sensitive_args
  • Propagate user_password into create, insert, and drop functions
  • Update _drop_table signature to accept user_password
ch_tools/common/commands/clean_object_storage.py
Coordinate listing table ownership across replicas with ZooKeeper
  • Define ZK_PATH_LISTING_TABLE_INFO constant
  • Detect multi-replica clusters and create/check zk node
  • Store and retrieve replica_owner in ZooKeeper
  • Delete zk node when dropping the listing table
ch_tools/common/commands/clean_object_storage.py
ch_tools/chadmin/internal/zookeeper.py
Automatically select appropriate remote engine based on port
  • Check for secure or non-secure TCP port to set remote_clause
  • Raise error if neither port is defined
  • Use remote_clause when referencing listing_table in multi-replica mode
ch_tools/common/commands/clean_object_storage.py
Add placeholder version file
  • Introduce version.txt in repo root
version.txt

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an issue from a review comment by replying to it. You can also reply to a review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull request title to generate a title at any time. You can also comment @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in the pull request body to generate a PR summary at any time exactly where you want it. You can also comment @sourcery-ai summary on the pull request to (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the pull request to resolve all Sourcery comments. Useful if you've already addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull request to dismiss all existing Sourcery reviews. Especially useful if you want to start fresh with a new review - don't forget to comment @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

  • Contact our support team for questions or feedback.
  • Visit our documentation for detailed guides and information.
  • Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai[bot] avatar May 29 '25 09:05 sourcery-ai[bot]