Add segment id parameter to segment metadata query

Open asdf2014 opened this issue 8 months ago • 1 comments

Description

This issue proposes enhancing SegmentMetadataQuery with a new optional parameter, segmentIds. The parameter lets users query metadata for specific segments directly by segmentId, rather than relying solely on interval-based filtering.

Motivation

This feature will be particularly useful for use cases such as:

  • Debugging or inspecting individual segments;
  • Validating the state of a known segment after ingestion or compaction;
  • Programmatic access in custom tooling where segment IDs are already known.

Proposed Changes

  1. Query Definition Layer
    • Extend SegmentMetadataQuery to include a List<String> segmentIds field.
    • Ensure proper serialization/deserialization with Jackson.
    • Update equality, hashCode, and toString logic accordingly.
  2. Query Runner
    • Modify SegmentMetadataQueryRunner to skip segments whose segmentId is not in the provided list.
  3. Query Planning / Timeline Resolution
    • Update CachingClusteredClient (on the Broker) to support filtering segments by segmentId before dispatching queries.
    • Introduce a utility to map segmentId to SegmentDescriptor, or extend VersionedIntervalTimeline if appropriate.
  4. Backward Compatibility
    • The new parameter will be optional and non-intrusive: if not specified, current behavior is preserved.
  5. Testing
    • Add unit tests for query definition, runner logic, and broker-level filtering behavior.
    • Extend integration tests to cover mixed queries with and without segmentIds.

Impacted Classes

The following classes are expected to be modified as part of this change:

  • org.apache.druid.query.metadata.metadata.SegmentMetadataQuery
  • org.apache.druid.query.metadata.metadata.SegmentMetadataQueryRunner
  • org.apache.druid.client.CachingClusteredClient
  • org.apache.druid.query.SegmentDescriptor
  • org.apache.druid.timeline.VersionedIntervalTimeline (if necessary to locate segments by ID)
  • org.apache.druid.segment.ReferenceCountingSegment (for ID exposure)
  • org.apache.druid.query.QueryToolChest (for caching or context changes)
  • org.apache.druid.query.QueryRunnerTestHelper (for test support)

Example Usage

Query part

{
  "queryType": "segmentMetadata",
  "dataSource": "sample_datasource",
  "segmentIds": [
    "sample_datasource_2025-12-01T00:00:00.000Z_2025-12-02T00:00:00.000Z_2025-12-02T00:00:00.000Z_v1"
  ]
}
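
For custom tooling, a request like the one above could be assembled as in the following minimal Python sketch. Note the assumptions: segmentIds is only the parameter proposed in this issue, so it is not expected to work as a filter against a current Broker; the Broker address http://localhost:8082 is a placeholder; and build_segment_metadata_request is a hypothetical helper name. /druid/v2/ is the Broker's native query endpoint.

```python
import json
import urllib.request
from typing import List, Optional


def build_segment_metadata_request(
    broker_url: str,
    data_source: str,
    segment_ids: Optional[List[str]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a native segmentMetadata query request."""
    query = {"queryType": "segmentMetadata", "dataSource": data_source}
    # Omitting the proposed field leaves the query in today's form.
    if segment_ids is not None:
        query["segmentIds"] = list(segment_ids)  # proposed field, hypothetical
    return urllib.request.Request(
        url=broker_url.rstrip("/") + "/druid/v2/",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_segment_metadata_request(
    "http://localhost:8082",  # assumed Broker address
    "sample_datasource",
    [
        "sample_datasource_2025-12-01T00:00:00.000Z_"
        "2025-12-02T00:00:00.000Z_2025-12-02T00:00:00.000Z_v1"
    ],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it; the sketch stops short of that so it can run without a cluster.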

Response part

[
  {
    "id": "sample_datasource_2025-12-01T00:00:00.000Z_2025-12-02T00:00:00.000Z_2025-12-02T00:00:00.000Z_v1",
    "intervals": ["2025-12-01T00:00:00.000Z/2025-12-02T00:00:00.000Z"],
    "columns": {
      "__time": {
        "type": "LONG",
        "typeSignature": "LONG",
        "hasMultipleValues": false,
        "hasNulls": false,
        "size": 800000,
        "cardinality": null,
        "errorMessage": null
      },
      "user_id": {
        "type": "STRING",
        "typeSignature": "STRING",
        "hasMultipleValues": false,
        "hasNulls": false,
        "size": 2000000,
        "cardinality": 135000,
        "errorMessage": null
      },
      "event_type": {
        "type": "STRING",
        "typeSignature": "STRING",
        "hasMultipleValues": false,
        "hasNulls": true,
        "size": 500000,
        "cardinality": 25,
        "errorMessage": null
      },
      "metric_clicks": {
        "type": "FLOAT",
        "typeSignature": "FLOAT",
        "hasMultipleValues": false,
        "hasNulls": false,
        "size": 1000000,
        "cardinality": null,
        "errorMessage": null
      }
    },
    "aggregators": {
      "metric_clicks": {
        "type": "floatSum",
        "name": "metric_clicks",
        "fieldName": "metric_clicks"
      }
    },
    "queryGranularity": {
      "type": "minute"
    },
    "size": 4500000,
    "numRows": 1000000,
    "rollup": false
  }
]

Testing

Unit Tests

  • Add tests in SegmentMetadataQueryTest to validate correct behavior when segmentIds is provided or omitted.
  • Extend SegmentMetadataQueryRunnerTest to ensure only the specified segments are queried.
  • Add test coverage for edge cases, such as empty or non-existent segmentIds.

Integration Tests

  • Update or extend ITSegmentMetadataTest to include scenarios using the new segmentIds parameter.
  • Add new tests that:
    • Query metadata for a single known segment.
    • Query with multiple segment IDs across intervals.
    • Query with a mix of valid and invalid segment IDs (expect partial results or error handling).
    • Validate compatibility with existing query parameters (e.g., toInclude, merge).
  • Verify that the query returns accurate and expected results without performance regressions.

Alternatives Considered

We also considered performing this filtering on the client side, but that requires querying irrelevant segments unnecessarily, which is inefficient for large datasources. Implementing it natively at the Broker and QueryRunner layers is more scalable and consistent.

Backward Compatibility

The segmentIds parameter will be optional and will not break any existing functionality. If segmentIds is not provided in the query, the current behavior based on interval filtering will remain unchanged.

However, we recognize that this new feature might require certain modifications in existing systems or tooling, especially for users who rely on interval-based querying for segment metadata. To mitigate any potential compatibility issues:

  1. Query Compatibility:

    • If segmentIds is used alongside intervals, the query will return metadata only for segments whose segmentId matches the provided list, within the specified interval.
    • If no segmentIds are provided, the system will continue to use the interval-based filtering mechanism, ensuring seamless backward compatibility.
  2. Documentation and Communication:

    • Documentation will be updated to highlight this new optional parameter, with examples for both use cases, one with and one without the segmentIds parameter.
    • Users who have been using segment metadata queries with interval-based filtering will not experience any changes unless they explicitly choose to use the segmentIds parameter.
  3. Feature Flagging:

    • To ensure smooth rollout, this feature could be initially introduced behind a feature flag, allowing users to opt-in and test the new functionality before enabling it fully in production environments.
  4. Fallback Mechanism:

    • If a segmentId does not exist (e.g., due to a typo or missing segment), the query will gracefully handle the error, either by returning an empty result for the invalid segmentId or providing an appropriate error message, depending on the desired behavior.
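
The selection rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the proposed implementation: Segment and select_segments are hypothetical names, the segment IDs are short placeholders, an empty segmentIds list is assumed to match nothing (the proposal leaves that case open), and unknown IDs are returned separately so the caller can choose between an empty result and an error.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a segment known to the Broker."""
    segment_id: str
    start: str  # ISO-8601 UTC, inclusive
    end: str    # ISO-8601 UTC, exclusive


def select_segments(
    known: Sequence[Segment],
    interval: Optional[Tuple[str, str]] = None,
    segment_ids: Optional[Sequence[str]] = None,
) -> Tuple[List[Segment], List[str]]:
    """Return (selected segments, requested IDs that do not exist)."""
    selected = list(known)
    if interval is not None:
        lo, hi = interval
        # Same-format UTC timestamps compare correctly as plain strings.
        selected = [s for s in selected if s.start < hi and lo < s.end]
    missing: List[str] = []
    if segment_ids is not None:
        wanted = set(segment_ids)
        # With both filters present, a segment must pass both.
        selected = [s for s in selected if s.segment_id in wanted]
        missing = sorted(wanted - {s.segment_id for s in known})
    return selected, missing


segments = [
    Segment("ds_a", "2025-12-01T00:00:00.000Z", "2025-12-02T00:00:00.000Z"),
    Segment("ds_b", "2025-12-02T00:00:00.000Z", "2025-12-03T00:00:00.000Z"),
]
december_first = ("2025-12-01T00:00:00.000Z", "2025-12-02T00:00:00.000Z")
print(select_segments(segments, december_first, ["ds_a", "ds_b", "ds_typo"]))
```

Leaving segment_ids as None reproduces interval-only filtering, which is the backward-compatible path.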

By implementing this optional parameter in a non-intrusive manner, the overall system remains compatible with existing workloads and users are given the flexibility to adopt the new feature at their discretion.

asdf2014, May 01 '25 12:05

It is actually possible to do this today, although the way to do it is not documented. Maybe we should document it? It involves placing a list of segments in the intervals field of a query. This feature exists because when the Broker passes your query down to Historicals, it replaces the intervals with the list of segments that the specific Historical should be querying.

It looks like:

{
  "queryType": "segmentMetadata",
  "dataSource": "sample_datasource",
  "intervals": {
    "type": "segments",
    "segments": [
      {
        "itvl": "2025-12-01T00:00:00.000Z/2025-12-02T00:00:00.000Z",
        "ver": "2025-12-02T00:00:00.000Z",
        "part": 1
      }
    ]
  }
}

It is indeed possible to do this with any query type.
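
For tooling that starts from a segment ID string, the descriptor list above could be derived roughly as in this Python sketch. It assumes the ID has the form dataSource_start_end_version with an optional trailing _partitionNum (partition 0 when omitted), and that the version contains no underscore; versions with underscores would need more careful parsing. The helper names are hypothetical, and the example ID ends in _1 so that it lines up with "part": 1 in the query above.

```python
import json
from typing import Dict, List


def segment_id_to_descriptor(data_source: str, segment_id: str) -> Dict[str, object]:
    """Turn a segment ID into an {"itvl", "ver", "part"} descriptor."""
    prefix = data_source + "_"
    if not segment_id.startswith(prefix):
        raise ValueError(f"{segment_id!r} does not belong to {data_source!r}")
    # The datasource is stripped first because it may itself contain "_".
    parts = segment_id[len(prefix):].split("_")
    if len(parts) == 3:
        start, end, version = parts
        partition = 0  # assumed default when the suffix is absent
    elif len(parts) == 4:
        start, end, version = parts[:3]
        partition = int(parts[3])
    else:
        raise ValueError(f"cannot parse segment ID {segment_id!r}")
    return {"itvl": f"{start}/{end}", "ver": version, "part": partition}


def build_segments_query(data_source: str, segment_ids: List[str]) -> Dict[str, object]:
    """Build a segmentMetadata query that targets specific segments."""
    return {
        "queryType": "segmentMetadata",
        "dataSource": data_source,
        "intervals": {
            "type": "segments",
            "segments": [segment_id_to_descriptor(data_source, s) for s in segment_ids],
        },
    }


example_id = (
    "sample_datasource_2025-12-01T00:00:00.000Z_"
    "2025-12-02T00:00:00.000Z_2025-12-02T00:00:00.000Z_1"
)
print(json.dumps(build_segments_query("sample_datasource", [example_id]), indent=2))
```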

gianm, May 08 '25 06:05