Feature Request: Add Batch Ingestion Endpoint for OpenLineage Events

Open algorithmy1 opened this issue 1 year ago • 2 comments

Currently, the Marquez API for OpenLineage events (/api/v1/lineage) accepts one event per request, as seen in OpenLineageResource.java#L67. While this is suitable for real-time ingestion, it becomes inefficient when we need to ingest multiple events simultaneously.

Use Cases:

  • Database Migration or Restoration: When changing the database or restoring from backups, we may need to re-ingest a large number of events to rebuild the lineage graph.
  • Bulk Event Replay: In scenarios like system recovery or batch processing, ingesting events one by one is not practical.
  • Performance Optimization: Reducing the number of HTTP requests can significantly improve ingestion performance.

Proposal:

  • New Endpoint: Introduce a batch ingestion endpoint (e.g., /api/v1/lineage/batch) that accepts an array of OpenLineage events.
  • Batch Processing: Update the OpenLineageResource class to handle a list of events in a single request.
  • Response Format: Provide a response that indicates the success or failure of each event within the batch.

(Alternatively, the existing /api/v1/lineage endpoint could be updated to accept either a single event or an array of events.)
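
To make the proposed response format concrete, here is a minimal sketch in Python of per-event result reporting. It is not Marquez code: `EventResult`, `ingest_batch`, and `require_event_time` are hypothetical names, and the validator is only a stand-in for whatever the real single-event ingestion path does. The one idea it illustrates is that a failure on one event is recorded and the rest of the batch still runs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class EventResult:
    """Outcome for one event in a batch (hypothetical response shape)."""
    index: int
    success: bool
    error: Optional[str] = None


def ingest_batch(
    events: list[dict[str, Any]],
    ingest_one: Callable[[dict[str, Any]], None],
) -> list[EventResult]:
    """Apply ingest_one to each event, recording success or failure per event."""
    results = []
    for index, event in enumerate(events):
        try:
            ingest_one(event)
            results.append(EventResult(index, True))
        except Exception as exc:  # one bad event must not abort the batch
            results.append(EventResult(index, False, str(exc)))
    return results


def require_event_time(event: dict[str, Any]) -> None:
    """Stand-in for real ingestion: reject events with no eventTime field."""
    if "eventTime" not in event:
        raise ValueError("missing eventTime")


batch = [
    {"eventType": "START", "eventTime": "2024-10-09T22:10:00Z"},
    {"eventType": "COMPLETE"},  # invalid on purpose: no eventTime
]
results = ingest_batch(batch, require_event_time)
for result in results:
    print(result)
```

Reporting results by position in the submitted array lets a client retry only the failed events instead of resending the whole batch. Whether the real endpoint should behave this way, or reject the entire batch on the first error, is a design choice this sketch does not settle.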

Benefits:

  • Efficiency: Streamlines the ingestion process for multiple events.
  • Scalability: Enhances Marquez's ability to handle large-scale data operations.
  • User Convenience: Simplifies workflows that require bulk event ingestion.

algorithmy1, Oct 09 '24 22:10

Thanks for the suggestion, @algorithmy1! We couldn't agree more on the benefits you outlined. The good news is that we've been prototyping such an endpoint for OpenLineage batch events, see v2.LineageResource.collectBatchOf(BatchOfEvents). The endpoint will be available in Marquez 0.51.0.

wslulciuc, Oct 23 '24 23:10

@wslulciuc Hello, may I ask whether the Marquez project is still being actively maintained? There haven't been any updates for a long time.

dpengpeng, Oct 16 '25 12:10