feat(new sink): add Apache Doris sink support

Open bingquanzhao opened this issue 5 months ago • 2 comments

Summary

This PR introduces a new Apache Doris sink for Vector, enabling users to send log data directly to Apache Doris databases using the Stream Load API. The implementation includes:

  • Complete Doris sink implementation with Stream Load API integration
  • Comprehensive configuration options (endpoints, authentication, batching, custom headers)
  • Full documentation generation using CUE
  • Health check functionality with proper error handling
  • Support for Doris-specific Stream Load parameters via custom HTTP headers

Apache Doris is a modern MPP analytical database that provides sub-second query response times on large datasets, making it ideal for real-time data warehouses and log analysis scenarios.

Change Type

  • [x] New feature
  • [ ] Bug fix
  • [ ] Non-functional (chore, refactoring, docs)
  • [ ] Performance

Is this a breaking change?

  • [ ] Yes
  • [x] No

How did you test this PR?

Local Testing

  1. Unit Tests: All unit tests pass with cargo test
  2. Configuration Validation: Verified config parsing with vector validate
  3. Documentation Generation: Successfully generated docs with make generate-component-docs
  4. CUE Validation: All CUE files pass format and validation checks
  5. Changelog Validation: Changelog fragment passes validation with ./scripts/check_changelog_fragments.sh

Test Configuration Used

sources:
  demo:
    type: demo_logs
    format: json
    interval: 1

sinks:
  doris:
    type: doris
    inputs: ["demo"]
    
    # Target configuration
    endpoints: 
      - "http://doris-fe1:8030"
      - "http://doris-fe2:8030"
    database: "analytics_db"
    table: "user_events"
    
    # Authentication configuration
    auth:
      strategy: basic
      user: "admin"
      password: "admin123"
    
    # Batch configuration
    batch:
      max_events: 100000        # Maximum events per batch
      timeout_secs: 30          # Batch timeout in seconds
      max_bytes: 1073741824     # Maximum bytes per batch (1 GiB)
    
    # Custom HTTP headers for Doris Stream Load
    headers:
      format: "json"
      strip_outer_array: "false"
      read_json_by_line: "true"
    
    # Additional configuration
    label_prefix: "vector"
    log_request: true
    log_progress_interval: 10
    buffer_bound: 1
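
As a rough illustration of how the configuration above could map onto Stream Load requests, the sketch below builds one request URL per configured endpoint and collects the custom headers. It assumes the path layout described in the Doris Stream Load documentation (`/api/{db}/{table}/_stream_load`); the function names `stream_load_url` and `stream_load_headers` are hypothetical and are not taken from this PR's code.

```rust
/// Build the Stream Load URL for one frontend endpoint.
/// Assumption: the documented Doris path layout /api/{db}/{table}/_stream_load.
fn stream_load_url(endpoint: &str, database: &str, table: &str) -> String {
    format!(
        "{}/api/{}/{}/_stream_load",
        endpoint.trim_end_matches('/'), // tolerate a trailing slash in the config
        database,
        table
    )
}

/// Hypothetical helper: copy the `headers` entries from the sink config into
/// owned (name, value) pairs that would be attached to the HTTP PUT request.
fn stream_load_headers(custom: &[(&str, &str)]) -> Vec<(String, String)> {
    custom
        .iter()
        .map(|(name, value)| (name.to_string(), value.to_string()))
        .collect()
}

fn main() {
    // Values copied from the test configuration in this PR description.
    let endpoints = ["http://doris-fe1:8030", "http://doris-fe2:8030"];
    let headers = stream_load_headers(&[
        ("format", "json"),
        ("strip_outer_array", "false"),
        ("read_json_by_line", "true"),
    ]);

    for endpoint in &endpoints {
        println!("PUT {}", stream_load_url(endpoint, "analytics_db", "user_events"));
        for (name, value) in &headers {
            println!("  {}: {}", name, value);
        }
    }
}
```

How the sink actually chooses between the two endpoints is not described here, so the loop simply prints the request target for each one.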

Environment Setup

  • Tested configuration validation against Vector's validation system
  • Verified health check functionality (attempts connection to configured endpoints)
  • All documentation generation and validation checks pass
  • CUE v0.7.0 used for documentation generation

Does this PR include user facing changes?

  • [x] Yes. Please add a changelog fragment based on our guidelines.
  • [ ] No. A maintainer will apply the "no-changelog" label to this PR.

Notes

Implementation Details

  • Stream Load API: Uses Doris's native Stream Load API for optimal performance and compatibility
  • Authentication: Supports basic authentication with username/password
  • Batching: Configurable batching with event count, byte size, and timeout limits
  • Custom Headers: Support for Doris-specific Stream Load parameters via HTTP headers including:
    • format: Data format specification (json, csv, etc.)
    • read_json_by_line: JSON line-by-line reading mode
    • strip_outer_array: Array handling configuration
    • columns: Column mapping specification
  • Error Handling: Comprehensive error handling with configurable retry logic
  • Health Checks: Validates connectivity and basic authentication
  • Rate Limiting: Built-in rate limiting and adaptive concurrency control
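
To make the batching bullet concrete, here is a minimal sketch of a flush decision driven by the three limits (event count, byte size, timeout). It is an assumption-laden stand-in, not Vector's batcher and not this PR's code: `BatchLimits`, `Batch`, and their methods are hypothetical names, and the body is joined with newlines only because the test configuration sets `read_json_by_line: "true"`.

```rust
use std::time::{Duration, Instant};

/// Limits mirroring the `batch` options (max_events, max_bytes, timeout_secs).
struct BatchLimits {
    max_events: usize,
    max_bytes: usize,
    timeout: Duration,
}

/// Simplified in-memory batch of already-encoded JSON events (hypothetical type).
struct Batch {
    events: Vec<String>,
    bytes: usize,
    opened_at: Instant,
}

impl Batch {
    fn new(now: Instant) -> Self {
        Batch { events: Vec::new(), bytes: 0, opened_at: now }
    }

    fn push(&mut self, event: String) {
        // Count one separator byte per event; a slight over-estimate of the body size.
        self.bytes += event.len() + 1;
        self.events.push(event);
    }

    /// Flush when any limit is reached; an empty batch is never flushed.
    fn should_flush(&self, limits: &BatchLimits, now: Instant) -> bool {
        if self.events.is_empty() {
            return false;
        }
        self.events.len() >= limits.max_events
            || self.bytes >= limits.max_bytes
            || now.duration_since(self.opened_at) >= limits.timeout
    }

    /// Newline-delimited JSON body, one event per line.
    fn into_body(self) -> String {
        self.events.join("\n")
    }
}

fn main() {
    let limits = BatchLimits {
        max_events: 3,
        max_bytes: 1024,
        timeout: Duration::from_secs(30),
    };
    let start = Instant::now();
    let mut batch = Batch::new(start);
    for i in 0..3 {
        batch.push(format!("{{\"id\":{}}}", i));
    }
    if batch.should_flush(&limits, start) {
        println!("{}", batch.into_body());
    }
}
```

Passing `now` in explicitly keeps the timeout branch easy to exercise without sleeping.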

Documentation

  • Added complete CUE documentation for the sink configuration
  • Generated reference documentation automatically using Vector's documentation system
  • Updated service definitions and URL references
  • All documentation validation checks pass (CI=true make check-docs)

Dependencies

  • No new external dependencies added
  • Uses existing Vector HTTP client infrastructure
  • Leverages standard Vector authentication, batching, and request frameworks
  • Follows Vector's established patterns for sink implementation

Code Quality

  • All code formatted with cargo fmt
  • Follows Vector's coding standards and patterns
  • Proper error handling and logging throughout
  • Comprehensive configuration validation

Testing Strategy

  • Configuration validation ensures all options are properly parsed
  • Health check functionality verified through connection attempts
  • Documentation generation confirms all metadata is correctly defined
  • Follows Vector's established testing patterns for sinks

References

  • Apache Doris Stream Load Documentation: https://doris.apache.org/docs/data-operate/import/import-way/stream-load-manual
  • Apache Doris Official Website: https://doris.apache.org
  • Vector Sink Development Guidelines: https://vector.dev/docs/reference/configuration/sinks/

May 28 '25 16:05 bingquanzhao

CLA assistant check
All committers have signed the CLA.

May 28 '25 16:05 bits-bot

Created Jira card for Docs Team review.

May 28 '25 19:05 drichards-87

Hi @bingquanzhao, thank you for this PR. Please rebase on master and fix merge conflicts. There are 12k affected lines right now.

Jun 25 '25 17:06 pront

Thanks for the reminder. I have rebased on master.

Jul 09 '25 10:07 bingquanzhao

Still 12k lines added. Find the offending commit and revert it, or git checkout the affected files from master.

Jul 09 '25 13:07 thomasqueirozb

Thank you for your help. I think I previously ran an incorrect command when generating the documentation, which modified source files in other parts of the repository.

Jul 11 '25 03:07 bingquanzhao

Is there anything else that needs to be adjusted? Please take a look.

Aug 25 '25 02:08 bingquanzhao

When will this be available for use?

Sep 09 '25 09:09 hanshuishi

Please merge / resolve conflicts with origin/master. The new make fmt should also update the formatting. In the meantime, we will review this PR soon.

Sep 09 '25 19:09 pront

I've executed make fmt, and thanks a lot for merging master into this branch!

Sep 11 '25 03:09 bingquanzhao

When packaging after merging, the following changes are needed in src/sinks/doris/retry.rs:

  1. use crate::sinks::util::http::HttpRequest;
  2. use crate::sinks::doris::sink::DorisPartitionKey;
  3. type Request = HttpRequest<DorisPartitionKey>;
  4. fn should_retry_response(&self, response: &Self::Response) -> RetryAction<Self::Request>

Sep 11 '25 03:09 hanshuishi

Hello, I will review this tomorrow but due to the PR size, it's unlikely that we can merge it before the merge window closes.

Sep 16 '25 20:09 pront

Thanks. What can I do if the PR has not been reviewed by the time the merge window closes?

Oct 10 '25 02:10 bingquanzhao

Hi @bingquanzhao, apologies for the delay. Several other issues came up. We will do our best to review this soon and include it in the next release. Stay tuned for any review comments. Thanks!

Oct 10 '25 13:10 pront

Hi @bingquanzhao, thank you for this PR. I did a review focusing on the config UX first.

Hi @pront, your review comments were very helpful. I've made some code changes; please take another look.

Oct 30 '25 02:10 bingquanzhao