feat(new sink): add Apache Doris sink support
Summary
This PR introduces a new Apache Doris sink for Vector, enabling users to send log data directly to Apache Doris databases using the Stream Load API. The implementation includes:
- Complete Doris sink implementation with Stream Load API integration
- Comprehensive configuration options (endpoints, authentication, batching, custom headers)
- Full documentation generation using CUE
- Health check functionality with proper error handling
- Support for Doris-specific Stream Load parameters via custom HTTP headers
Apache Doris is a modern MPP analytical database that provides sub-second query response times on large datasets, making it ideal for real-time data warehouses and log analysis scenarios.
Change Type
- [x] New feature
- [ ] Bug fix
- [ ] Non-functional (chore, refactoring, docs)
- [ ] Performance
Is this a breaking change?
- [ ] Yes
- [x] No
How did you test this PR?
Local Testing
- Unit Tests: All unit tests pass with cargo test
- Configuration Validation: Verified config parsing with vector validate
- Documentation Generation: Successfully generated docs with make generate-component-docs
- CUE Validation: All CUE files pass format and validation checks
- Changelog Validation: Changelog fragment passes validation with ./scripts/check_changelog_fragments.sh
Test Configuration Used
sources:
  demo:
    type: demo_logs
    format: json
    interval: 1
sinks:
  doris:
    type: doris
    inputs: ["demo"]
    # Target configuration
    endpoints:
      - "http://doris-fe1:8030"
      - "http://doris-fe2:8030"
    database: "analytics_db"
    table: "user_events"
    # Authentication configuration
    auth:
      strategy: basic
      user: "admin"
      password: "admin123"
    # Batch configuration
    batch:
      max_events: 100000 # Maximum events per batch
      timeout_secs: 30 # Batch timeout in seconds
      max_bytes: 1073741824 # Maximum bytes per batch (1GB)
    # Custom HTTP headers for Doris Stream Load
    headers:
      format: "json"
      strip_outer_array: "false"
      read_json_by_line: "true"
    # Additional configuration
    label_prefix: "vector"
    log_request: true
    log_progress_interval: 10
    buffer_bound: 1
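For reviewers unfamiliar with Stream Load, here is a rough sketch of how the configuration above could map onto a single request. The /api/{db}/{table}/_stream_load path and the label header come from the Doris Stream Load documentation. The StreamLoadTarget type, the build_headers function, and the label layout are hypothetical names chosen for illustration and are not taken from this PR's code.

```rust
/// Hypothetical description of one Stream Load destination (not the sink's real type).
struct StreamLoadTarget<'a> {
    endpoint: &'a str,
    database: &'a str,
    table: &'a str,
}

impl StreamLoadTarget<'_> {
    /// Doris accepts Stream Load as an HTTP PUT to /api/{db}/{table}/_stream_load.
    fn url(&self) -> String {
        format!(
            "{}/api/{}/{}/_stream_load",
            self.endpoint.trim_end_matches('/'),
            self.database,
            self.table
        )
    }
}

/// Copies the user-supplied headers and appends a load label.
/// The label layout (prefix_db_table_sequence) is an assumption for illustration.
fn build_headers(
    target: &StreamLoadTarget<'_>,
    label_prefix: &str,
    sequence: u64,
    custom: &[(&str, &str)],
) -> Vec<(String, String)> {
    let mut headers: Vec<(String, String)> = custom
        .iter()
        .map(|(name, value)| (name.to_string(), value.to_string()))
        .collect();
    headers.push((
        "label".to_string(),
        format!(
            "{}_{}_{}_{}",
            label_prefix, target.database, target.table, sequence
        ),
    ));
    headers
}

fn main() {
    let target = StreamLoadTarget {
        endpoint: "http://doris-fe1:8030",
        database: "analytics_db",
        table: "user_events",
    };
    // Same custom headers as in the test configuration.
    let custom = [
        ("format", "json"),
        ("strip_outer_array", "false"),
        ("read_json_by_line", "true"),
    ];
    println!("PUT {}", target.url());
    for (name, value) in build_headers(&target, "vector", 1, &custom) {
        println!("{name}: {value}");
    }
}
```

Basic authentication and endpoint selection are left out to keep the sketch short; the real sink handles both through Vector's existing HTTP infrastructure.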
Environment Setup
- Tested configuration validation against Vector's validation system
- Verified health check functionality (attempts connection to configured endpoints)
- All documentation generation and validation checks pass
- CUE v0.7.0 used for documentation generation
Does this PR include user facing changes?
- [x] Yes. Please add a changelog fragment based on our guidelines.
- [ ] No. A maintainer will apply the "no-changelog" label to this PR.
Notes
Implementation Details
- Stream Load API: Uses Doris's native Stream Load API for optimal performance and compatibility
- Authentication: Supports basic authentication with username/password
- Batching: Configurable batching with event count, byte size, and timeout limits
- Custom Headers: Support for Doris-specific Stream Load parameters via HTTP headers including:
  - format: Data format specification (json, csv, etc.)
  - read_json_by_line: JSON line-by-line reading mode
  - strip_outer_array: Array handling configuration
  - columns: Column mapping specification
- Error Handling: Comprehensive error handling with configurable retry logic
- Health Checks: Validates connectivity and basic authentication
- Rate Limiting: Built-in rate limiting and adaptive concurrency control
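To illustrate the batching bullet, the following is a minimal sketch of a batch that flushes once any one of the three limits (event count, byte size, timeout) is reached. It is a simplified stand-in, not Vector's batching code: BatchLimits, Batch, and should_flush are hypothetical names, and the byte count ignores the newline delimiters.

```rust
use std::time::{Duration, Instant};

/// Limits mirroring the batch.* options (max_events, max_bytes, timeout_secs).
struct BatchLimits {
    max_events: usize,
    max_bytes: usize,
    timeout: Duration,
}

/// Hypothetical in-memory batch of already-encoded JSON events.
struct Batch {
    events: Vec<String>,
    bytes: usize,
    opened_at: Instant,
}

impl Batch {
    fn new(now: Instant) -> Self {
        Batch { events: Vec::new(), bytes: 0, opened_at: now }
    }

    /// Adds one encoded event and tracks its size (delimiters not counted).
    fn push(&mut self, event: String) {
        self.bytes += event.len();
        self.events.push(event);
    }

    /// A non-empty batch is flushed as soon as any single limit is hit.
    fn should_flush(&self, limits: &BatchLimits, now: Instant) -> bool {
        !self.events.is_empty()
            && (self.events.len() >= limits.max_events
                || self.bytes >= limits.max_bytes
                || now.duration_since(self.opened_at) >= limits.timeout)
    }

    /// Newline-delimited JSON body, the shape read_json_by_line: "true" expects.
    fn body(&self) -> String {
        self.events.join("\n")
    }
}

fn main() {
    let limits = BatchLimits {
        max_events: 2,
        max_bytes: 1024,
        timeout: Duration::from_secs(30),
    };
    let start = Instant::now();
    let mut batch = Batch::new(start);
    batch.push(r#"{"message":"first"}"#.to_string());
    batch.push(r#"{"message":"second"}"#.to_string());
    if batch.should_flush(&limits, Instant::now()) {
        println!("flushing {} bytes:\n{}", batch.bytes, batch.body());
    }
}
```

Passing the current time into should_flush keeps the timeout check deterministic, which makes the limit logic easy to unit test.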
Documentation
- Added complete CUE documentation for the sink configuration
- Generated reference documentation automatically using Vector's documentation system
- Updated service definitions and URL references
- All documentation validation checks pass (CI=true make check-docs)
Dependencies
- No new external dependencies added
- Uses existing Vector HTTP client infrastructure
- Leverages standard Vector authentication, batching, and request frameworks
- Follows Vector's established patterns for sink implementation
Code Quality
- All code formatted with cargo fmt
- Follows Vector's coding standards and patterns
- Proper error handling and logging throughout
- Comprehensive configuration validation
Testing Strategy
- Configuration validation ensures all options are properly parsed
- Health check functionality verified through connection attempts
- Documentation generation confirms all metadata is correctly defined
- Follows Vector's established testing patterns for sinks
References
- Apache Doris Stream Load Documentation: https://doris.apache.org/docs/data-operate/import/import-way/stream-load-manual
- Apache Doris Official Website: https://doris.apache.org
- Vector Sink Development Guidelines: https://vector.dev/docs/reference/configuration/sinks/
Created Jira card for Docs Team review.
Hi @bingquanzhao, thank you for this PR. Please rebase on master and fix merge conflicts. There are 12k affected lines right now.
Thanks for the reminder. I have rebased on master.
Still 12k lines added. Find the offending commit and revert it, or git checkout the affected files from master.
Thank you for your help. I think I previously ran an incorrect command when generating the documentation, which modified source files in other parts of the repository.
Are there any other contents that need to be adjusted? Please help to check.
When will this be available for use?
Please merge / resolve conflicts with origin/master. The new make fmt should also update the formatting. In the meantime, we will review this PR soon.
I've executed make fmt, and thanks a lot for merging master into this branch!
When packaging after merging, the following changes are needed in src/sinks/doris/retry.rs:
1. use crate::sinks::util::http::HttpRequest;
2. use crate::sinks::doris::sink::DorisPartitionKey;
3. type Request = HttpRequest<DorisPartitionKey>;
4. fn should_retry_response(&self, response: &Self::Response) -> RetryAction<Self::Request>
Hello, I will review this tomorrow but due to the PR size, it's unlikely that we can merge it before the merge window closes.
Thanks. What can I do if the PR has not been reviewed yet when it is closed?
Hi @bingquanzhao, apologies for the delay. Several other issues came up. We will do our best to review this soon and include it in the next release. Stay tuned for any review comments. Thanks!
Hi @bingquanzhao, thank you for this PR. I did a review focusing on the config UX first.
Hi @pront, your review comments were very helpful. I've made some code modifications; please continue reviewing.