
Add support for Databend sink

everpcpc opened this issue 2 years ago • 6 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

Sink logs to Databend.

Attempted Solutions

The HTTP sink can do the trick, but it's too complicated.

Proposal

We need to make some further improvements for the cloud database, such as using a presigned URL for uploads to save most of the transfer fees.

References

  • https://github.com/datafuselabs/databend

Version

0.26.0

everpcpc avatar Dec 26 '22 02:12 everpcpc

FWIW, Databend seems to suggest using the clickhouse sink - https://databend.rs/doc/integrations/data-tool/vector#configure-vector

We have been told about some performance limitations of that sink as-is, so it's worth investigating what a native sink may look like.

spencergilbert avatar Jan 04 '23 18:01 spencergilbert

We intend to build a 3-step batch sink:

  1. Get a presigned URL for object storage before each batch with the query:
    PRESIGN UPLOAD @stage_name/stage_path;
    
  2. Format the data as CSV/NDJSON (or another format) and upload it directly to object storage with the presigned URL.
  3. Insert the file uploaded in the previous step, using a stage attachment:
    https://databend.rs/doc/sql-commands/dml/dml-insert#insert-with-stage-attachment (maybe this step can be replaced by just triggering a prebuilt insert pipeline or background worker)
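
For concreteness, here is a rough Rust sketch of that 3-step flow using reqwest (with the "blocking" and "json" features) and serde_json. This is not the actual sink implementation: the HTTP query endpoint, the authentication, the PRESIGN result column order, and the stage_attachment payload shape are assumptions based on the Databend docs linked above.

    // Sketch only: endpoint, auth, and field names are assumptions,
    // not the sink's real configuration.
    use serde_json::{json, Value};

    const ENDPOINT: &str = "http://localhost:8000/v1/query"; // assumed Databend HTTP query endpoint

    fn query(client: &reqwest::blocking::Client, body: Value) -> Result<Value, reqwest::Error> {
        client
            .post(ENDPOINT)
            .basic_auth("root", Some(""))
            .json(&body)
            .send()?
            .json()
    }

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let client = reqwest::blocking::Client::new();

        // Step 1: get a presigned upload URL for this batch.
        let presign = query(
            &client,
            json!({ "sql": "PRESIGN UPLOAD @vector_stage/batch-0001.ndjson" }),
        )?;
        // Assumed result row layout: (method, headers, url).
        let url = presign["data"][0][2].as_str().ok_or("no presigned url")?;

        // Step 2: encode the batch (NDJSON here) and PUT it straight to object
        // storage; the event payload never transits the database on this hop.
        let batch = r#"{"ts":"2023-01-05T07:00:00Z","message":"hello"}"#;
        client.put(url).body(batch.to_owned()).send()?.error_for_status()?;

        // Step 3: insert with a stage attachment pointing at the uploaded file.
        query(
            &client,
            json!({
                "sql": "INSERT INTO logs (ts, message) VALUES",
                "stage_attachment": { "location": "@vector_stage/batch-0001.ndjson" }
            }),
        )?;

        Ok(())
    }

The point of this shape is that only the two small control queries (the PRESIGN and the insert) go to the database; the bulk of the batch travels on the object-storage path.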

BTW, a direct insert mode would also be supported, configurable via settings.

everpcpc avatar Jan 05 '23 07:01 everpcpc

Haha, I should have checked to see if you were part of @datafuselabs/databend before responding 😆

I'm definitely not familiar enough with the application to have strong opinions on how to implement it - I expect we'd lean on y'all's expertise for what's the most performant/reliable/has-whatever-features-we-need and go that way.

Since object storage would be involved with the 3-step, it could be an opportunity to spike/explore using OpenDAL (https://github.com/vectordotdev/vector/issues/15715)

spencergilbert avatar Jan 05 '23 14:01 spencergilbert

Actually, object storage is not directly involved on the Vector side of the 3-step flow. We get a presigned URL in step 1, which is generated by OpenDAL inside Databend, and that makes step 2 just an HTTP upload. That could be much faster than inserting into the database, and there is no need for OpenDAL on the Vector side.

ref: https://docs.aws.amazon.com/AmazonS3/latest/userguide/PresignedUrlUploadObject.html
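
To make that concrete: the upload step needs nothing beyond a generic HTTP client, since the signature travels in the presigned URL (plus any headers returned by PRESIGN). A minimal sketch, assuming the headers are passed around as a plain string map:

    // Minimal sketch: PUT a payload to a presigned URL, replaying the headers
    // returned by PRESIGN UPLOAD. No cloud credentials or SDK are needed on the
    // Vector side; the URL's query string already carries the signature.
    use std::collections::HashMap;

    fn upload_presigned(
        client: &reqwest::blocking::Client,
        url: &str,
        headers: &HashMap<String, String>, // the headers column from PRESIGN UPLOAD
        payload: Vec<u8>,
    ) -> reqwest::Result<()> {
        let mut req = client.put(url).body(payload);
        for (name, value) in headers {
            req = req.header(name.as_str(), value.as_str());
        }
        req.send()?.error_for_status()?;
        Ok(())
    }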

everpcpc avatar Jan 06 '23 01:01 everpcpc

Oh, I see - that seems handy!

I saw that https://github.com/datafuselabs/databend/issues/9448 included "Datadog Vector integrate with Rust-driver". Is this issue/work something Databend is considering contributing?

spencergilbert avatar Jan 09 '23 16:01 spencergilbert

Yes, I'm working on this these days.

everpcpc avatar Jan 10 '23 05:01 everpcpc

Hi @everpcpc! We've been taking a look at this request, and we're wondering if you could provide us with more details about the issues you're facing with Vector that require a new sink. Mainly, it would be great to have some background on what makes the HTTP or Clickhouse workarounds too complicated, as well as some context on the exact bandwidth concerns the existing sinks face. We're happy to extend Vector to include new projects, but we're also careful about increasing the project's surface area when workarounds already exist. Sorry about chiming in so late in the game, and thanks in advance! 😸

davidhuie-dd avatar Feb 10 '23 21:02 davidhuie-dd

Hi @davidhuie-dd, as a cloud warehouse, we mostly handle large amounts of data, and the transfer fee can be extremely high with direct inserts into the database. So we take advantage of S3 pre-signed URLs, which are commonly used by cloud warehouse providers such as Snowflake. Since S3 uploads are free, with the help of a pre-signed URL we can upload data directly into S3 for the database, with little network transfer to the database over the public network. Even in a private VPC network, this feature helps, since cross-AZ transfer fees can still be incredibly high. Neither the http nor the clickhouse sink can do this now, so a new sink is needed.

Also, with pre-signed insert we can ingest with the whole cluster rather than only the single instance that handles the insert statement, which could yield much better performance.

Besides, for the later CSV sink format, neither the http nor the clickhouse sink can be easily adapted, since we would need to configure the exact sink fields and generate the corresponding insert SQL statement.

everpcpc avatar Feb 12 '23 04:02 everpcpc

@everpcpc For documentation purposes: since ingress bandwidth is free on AWS, this is for saving egress bandwidth costs? It seems like it would help when traffic is between a Databend client and server within the same region, but within different AZs. That would make the bandwidth cost free. Thanks.

davidhuie-dd avatar Feb 13 '23 22:02 davidhuie-dd

@davidhuie-dd some additional notes:

  1. Free ingress bandwidth only applies to EC2 machines with an external IP address. In a common enterprise setup, people usually have a load balancer and a NAT gateway, which can cost even more than the data transfer, both ingress and egress.
  2. Yes, in real production cases, people mostly want their business to be fault tolerant across availability zones, and cross-AZ deployment is also recommended by all cloud providers. So there are always logs written from everywhere.

everpcpc avatar Feb 14 '23 00:02 everpcpc