Add support for bulk loading strings, like HBase bulk load

Open git-hulk opened this issue 1 year ago • 19 comments

Motivation

Many scenarios need to bulk-load large amounts of data regularly, and loading it through the regular API can cause a heavy workload and latency spikes. It would be better to offer a way to mitigate this.

Solution

We can use RocksDB's SST file ingestion to bulk load the data, supporting simple strings only.

See more discussion in https://github.com/apache/kvrocks/discussions/1628

git-hulk avatar Mar 08 '23 14:03 git-hulk

Thanks for proposing this. +1 for this feature.

zuston avatar Mar 09 '23 03:03 zuston

I'm willing to submit a PR!

ColinChamber avatar Apr 03 '23 03:04 ColinChamber

@ColinChamber Assigned.

git-hulk avatar Apr 03 '23 03:04 git-hulk

@git-hulk @ColinChamber Thanks for this PR. Is there any progress? Looking forward to this bulk load feature.

liucyao1990 avatar Jun 08 '23 09:06 liucyao1990

Recently I haven't had enough time. I'm looking forward to someone else implementing it. Unassigned. @liucyao1990

ColinChamber avatar Jun 09 '23 05:06 ColinChamber

Thanks @ColinChamber for your update.

git-hulk avatar Jun 09 '23 05:06 git-hulk

@git-hulk For this feature, do we need to provide a command to load data, or a tool?

In my opinion, there are two steps here.

  1. Create SST files with the data.
  2. Ingest the SST files.

The second step requires stopping the world.

Do we need to support online bulk load? Will there be problems with stopping the world?

jihuayu avatar Jun 15 '23 13:06 jihuayu

> In my opinion, there are two steps here. Create SST files with the data. Ingest the SST files.

@jihuayu Yes, you're right. And I think it's good to only support the string type first.

> Do we need to support online bulk load? Will there be problems with stopping the world?

My intuitive thought is yes for the online bulk load, even though it will block the write operations when ingesting SSTs.

> For this feature, do we need to provide a command to load data, or a tool?

From my side, I would like to support loading local SSTs via a command and also provide a tool to generate SST files. For the tool's input file, we can require users to put their data in a specified format such as CSV.
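
A minimal sketch of what the tool's input stage could look like, not an agreed design. RocksDB's `SstFileWriter` requires keys to be added in ascending order, so a generator would need to sort and de-duplicate the rows before writing. The two-column `key,value` CSV layout without a header and the `prepare_records` name are assumptions for illustration. Python compares `bytes` lexicographically, which matches RocksDB's default bytewise comparator. The sketch ignores Kvrocks' internal key/value encoding, which a real tool would have to apply before sorting.

```python
import csv
import io


def prepare_records(csv_text):
    """Return (key, value) byte pairs sorted by key, keeping the last value per key.

    Hypothetical helper: assumes a headerless two-column CSV (key,value).
    """
    latest = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
        key, value = row
        # Later rows overwrite earlier ones, so each key appears only once.
        latest[key.encode("utf-8")] = value.encode("utf-8")
    # Byte-wise ascending order, as an SST writer would expect.
    return sorted(latest.items())


sample = "user:2,bob\nuser:1,alice\nuser:2,carol\n"
records = prepare_records(sample)
for key, value in records:
    print(key, value)
```

The sorted pairs would then be handed to whatever writes the SST file; that step and the ingestion command are left out here because they depend on decisions not yet made in this thread.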

git-hulk avatar Jun 15 '23 16:06 git-hulk

@git-hulk Ok, I'm willing to submit a PR!

jihuayu avatar Jun 17 '23 00:06 jihuayu

Thanks @jihuayu, assigned.

@zuston @liucyao1990 You're also welcome to provide more input about how you would use bulk load.

git-hulk avatar Jun 17 '23 11:06 git-hulk

@git-hulk @jihuayu Hi, here is the bulk load ingestion implementation of Pegasus. https://github.com/apache/incubator-pegasus/pulls?q=label%3Acomponent%2Fbulk_load+. FYI

liucyao1990 avatar Jun 19 '23 02:06 liucyao1990

> @git-hulk @jihuayu Hi, here is the bulk load ingestion implementation of Pegasus. https://github.com/apache/incubator-pegasus/pulls?q=label%3Acomponent%2Fbulk_load+. FYI

Cool, thanks for your input.

git-hulk avatar Jun 19 '23 02:06 git-hulk

I will create the SST generation tool first. We have cluster and replication modes, and ingesting SSTs may work differently in each. I think I can support ingestion in standalone mode first.

jihuayu avatar Jun 24 '23 03:06 jihuayu

Yes, that's right. It's good to NOT support the replication for now.

git-hulk avatar Jun 24 '23 04:06 git-hulk

Are there any updates here?

JackyYangPassion avatar Apr 09 '24 07:04 JackyYangPassion

@JackyYangPassion No. Do you want to have a try?

jihuayu avatar Apr 10 '24 00:04 jihuayu

> @JackyYangPassion No. Do you want to have a try?

OK, I've been researching how to generate SST files recently.

I read the discussions in https://github.com/apache/kvrocks/discussions/1628 carefully.

Will this feature initially support only the String type?

JackyYangPassion avatar Apr 12 '24 01:04 JackyYangPassion

@JackyYangPassion Yes, we would like to support strings first since they are the simplest type. And it would definitely be great if it could cover other data types.

git-hulk avatar Apr 12 '24 01:04 git-hulk

@JackyYangPassion Thank you! Supporting strings is the first step in our plan. We want to start with a basic version that users can try, so we can gather feedback on the functionality as early as possible. Later, we will support more types and features.

jihuayu avatar Apr 12 '24 02:04 jihuayu