
Bulk load for greptimedb

Open killme2008 opened this issue 3 years ago • 3 comments

Bulk load data from sources, such as:

  • csv file
  • json file
  • parquet file
  • other tables
  • mysql table
  • ....

killme2008, Nov 07 '22 00:11

I investigated bulk loading parquet files last week. Since parquet is our native (and only) supported format, we only need to supply a manifest and our own metadata (in persistent storage and in the meta server) to make parquet files query-able and even writable.

But what about other formats like csv or json? They cannot be queried directly (for now). I can think of two approaches:

  • an offline converter that converts other formats into parquet, after which we ingest the converted parquet file.
  • add support for those formats.

waynexia, Nov 07 '22 06:11

make parquet files query-able and even writable.

And in a cluster we should have to split the file according to the table's partition rule as well? This is better done in the frontend via some custom SQL like COPY INTO.

And let frontend to deal with more formats like csv or json. We can convert them to parquet internally.

sunng87, Nov 10 '22 16:11

And in a cluster we should have to split the file according to the table's partition rule as well?

Yes. We can let the frontend preprocess (split) the file and upload all the resulting pieces to OSS.
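To make the split step concrete, here is a minimal sketch of dividing rows by a range partition rule before upload. It assumes the partition rule is a sorted list of exclusive upper bounds on a single column. The names `partition_index`, `split_rows`, and `load_csv_rows`, and the `host_id` and `cpu` columns, are hypothetical and are not taken from the GreptimeDB codebase.

```python
import bisect
import csv
import io
from collections import defaultdict


def partition_index(value, bounds):
    """Return the index of the range partition that `value` falls into.

    `bounds` is a sorted list of exclusive upper bounds (an assumed rule
    shape): partition 0 holds values < bounds[0], partition i holds values
    in [bounds[i-1], bounds[i]), and the last partition holds the rest.
    """
    return bisect.bisect_right(bounds, value)


def split_rows(rows, column, bounds):
    """Group rows (dicts) by the partition their `column` value maps to."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_index(row[column], bounds)].append(row)
    return dict(parts)


def load_csv_rows(text):
    """Parse CSV text with hypothetical columns `host_id` and `cpu`."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"host_id": int(r["host_id"]), "cpu": float(r["cpu"])} for r in reader]


if __name__ == "__main__":
    sample = "host_id,cpu\n3,0.5\n12,0.7\n25,0.1\n10,0.9\n"
    parts = split_rows(load_csv_rows(sample), "host_id", [10, 20])
    for index in sorted(parts):
        # Each group would become one file to convert and upload.
        print(index, [row["host_id"] for row in parts[index]])
```

Each resulting group would then be written out as its own file and uploaded. How the real partition rule is represented, and how the files are registered afterwards, is not covered by this sketch.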

And let frontend to deal with more formats like csv or json. We can convert them to parquet internally.

I also prefer converting other formats to parquet. Supporting them directly is not complex, but considering possible future modifications, it would be better to unify on one format.

waynexia, Nov 11 '22 04:11

Already implemented in #1038 and #1064.

killme2008, May 08 '23 07:05