
Allow parquet and compressed csv files in DB->Snowflake replication

Open · nixent opened this issue on Jan 07 '25 · 4 comments

Feature Description

DB->Snowflake replication processes data in 3 steps:

  • export from the DB to local csv files
  • PUT the csv files to internal stage in Snowflake
  • Insert/Upsert of the records from the csv files

This works fine; however, the CSV files can get pretty large and the file transfer can take significant time. I suggest adding an option to compress the files and/or store them in Parquet format.
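
For reference, the Snowflake side of steps 2 and 3 looks roughly like the sketch below (the table name, local path, and format options are illustrative placeholders, not sling's actual generated SQL):

```sql
-- Step 2: PUT the locally exported CSV into the table's internal stage.
-- AUTO_COMPRESS=TRUE gzips the file client-side before transfer.
PUT file:///tmp/export/orders_0001.csv @%orders AUTO_COMPRESS=TRUE;

-- Step 3: load the staged file into the target table.
-- COMPRESSION tells Snowflake how the staged file is compressed.
COPY INTO orders
FROM @%orders
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP FIELD_OPTIONALLY_ENCLOSED_BY = '"');
```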

nixent · Jan 07 '25

The issue with Parquet is that it can take a lot of memory, while CSV streams. It's worth an experiment. The temp CSVs should be compressed, though (with zstd). Can you confirm that? Do you have any non-Parquet suggestions? This is the fastest approach I can think of for loading into Snowflake. You can use S3 as temp storage, but it won't speed things up. The other route would be Snowpipe Streaming, but that's a non-starter as it requires a lot of setup.

flarco · Jan 07 '25

parquet vs csv streaming: it would be nice to let the user decide if they are willing to allocate more resources to the EL process.

compression: Snowflake supports multiple COMPRESSION formatTypeOptions: `COMPRESSION = AUTO | GZIP | BZ2 | BROTLI | ZSTD | DEFLATE | RAW_DEFLATE | NONE`, so GZIP or ZSTD would work.
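
For illustration, the load step for a zstd-compressed CSV or a Parquet file could look like the following (a sketch with placeholder table/stage names):

```sql
-- Staged CSV compressed with zstd
COPY INTO orders
FROM @%orders
FILE_FORMAT = (TYPE = CSV COMPRESSION = ZSTD);

-- Staged Parquet file; columns matched to the table by name
COPY INTO orders
FROM @%orders
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```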

resource utilization: I'm wondering whether total resource utilization would differ between writing CSV to disk and compressing it vs. writing Parquet. Parquet is essentially a CSV compressed by columns, so producing a Parquet file should take more or less the same resources as producing a compressed CSV. Limiting the number of records per Parquet file would help with capping memory/CPU utilization, the same way as with a Parquet target.

nixent · Jan 08 '25

it would be nice to let the user decide if they are willing to allocate more resources to the EL process.

Agreed.

I'm wondering whether total resource utilization would differ between writing CSV to disk and compressing it vs. writing Parquet.

Perhaps... But yeah, I just read a thread, and once inside the internal stage, Parquet will be faster for the Snowflake engine to load due to its columnar/binary nature.

flarco · Jan 08 '25

@flarco is this feature somewhere on the roadmap?

nixent · Mar 28 '25

Implemented for the next release. It can be set via `target_options.format`, with the values `csv` or `parquet`.
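
As a rough illustration (the connection names and stream are hypothetical, not from this thread), a replication config using the new option might look like:

```yaml
source: MY_POSTGRES     # hypothetical source connection
target: MY_SNOWFLAKE    # hypothetical Snowflake connection

defaults:
  mode: full-refresh
  target_options:
    format: parquet     # or csv

streams:
  public.orders:
```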

flarco · Jun 22 '25