
SNOW-590041: Support a Distributed Write API (inverse of distributed fetch)

Open · njriasan opened this issue 2 years ago · 0 comments

What is the current behavior?

When you have data that is located in memory on multiple different machines, writing this data to Snowflake requires intermediate steps to perform the write in a single transaction. The crucial limiting factor (as I see it) is that each connection is assigned a unique ID for transactions, so if you want to group a set of writes into the same transaction, the requests must all be sent from a single connection. Here are the options that I currently see available:

Write to intermediate storage

One option is that you could write to disk "somewhere". For example, you could write the data into cloud storage such as S3 and then stage those files into Snowflake. This may work well if you already have S3 set up, but it adds an intermediate step to configure and may impact performance.
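Roughly, this is what that looks like with today's connector (a minimal sketch; it assumes the workers have already written Parquet files to an S3 bucket and that an external stage named `my_s3_stage` points at that bucket, and the stage, path, and table names are placeholders):

```python
# Minimal sketch of the intermediate-storage approach. Assumes the workers
# already wrote Parquet files to S3 and that an external stage my_s3_stage
# points at that bucket; stage, path, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="...", password="...", account="...",
    warehouse="...", database="...", schema="...",
)
try:
    cur = conn.cursor()
    # Load every staged file into the target table with one statement, so the
    # load itself runs as a single transaction.
    cur.execute(
        "COPY INTO my_table FROM @my_s3_stage/output/ "
        "FILE_FORMAT = (TYPE = PARQUET) "
        "MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
    )
finally:
    conn.close()
```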

Gather the data onto 1 node

Another option without going to disk would be to gather all of the data onto a single node and then do the write from a single connection. However, engines that use distributed data often do so because the amount of data being processed may exceed the memory available on a single machine. To get around this you would need to do some form of pipelining, which would likely be very slow.
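For reference, a rough sketch of this approach, assuming the engine can collect every partition onto the driver as a pandas DataFrame (connection parameters and the table name are placeholders):

```python
# Rough sketch of the gather-onto-one-node approach. Assumes all partitions
# fit in memory on this single node; connection parameters and MY_TABLE are
# placeholders.
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Hypothetical partitions collected from the workers onto this node.
partitions = [pd.DataFrame({"A": [1, 2]}), pd.DataFrame({"A": [3, 4]})]
df = pd.concat(partitions, ignore_index=True)

conn = snowflake.connector.connect(
    user="...", password="...", account="...",
    warehouse="...", database="...", schema="...",
)
try:
    # A single connection performs the whole write.
    write_pandas(conn, df, table_name="MY_TABLE")
finally:
    conn.close()
```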

Do multiple transactions

The third option I see is for a system to separate tasks into separate transactions and just use one connection per node/task. In addition to not satisfying many users' expectations, this would also likely impact performance, since each transaction may now need rollbacks to order the transactions correctly.
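A sketch of this workaround under the same placeholder assumptions; the key point is that each worker commits on its own connection, so there is no single transaction covering all of the data:

```python
# Sketch of the one-connection-per-worker workaround; each worker's write is
# its own transaction, so the partitions are not committed atomically as a group.
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

def write_partition(df, connect_kwargs):
    # Runs on each worker with its own partition (a pandas DataFrame).
    conn = snowflake.connector.connect(**connect_kwargs)
    try:
        # Separate connection => separate transaction; this partition becomes
        # visible on its own, independently of the other workers.
        write_pandas(conn, df, table_name="MY_TABLE")
    finally:
        conn.close()
```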

If I am missing any other options to do a proper distributed write please let me know, especially if you have any suggestions on the best way to do this right now.

What is the desired behavior?

I would like the Snowflake connector to be able to perform a true distributed write, whereby a series of connections could do inserts that all function as part of the same transaction. To do this, I would like to see the following behavior:

  • Multiple connections can participate as part of the same transaction.
  • A single commit is used to write all of the data in the distributed write transaction.
  • Rolling back the transaction rolls back the writes from all connections in this transaction.
  • Writes from different sources should be truly parallel so long as they are purely appends. Some work would still be expected to match sort/clustering requirements.
  • No need for users to gather data onto a single machine or into intermediate storage.

I think there are various ways that this could be implemented as a user-friendly solution.
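As one purely hypothetical shape such an API could take (to be clear, none of the methods below exist in snowflake-connector-python today; `begin_distributed_transaction`, `join_transaction`, and the token are invented for illustration):

```python
# Purely hypothetical sketch of a distributed-write API; these methods do not
# exist in snowflake-connector-python today.
import snowflake.connector

# Coordinator: open the shared transaction and obtain a shareable token.
coord = snowflake.connector.connect(user="...", password="...", account="...")
token = coord.begin_distributed_transaction()  # hypothetical

def worker_write(rows, connect_kwargs, token):
    # Each worker joins the same transaction via the token and appends its
    # partition; appends from different workers could run in parallel.
    conn = snowflake.connector.connect(**connect_kwargs)
    try:
        conn.join_transaction(token)  # hypothetical
        conn.cursor().executemany(
            "INSERT INTO MY_TABLE (A, B) VALUES (%s, %s)", rows
        )
    finally:
        conn.close()

# Coordinator: one commit (or rollback) covers the writes from every worker.
coord.commit()
coord.close()
```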

How would this improve snowflake-connector-python?

I think this would improve the performance of writing back to Snowflake from distributed compute engines. In addition, I think it would improve the user experience, since it may reduce the amount of infrastructure needed to successfully write to Snowflake. This would be essential for the performance of compute engines like Bodo when writing to Snowflake.

njriasan · May 13 '22 14:05