delta-sharing
Support writing to Delta Shares
Since the underlying method of sharing out data uses secure signed URLs, the same process can be used in reverse to collect files for ingestion. Extending the Delta Sharing protocol this way would allow secure, fast, unlimited-bandwidth uploads of data from any platform to any cloud object store. Signed URLs can be used by curl, the Python requests module, and any other HTTP library.
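For illustration, a minimal client-side sketch of such an upload, assuming the sharing server hands back a write-only pre-signed URL (the URL and file name here are placeholders):

```python
# Hypothetical sketch: upload a local Parquet file through a pre-signed URL.
# The URL is a placeholder; in practice the sharing server would generate it
# (e.g., an S3 pre-signed PUT URL) and return it to the client.
import requests

presigned_put_url = "https://example-bucket.s3.amazonaws.com/staging/part-0000.parquet?X-Amz-Signature=..."

with open("part-0000.parquet", "rb") as f:
    resp = requests.put(presigned_put_url, data=f)
resp.raise_for_status()

# The same upload with curl:
#   curl --upload-file part-0000.parquet "<presigned_put_url>"
```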
This could be used by public health entities, SaaS/IoT companies, pharma/research organizations, retailers, and oil/gas companies to collect data from far-flung locales in large-scale, traceable, secure streaming data ingestion scenarios.
The semantics may need to be limited to append-only writes to the underlying share.
This is a great feature. But it may be hard to support writing because writing also requires updating Delta's transaction logs, which is not as simple as sending a pre-signed URL to the client. For now we will focus on reading first, but in the long term we will definitely think about how to support writing.
Hi, catching up on this one to explain a use case we would have for getting data ingested.
We are a medium-sized, mostly cloud-native company within a larger ecosystem that moves much more slowly in terms of technology, but our services are highly interconnected. Therefore we often not only provide but also receive data. To set up a receiving workflow we usually have some overhead in getting an "I am used to FTP" client to put data into S3. If the Delta Sharing protocol could provide an easy-to-use (as in compatible with retro technology...) unified bidirectional interface, that would be greatly beneficial.
It's a very interesting topic, and I think we basically need a staging data store and a transaction protocol to support atomic commits.
- First, the API should provide write-only pre-signed URLs (e.g., S3 pre-signed URLs) for uploading data. The uploaded data can be stored in a staging area (e.g., an S3 bucket).
- The client uploads the data (e.g., Parquet files) to the staging area through the pre-signed URLs.
- After finishing the upload, the client sends a commit request to make the staged data available in the underlying storage format (e.g., Delta). This step may require copying data from staging into the Delta table or another S3 bucket.
We have implemented this type of protocol in the Treasure Data Spark integration (https://treasure-data.github.io/td-spark/td-spark.html#writing-tables, illustration: https://www.slideshare.net/taroleo/tdspark-internals-extending-spark-with-airframe-spark-meetup-tokyo-3-2020#7), and it scales well for uploading billions of DataFrame records in parallel because we can leverage the scalability of S3. The transaction manager only needs to commit a small set of files listing S3 object paths.
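A rough client-side sketch of the three-step flow above; none of these endpoints or field names exist in today's Delta Sharing protocol, they are only meant to illustrate the shape such an API could take:

```python
# Hypothetical sketch of the three-step flow above. The endpoints, payloads,
# and field names are illustrative only and not part of the current protocol.
import requests

SERVER = "https://sharing.example.com/delta-sharing"
TABLE = "shares/my-share/schemas/default/tables/events"

# 1. Ask the server for write-only pre-signed URLs pointing at a staging area.
resp = requests.post(f"{SERVER}/{TABLE}/write-session", json={"fileCount": 2})
resp.raise_for_status()
session = resp.json()
upload_urls = session["uploadUrls"]        # e.g., S3 pre-signed PUT URLs
transaction_id = session["transactionId"]  # used later to commit exactly once

# 2. Upload Parquet files to the staging area through the pre-signed URLs.
for url, path in zip(upload_urls, ["part-0000.parquet", "part-0001.parquet"]):
    with open(path, "rb") as f:
        requests.put(url, data=f).raise_for_status()

# 3. Ask the server to commit the staged files into the Delta table
#    (the server appends them to the transaction log, or copies them first).
requests.post(f"{SERVER}/{TABLE}/commit",
              json={"transactionId": transaction_id}).raise_for_status()
```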
A challenge of this approach is managing transaction state somewhere, because we must avoid committing the same transaction more than once. We are using an RDBMS for managing such transaction state.
Append-only writes usually cause no conflict unless the target table is deleted, so there would be a way to support atomic commits (e.g., adding a new set of partitions and incrementing the metadata version) on the REST API server side.
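As a sketch of how the server side could keep commits exactly-once, here is a hypothetical example that records each transaction ID under a primary-key constraint before applying the append (SQLite stands in for the RDBMS, and the Delta log append is a placeholder):

```python
# Hypothetical server-side sketch: record each transaction ID under a
# primary-key constraint before applying the commit, so a duplicate commit
# request is rejected. SQLite stands in for the RDBMS holding transaction
# state; the Delta log append is a placeholder.
import sqlite3

conn = sqlite3.connect("tx_state.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS commits (
        transaction_id TEXT PRIMARY KEY,
        committed_at   TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def append_files_to_delta_log(files):
    # Placeholder: a real server would add the staged Parquet files to the
    # Delta transaction log here (or copy them into the table's location).
    pass

def commit_once(transaction_id, staged_files):
    """Apply an append-only commit exactly once; return False if already applied."""
    try:
        with conn:  # one atomic DB transaction: record the ID, then apply the commit
            conn.execute("INSERT INTO commits (transaction_id) VALUES (?)",
                         (transaction_id,))
            append_files_to_delta_log(staged_files)
        return True
    except sqlite3.IntegrityError:
        return False  # this transaction ID was already committed
```

Because the ID insert and the log append share one database transaction, a retried commit request either finds the ID already recorded or re-runs cleanly.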
Do you think there will be support for writing data using Delta Sharing in the near future?