dataproc-templates
Dataproc templates and pipelines for solving simple in-cloud data tasks
Dataproc Templates
Dataproc templates are designed to address various in-cloud data tasks, including data import/export/backup/restore and bulk API operations. These templates leverage the power of Google Cloud's Dataproc, supporting both Dataproc Serverless and Dataproc clusters.
Google provides this collection of pre-implemented Dataproc templates as a reference and for easy customization. (Video Link)
Dataproc Templates (Java - Spark)
Please refer to the Dataproc Templates (Java - Spark) README for more information.
- BigQueryToGCS (blogpost link)
- BigQueryToJDBC (blogpost link)
- CassandraToBigQuery (blogpost link)
- CassandraToGCS (blogpost link)
- DataplexGCStoBQ (blogpost link)
- GCSToBigQuery (blogpost link)
- GCSToBigTable (blogpost link) (Video link)
- GCSToGCS (blogpost link)
- GCSToJDBC (blogpost link)
- GCSToMongo (blogpost link)
- GCSToSpanner (blogpost link)
- GeneralTemplate
- HBaseToGCS (blogpost link)
- HiveToBigQuery (blogpost link)
- HiveToGCS (blogpost link)
- JDBCToBigQuery (blogpost link)
- JDBCToGCS (blogpost link)
- JDBCToJDBC
- JDBCToSpanner
- KafkaToBQ (blogpost link)
- KafkaToBQDstream
- KafkaToGCS (blogpost link)
- KafkaToGCSDstream
- KafkaToPubSub
- MongoToBQ
- MongoToGCS (blogpost link)
- PubSubToBigQuery (blogpost link)
- PubSubToBigTable (blogpost link)
- PubSubLiteToBigTable (blogpost link) Deprecated and will be removed in Q1 2025
- PubSubToGCS (blogpost link)
- RedshiftToGCS (blogpost link) Deprecated and will be removed in Q1 2025
- S3ToBigQuery (blogpost link)
- SnowflakeToGCS (blogpost link)
- SpannerToGCS (blogpost link)
- TextToBigquery Deprecated and will be removed in Q1 2025
- WordCount
Dataproc Templates (Python - PySpark)
Please refer to the Dataproc Templates (Python - PySpark) README for more information.
- AzureBlobToBigQuery
- BigQueryToGCS (blogpost link)
- CassandraToBigquery
- CassandraToGCS (blogpost link)
- ElasticsearchToBigQuery
- ElasticsearchToBigtable
- ElasticsearchToGCS
- GCSToBigQuery (blogpost link)
- GCSToBigTable (blogpost link)
- GCSToGCS (blogpost link)
- GCSToJDBC (blogpost link)
- GCSToMongo (blogpost link)
- HbaseToGCS (blogpost link)
- HiveToBigQuery (blogpost link)
- HiveToGCS (blogpost link)
- JDBCToBigQuery (blogpost link)
- JDBCToGCS (blogpost link)
- JDBCToJDBC (blogpost link)
- KafkaToGCS (blogpost link)
- KafkaToBigQuery (blogpost link)
- MongoToBigQuery
- MongoToGCS (blogpost link)
- PubSubLiteToBigtable Deprecated and will be removed in Q1 2025
- RedshiftToGCS (blogpost link) Deprecated and will be removed in Q1 2025
- S3ToBigQuery
- SnowflakeToGCS (blogpost link)
- TextToBigQuery (blogpost link) Deprecated and will be removed in Q1 2025
Dataproc Templates (Notebooks)
Please refer to the Dataproc Templates (Notebooks) README for more information.
- HiveToBigQuery (blogpost link)
- MsSqlToBigQuery (blogpost link)
- MySQLToSpanner (blogpost link)
- SQLServerToPostgres
- OracleToBigQuery (blogpost link)
- OracleToPostgres (blogpost link)
- OracleToSpanner (blogpost link)
Getting Started
- Clone this repository:
  git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
- Obtain authentication credentials
  Create local credentials by running the following command and completing the OAuth 2.0 flow (read more about the command here):
  gcloud auth application-default login
  Or manually set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the path of a service account key JSON file. Learn more at Setting Up Authentication for Server to Server Production Applications.
  Note: Application Default Credentials can find the credentials implicitly when the application runs on Compute Engine, Kubernetes Engine, App Engine, or Cloud Functions.
- Executing a Template
  Follow the specific guide, depending on your use case:
  - Dataproc Templates (Java - Spark)
  - Dataproc Templates (Python - PySpark)
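Before running a template locally, it can help to check which credentials Application Default Credentials would likely pick up. The following is a minimal sketch of the lookup described in the authentication step above, using only the Python standard library. The function names are hypothetical and not part of this repository. The sketch simplifies the real lookup: it checks the GOOGLE_APPLICATION_CREDENTIALS variable, then the file written by `gcloud auth application-default login`, and otherwise assumes the runtime environment supplies credentials. It does not validate the file contents or contact any Google service.

```python
import os
from pathlib import Path


def adc_well_known_file(env, home):
    """Return the default location of the gcloud application-default credentials file."""
    # On Windows gcloud stores the file under %APPDATA%\gcloud;
    # on Linux and macOS it is stored under ~/.config/gcloud.
    if os.name == "nt" and env.get("APPDATA"):
        base = Path(env["APPDATA"]) / "gcloud"
    else:
        base = Path(home) / ".config" / "gcloud"
    return base / "application_default_credentials.json"


def describe_credential_source(env=None, home=None):
    """Describe where credentials would likely come from (simplified lookup order)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # 1. An explicit credentials file takes precedence.
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path:
        if Path(key_path).is_file():
            return f"credentials file from GOOGLE_APPLICATION_CREDENTIALS: {key_path}"
        return f"GOOGLE_APPLICATION_CREDENTIALS is set but is not a file: {key_path}"

    # 2. Credentials created by `gcloud auth application-default login`.
    well_known = adc_well_known_file(env, home)
    if well_known.is_file():
        return f"gcloud application-default credentials: {well_known}"

    # 3. Nothing local: on Compute Engine, Kubernetes Engine, App Engine,
    #    or Cloud Functions the environment provides credentials implicitly.
    return "no local credentials found; relying on the runtime environment"


if __name__ == "__main__":
    print(describe_credential_source())
```

If the script reports that no local credentials were found on a workstation, running `gcloud auth application-default login` or setting GOOGLE_APPLICATION_CREDENTIALS as described above should resolve it.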
Flow diagram
The flow diagram below shows the execution flow for Dataproc Templates:
Contributing
See the contributing instructions to get started.
License
All solutions within this repository are provided under the Apache 2.0 license. Please see the LICENSE file for more detailed terms and conditions.
Disclaimer
This repository and its contents are not an official Google Product.
Contact
Share your feedback, ideas, and thoughts through the feedback form.
Questions, issues, and comments should be directed to [email protected]