
StreamSets Tutorials

StreamSets DataOps Platform Tutorials

The following tutorials demonstrate features of StreamSets Data Collector, StreamSets Transformer, StreamSets Control Hub, and StreamSets SDK for Python.

StreamSets Data Collector -- Basic Tutorials

  • Log Shipping to Elasticsearch - Read weblog files from a local filesystem directory, decorate some of the fields (e.g. GeoIP Lookup), and write them to Elasticsearch.

  • Simple Kafka Enablement using StreamSets Data Collector

  • What’s the Biggest Lot in the City of San Francisco? - Read city lot data from JSON, calculate lot areas in JavaScript, and write them to Hive.

  • Ingesting Local Data into Azure Data Lake Store - Read records from a local CSV-formatted file, mask out PII (credit card numbers) and send them to a JSON-formatted file in Azure Data Lake Store.

  • Working with StreamSets Data Collector and Microsoft Azure - Integrate Azure Blob Storage, Apache Kafka on HDInsight, Azure SQL Data Warehouse and Apache Hive backed by Azure Blob Storage.
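The PII-masking step mentioned in the Azure Data Lake Store tutorial can be illustrated outside a pipeline. The following is a minimal Python sketch, not the tutorial's actual pipeline configuration: the `mask_card_numbers` function, the `CARD_PATTERN` regular expression, and the choice to keep the last four digits are all assumptions made for illustration.

```python
import re

# Assumed pattern: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def mask_card_numbers(text, keep_last=4):
    """Replace the digits of card-like numbers with 'x', keeping the last few."""

    def _mask(match):
        raw = match.group(0)
        total_digits = sum(ch.isdigit() for ch in raw)
        seen = 0
        out = []
        for ch in raw:
            if ch.isdigit():
                seen += 1
                # Keep only the trailing `keep_last` digits readable.
                out.append(ch if seen > total_digits - keep_last else "x")
            else:
                # Separators (spaces, hyphens) are preserved as-is.
                out.append(ch)
        return "".join(out)

    return CARD_PATTERN.sub(_mask, text)


if __name__ == "__main__":
    print(mask_card_numbers("Paid with 4111-1111-1111-1111 yesterday"))
```

In a real pipeline the equivalent masking would be configured in a processor stage; the sketch only shows the transformation applied to each field value.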

StreamSets Data Collector -- Writing Custom Pipeline Stages

  • Creating a Custom StreamSets Origin - Build a simple custom origin that reads a Git repository's commit log and produces the corresponding records.

  • Creating a Custom Multithreaded StreamSets Origin - A more advanced tutorial focusing on building an origin that supports parallel execution, so the pipeline can run in multiple threads.

  • Creating a Custom StreamSets Processor - Build a simple custom processor that reads metadata tags from image files and writes them to the records as fields.

  • Creating a Custom StreamSets Destination - Build a simple custom destination that writes batches of records to a webhook.

Data Collector API Javadocs are available on request; please reach out to us if you need them.

StreamSets Data Collector -- Advanced Features

  • Ingesting Drifting Data into Hive and Impala - Build a pipeline that handles schema changes in MySQL, creating and altering Hive tables accordingly.

  • Creating a StreamSets Spark Transformer in Java - Build a simple Java Spark Transformer that computes a credit card's issuing network from its number.

  • Creating a StreamSets Spark Transformer in Scala - Build a simple Scala Spark Transformer that computes a credit card's issuing network from its number.

  • Creating a CRUD Microservice Pipeline - Build a microservice pipeline to implement a RESTful web service that reads from and writes to a database via JDBC.

The Data Collector documentation also includes an extended tutorial that walks through basic Data Collector functionality, including creating, previewing and running a pipeline, and creating alerts.
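The two Spark Transformer tutorials listed above compute a credit card's issuing network from its number. The idea can be sketched in a few lines of Python. This is an illustration only: the tutorials themselves use Java and Scala, and the `NETWORK_PREFIXES` table and `issuing_network` function below are hypothetical names with a deliberately partial prefix list.

```python
# Illustrative, partial prefix table (an assumption, not the tutorials' exact list).
NETWORK_PREFIXES = {
    "Visa": ("4",),
    "Mastercard": ("51", "52", "53", "54", "55"),
    "American Express": ("34", "37"),
    "Discover": ("6011", "65"),
}


def issuing_network(card_number):
    """Return the network whose prefix matches the card number, or 'Unknown'."""
    # Ignore spaces, hyphens, and any other non-digit separators.
    digits = "".join(ch for ch in card_number if ch.isdigit())
    for network, prefixes in NETWORK_PREFIXES.items():
        # str.startswith accepts a tuple and matches any of its entries.
        if digits.startswith(prefixes):
            return network
    return "Unknown"


if __name__ == "__main__":
    print(issuing_network("4111 1111 1111 1111"))
```

A transformer would apply the same lookup to every record in a batch and write the result to a new field.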

StreamSets Data Collector -- Kubernetes-based Deployment

  • Kubernetes-based Deployment - Example configurations for Kubernetes-based deployments of StreamSets Data Collector.

StreamSets Control Hub

  • Creating Custom Data Protector Procedure - Create, build, and deploy your own custom data protector procedure that you can use as a protection method to apply to record fields.

StreamSets Transformer

StreamSets SDK for Python

Common

  • Find SDK methods and fields of an object available - Example objects include a pipeline, an SCH job, or a stage within a pipeline.
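One generic way to discover what an object offers is Python's built-in introspection. The sketch below does not use any SDK-specific call; `public_members` is a hypothetical helper built on `dir()` and `getattr()`, and it works on any Python object, including SDK objects.

```python
def public_members(obj):
    """Split an object's public attribute names into (methods, fields)."""
    methods, fields = [], []
    for name in dir(obj):
        # Skip private and dunder attributes.
        if name.startswith("_"):
            continue
        try:
            value = getattr(obj, name)
        except Exception:
            # Some properties raise when read; still report them as fields.
            fields.append(name)
            continue
        (methods if callable(value) else fields).append(name)
    return methods, fields


if __name__ == "__main__":
    methods, fields = public_members("any object works here")
    print(methods[:5], fields)
```

Passing a pipeline, job, or stage object instead of the string returns that object's public methods and fields; `help(obj)` then gives the docstring for anything of interest.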

Control Hub

  • Getting started with StreamSets SDK for Python - Design and publish a pipeline. Then create, start, and stop a job using StreamSets SDK for Python.

  • Jobs related tutorials

    • Sample ways to fetch one or more jobs

    • Start a job and monitor that specific job - Start a job and monitor that specific job using metrics and time series metrics.

    • Move jobs from dev to prod using data_collector_labels - Move jobs from dev to prod by updating their data_collector_labels.

    • Generate a report for a specific job - Generate a report for a specific job, then fetch and download it.

    • See logs for a data-collector where a job is running - Get the Data Collector where a job is running, then view its logs.

  • Pipelines related tutorials

    • Common pipeline methods - Common operations for StreamSets Control Hub pipelines, such as update, duplicate, import, and export.

    • Loop over pipelines and stages and make an edit to stages - When many pipelines and stages need an update, the SDK for Python makes it easy to update them with just a few lines of code.

    • Create CI CD pipeline used in demo - This covers the steps to create the CI CD pipeline used in the SCH CI CD demo. The steps include how to add stages such as JDBC, some processors, and Kineticsearch, and how to set stage configurations. It also shows the use of runtime parameters.
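As a rough sketch of the dev-to-prod label move described in the jobs tutorials above, the helper below changes a job's Data Collector labels. It assumes `sch` is an already authenticated Control Hub client from the StreamSets SDK for Python, and that `sch.jobs.get(job_name=...)`, the `data_collector_labels` attribute, and `sch.update_job(job)` behave as in the SDK version you use; check these against your SDK documentation. The function name `move_job_to_labels` and the job and label names are hypothetical.

```python
def move_job_to_labels(sch, job_name, new_labels):
    """Point a Control Hub job at Data Collectors carrying `new_labels`.

    `sch` is assumed to be an authenticated Control Hub client object.
    """
    # Look the job up by name (assumed SDK call; verify for your version).
    job = sch.jobs.get(job_name=job_name)
    # Replace the labels that decide which Data Collectors run the job.
    job.data_collector_labels = list(new_labels)
    # Persist the change back to Control Hub.
    sch.update_job(job)
    return job
```

A call such as `move_job_to_labels(sch, "my-job", ["prod"])` would then retarget the hypothetical job `my-job` to Data Collectors labeled `prod`. A running job typically needs to be stopped before it is edited, so the linked tutorial remains the reference for the full sequence.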

License

StreamSets Data Collector and its tutorials are built on open source technologies; the tutorials and accompanying code are licensed under the Apache License 2.0.

Contributing Tutorials

We welcome contributors! Please check out our guidelines to get started.