awesome-dataops
Awesome list of dataops products, open source and resources
Awesome DataOps 
Awesome list of DataOps open source software, online services, courses and use cases
Table of contents
- Open source
- Commercial products and services
Open source
Data Pipeline Orchestration
- Apache Airflow - Airflow is a platform created by the community to programmatically author, schedule and monitor workflows.
- Apache Oozie - Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
- Dagster - A Python library for building data applications: ETL, ML, Data Pipelines, and more.
- dbt command-line tool - the T in ELT. It organizes, cleanses, denormalizes, filters, renames, and pre-aggregates the raw data in your warehouse so that it's ready for analysis.
- Reflow - A language and runtime for distributed, incremental data processing in the cloud
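
What the orchestrators above have in common is that they run tasks in dependency order. The following is a minimal sketch of that idea using only the Python standard library; it is not the API of Airflow, Dagster, or any other tool listed here, and the names `run_pipeline`, `tasks`, and `dependencies` are hypothetical, chosen for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies):
    """Run each task once, after every task it depends on.

    tasks: maps a task name to a callable that receives a dict of
           upstream results (hypothetical convention for this sketch).
    dependencies: maps a task name to the set of task names it needs.
    """
    # static_order() yields names so that dependencies come first.
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return order, results


# A toy extract -> transform -> load chain with made-up data.
tasks = {
    "extract": lambda up: [3, 1, 2],
    "transform": lambda up: sorted(up["extract"]),
    "load": lambda up: len(up["transform"]),
}
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order, results = run_pipeline(tasks, dependencies)
print(order)
print(results["load"])
```

Real orchestrators add scheduling, retries, monitoring, and distributed execution on top of this ordering step, which is why the tools in this section exist.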
ETL tools
- Apache Kafka - a distributed streaming platform.
- Apache NiFi - an easy to use, powerful, and reliable system to process and distribute data.
- Squirrel - a Python library for large-scale data loading, transforming and sharing.
Commercial products and services
Platforms
- Astronomer - spin up and scale Apache Airflow clusters
- Databand - Databand tracks your pipeline execution metadata, so you can evaluate changes in runtimes, code, data, and critical business KPIs.
- DataKitchen - an end-to-end DataOps platform that automates and coordinates all the people, tools, and environments in your entire data analytics organization – everything from orchestration, testing, and monitoring to development and deployment.
- Prefect - a new workflow management system, designed for modern infrastructure and powered by open-source software.
- Saagie - Saagie DataOps Orchestrator integrates the commercial and open source data technologies to accelerate project delivery
- Unravel - helps ops engineers, app developers, and enterprise architects reduce the complexity of delivering reliable application performance – providing unified visibility and operational intelligence to optimize your entire ecosystem
Cloud ETL
- AWS Glue - a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores.
- Azure Data Factory - a hybrid data integration service that simplifies ETL operations
- Google Cloud Dataflow - unified stream and batch data processing that's serverless, fast, and cost-effective.
- ETLWorks - a cloud-first, any-to-any data integration platform
Data catalogs
- Alation Data Catalog - a data catalog designed for human collaboration
- Collibra Data Catalog - empowers business users to quickly discover and understand data that matters
- SQL Data Catalog - a tool to discover and classify sensitive data for MS SQL Server
Testing and monitoring
- RightData - a data testing, reconciliation, and validation suite that helps stakeholders identify issues related to data consistency, quality, completeness, and gaps.