Awesome Data Discovery and Observability

This repository contains a curated list of awesome data catalogs and observability platforms that help you discover, manage, and observe data in your organization.

Contents: Existing Data Discovery and Observability Solutions

OSS Data Catalogs	Proprietary Monocloud DCs	Proprietary Observability Tools	Other Proprietary DCs
📙 Amundsen	📒 Google DC	🔍 Monte Carlo	📕 Alation
📙 DataHub	📒 Azure DC	🔍 Databand	📕 Atlan
📙 Marquez		🔍 Datafold	📕 Collibra
📙 Atlas		🔍 Ataccama	📕 DataGalaxy
📙 CKAN			📕 Informatica
📙 Magda			📕 Stemma

High-Level Feature Comparison

Tool	Specification-based	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
Alation	❌	✔️	❌	✔️	❌	❌	✔️	❌	❌
Amundsen	❌	✔️	✔️	✔️	❌	❌	❌	❌	❌
Ataccama	❌	✔️	❌	✔️	❌	❌	✔️	❌	❌
Atlan	❌	✔️	❌	✔️	❌	❌	✔️	❌	❌
Atlas	❌	✔️	❌	✔️	❌	❌	❌	❌	❌
Azure DC	❌	✔️	?	✔️	❌	❌	?	❌	❌
CKAN	❌	✔️	❌	❌	✔️	❌	❌	❌	❌
Collibra	❌	✔️	?	✔️	❌	❌	?	❌	❌
DataGalaxy	❌	✔️	✔️	✔️	❌	❌	❌	✔️	✔️
Databand	❌	?	?	?	❌	?	?	?	✔️
Datafold	❌	✔️	✔️	✔️	❌	❌	✔️	❌	✔️
DataHub	❌	✔️	✔️	✔️	❌	❌	❌	❌	❌
Google DC	❌	✔️	❌	✔️	❌	❌	?	❌	❌
Informatica	❌	✔️	✔️	✔️	❌	❌	✔️	❌	❌
Magda	❌	✔️	❌	❌	✔️	❌	❌	❌	❌
Marquez	OpenLineage	✔️	❌	✔️	?	❌	❌	❌	❌
Monte Carlo	❌	✔️	❌	✔️	❌	❌	✔️	❌	✔️
Stemma	❌	✔️	✔️	✔️	❌	❌	?	❌	❌
Talend	❌	✔️	?	✔️	❌	❌	✔️	❌	❌

Definitions:

Specification-based - uses an open standard for collecting metadata to allow efficient time-to-discovery and federating data catalogs
Search-based - lets users search for data assets
Lineage-based - provides lineage for all entities the solution operates on
Network-based - provides rich context about data asset ownership
Federation - the ability to map multiple data catalogs into a single UI to avoid repeated data collection.
End-to-end lineage - data lineage that includes all data assets used in the organization across all its data catalogs and ML tools.
ML 1st citizen - treats ML entities as first-class objects, so you can use them like any other data asset.
Data Quality - includes mature data quality assurance tools.
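
The comparison table can also be queried programmatically. The snippet below is a minimal sketch, not part of any listed tool: it copies a handful of rows from the table above into a small Python structure and filters them by feature. The column keys, the `Y`/`N`/`?` encoding, and the `tools_with` helper are assumptions made for illustration only.

```python
# Column order follows the "High-Level Feature Comparison" table.
# The snake_case keys are illustrative names, not an official schema.
COLUMNS = [
    "specification", "search", "network", "lineage", "federation",
    "ml_first_citizen", "data_quality", "end_to_end_lineage", "observability",
]

# A subset of rows copied from the table: Y = supported, N = not supported,
# ? = unknown. Marquez is left out because its first cell is a name
# ("OpenLineage") rather than a flag.
ROWS = {
    "Amundsen":    "NYYYNNNNN",
    "CKAN":        "NYNNYNNNN",
    "DataGalaxy":  "NYYYNNNYY",
    "Databand":    "N???N???Y",
    "Datafold":    "NYYYNNYNY",
    "Monte Carlo": "NYNYNNYNY",
}

FLAG_VALUES = {"Y": True, "N": False, "?": None}


def parse_row(flags):
    """Turn a flag string into a {column: True/False/None} mapping."""
    return {col: FLAG_VALUES[flag] for col, flag in zip(COLUMNS, flags)}


MATRIX = {tool: parse_row(flags) for tool, flags in ROWS.items()}


def tools_with(*required):
    """Return tools that are confirmed to have every required feature.

    Unknown ("?") cells count as not confirmed, so they are excluded.
    """
    return sorted(
        tool for tool, feats in MATRIX.items()
        if all(feats[col] is True for col in required)
    )


print(tools_with("observability", "data_quality"))  # ['Datafold', 'Monte Carlo']
print(tools_with("federation"))                     # ['CKAN']
```

Treating `?` as "not confirmed" keeps the filter conservative: a tool only appears in the result when the table explicitly marks every requested feature as supported.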

📙 Open-Source Data Catalogs

Amundsen

Website | GitHub

A popular open-source data catalog for metadata management and data discovery that originated at Lyft.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	✔️	✔️	❌	❌	❌	❌	❌

More features

Strategy: Push
UX personalization: No
AI autowiring: No
Rich data profiling: No
Recommendations: Yes
Schemas, Description: Yes
Complex schemas: No
Data preview: Yes
Column statistics: Yes
Data owner: Yes
Top data users: Yes
Change notifications: No
Change feed: No
Deployment:
Supported data sources: Hive, Redshift, Druid, RDBMS, Presto, Snowflake

DataHub

Website | GitHub

DataHub is an open-source data catalog featuring data discovery, data governance, and metadata management that originated at LinkedIn.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	✔️	✔️	❌	❌	❌	❌	❌

More features

Strategy: Push, Pull
UX personalization: No
AI autowiring: No
Rich data profiling: No
Recommendations: ?
Schemas, Description: Yes
Complex schemas: No
Data preview: ?
Column statistics: No
Data owner: Yes
Top data users: ?
Change notifications: No
Change feed: No
Deployment:
Supported data sources: Hive, Kafka, RDBMS

Marquez

Website | GitHub

Marquez is an open-source data catalog for the collection, aggregation, and visualization of a data ecosystem’s metadata that originated at WeWork.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
OpenLineage	✔️	❌	✔️	?	❌	❌	❌	❌

More features

Strategy: Push
UX personalization: No
AI autowiring: No
Rich data profiling: No
Recommendations: No
Schemas, Description: Yes
Complex schemas: No
Data preview: Yes
Column statistics: No
Data owner: Yes
Top data users: ?
Change notifications: No
Change feed: No
Deployment:
Supported data sources: S3, Kafka

Atlas

Website | GitHub

Apache Atlas is an open-source data catalog for metadata collection, governance, and data democratization.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	❌	✔️	❌	❌	❌	❌	❌

More features

Strategy: Push
UX personalization: No
AI autowiring: No
Rich data profiling: No
Recommendations: No
Schemas, Description: Yes
Complex schemas: No
Data preview: No
Column statistics: No
Data owner: No
Top data users: ?
Change notifications: Yes
Change feed: No
Deployment:
Supported data sources: HBase, Hive, Sqoop, Kafka, Storm

CKAN

Website | GitHub

CKAN is an open-source data catalog for data management, powering data portals for governments and enterprises.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	❌	❌	✔️	❌	❌	❌	❌

More features

Strategy: Push
UX personalization: No
AI autowiring: No
Rich data profiling: No
Recommendations: ?
Schemas, Description: ?
Complex schemas: ?
Data preview: ?
Column statistics: ?
Data owner: ?
Top data users: ?
Change notifications: ?
Change feed: ?
Deployment:
Supported data sources:

Magda

Website | GitHub

Magda is an open-source data catalog that features data discovery, metadata enrichment, and federation, focused on geodata.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	❌	❌	✔️	❌	❌	❌	❌

More features

Strategy: Push via UI
UX personalization: No
AI autowiring: No
Rich data profiling: No
Recommendations: No
Schemas, Description: Yes
Complex schemas: No
Data preview: Yes
Column statistics: No
Data owner: Yes
Top data users: ?
Change notifications: No
Change feed: No
Deployment:
Supported data sources: Mostly geodata

📕 Proprietary Data Catalogs

Collibra

Website | GitHub

Collibra is an enterprise data catalog that helps users discover and understand the data that matters and draw impactful insights from it.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	?	✔️	❌	❌	?	❌	❌

More features

Strategy: Push
UX personalization: Yes
AI autowiring: ?
Network-based: No
Rich data profiling: ?
Supported data sources:

Informatica

Website | GitHub

Informatica is an enterprise data catalog that provides an AI-powered data discovery engine to scan and catalog data assets.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	✔️	✔️	❌	❌	✔️	❌	❌

More features

Strategy: Push
UX personalization: ?
AI autowiring: ?
Network-based: Yes
Rich data profiling: Yes
Supported data sources:

Alation

Website | GitHub

Alation is a collaborative data catalog that helps companies to drive value and business impact from their data.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	❌	✔️	❌	❌	✔️	❌	❌

More features

Strategy: Push
UX personalization: Yes
AI autowiring: No
Network-based: No
Rich data profiling: No
Supported data sources:

Atlan

Website | GitHub

Atlan is a modern data catalog offering data discovery, data profiling, data quality, data lineage and data governance.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	❌	✔️	❌	❌	✔️	❌	❌

More features

Strategy: Pull
UX personalization: ?
AI autowiring: ?
Network-based: No
Rich data profiling: ?
Supported data sources: Presto, Deequ, Atlas, Airflow, Hudi

DataGalaxy

Website | GitHub

DataGalaxy is a modern data catalog offering data discovery, data profiling, data quality, data lineage and data governance.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	✔️	✔️	❌	❌	❌	✔️	✔️

More features

Strategy: Pull & Push
UX personalization: Yes
AI autowiring: Yes
Network-based: Yes
Rich data profiling: Yes
Supported data sources:

Stemma

Website

Stemma is a fully managed data catalog powered by the open-source data catalog Amundsen that helps data teams have total trust in their data.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	✔️	✔️	❌	❌	?	❌	❌

More features

Strategy: Push
UX personalization: No
AI autowiring: No
Network-based: No
Rich data profiling: No
Supported data sources:

Talend

Website | GitHub

Talend is a data catalog that helps enterprises power critical business decisions with trusted data.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	?	✔️	❌	❌	✔️	❌	❌

More features

Strategy: Push
UX personalization: Yes
AI autowiring: ?
Network-based: ?
Rich data profiling: Yes
Supported data sources:

📒 Monocloud Data Catalogs

Google Cloud Data Catalog

Website | GitHub

Google Cloud Data Catalog is a fully managed, scalable metadata management service in Google Cloud's Data Analytics family of products.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	❌	✔️	❌	❌	?	❌	❌

More features

Strategy: Pull
UX personalization: ?
AI autowiring: ?
Network-based: No
Rich data profiling: No
Supported data sources:

Azure Data Catalog

Website

Azure Data Catalog is a fully managed, enterprise-wide metadata catalog that makes data asset discovery straightforward.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	?	✔️	❌	❌	?	❌	❌

More features

Strategy: Pull
UX personalization: ?
AI autowiring: ?
Network-based: ?
Rich data profiling: ?
Supported data sources:

🔍 Data Observability Platforms

Monte Carlo

Website

Monte Carlo is a data observability tool that helps to increase trust in data by eliminating or preventing data downtime.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	❌	✔️	❌	❌	✔️	❌	✔️

More features

Strategy: Pull
UX personalization: ?
AI autowiring: ?
Network-based: ?
Rich data profiling: ?
Supported data sources: Snowflake, Hive, Kafka, Looker, Redshift, Tableau, BigQuery, Airflow, Fivetran, Presto, Mode, Periscope, Databricks, Glue, dbt, Chartio, Spark, AWS, S3, data.world, Google Cloud Platform

Databand

Website | GitHub

Databand is an observability platform that helps data engineers identify and troubleshoot pipeline issues and data quality problems.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	?	?	?	❌	?	?	?	✔️

More features

Strategy: Push
UX personalization: ?
AI autowiring: ?
Network-based: ?
Rich data profiling: ?
Supported data sources:

Datafold

Website | GitHub

Datafold is a data monitoring and observability platform that gives you confidence in your data quality through diffs, profiling, and anomaly detection.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	✔️	✔️	❌	❌	✔️	❌	✔️

More features

Strategy: Push
UX personalization: ?
AI autowiring: ?
Network-based: ?
Rich data profiling: ?
Supported data sources:

Ataccama

Website | GitHub

Ataccama is an enterprise data catalog and observability tool featuring data profiling and data quality management, designed for data professionals.

Based on Open Standard	Search-based	Network-based	Lineage-based	Federation	ML 1st Citizen	Data Quality	End-to-end Lineage	Observability
❌	✔️	❌	✔️	❌	❌	✔️	❌	❌

More features

Strategy: Pull
UX personalization: Yes
AI autowiring: No
Network-based: No
Rich data profiling: Yes
Supported data sources:
