awesome-data-catalogs
Awesome Data Catalogs and Observability Platforms.
Awesome Data Discovery and Observability 
This repository contains a curated list of awesome data catalogs and observability platforms that help you discover, manage, and observe data in your organization.
Contents: Existing Data Discovery and Observability Solutions
OSS Data Catalogs | Proprietary Monocloud DCs | Proprietary Observability Tools | Other Proprietary DCs |
---|---|---|---|
Amundsen | Google DC | Monte Carlo | Alation |
DataHub | Azure DC | Databand | Atlan |
Marquez | | Datafold | Collibra |
Atlas | | Ataccama | DataGalaxy |
CKAN | | | Informatica |
Magda | | | Stemma |
High-Level Feature Comparison
Tool | Specification-based | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|---|
Alation | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
Amundsen | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ |
Ataccama | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
Atlan | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
Atlas | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ |
Azure DC | ❌ | ✔️ | ? | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
CKAN | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ |
Collibra | ❌ | ✔️ | ? | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
DataGalaxy | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ✔️ | ✔️ |
Databand | ❌ | ? | ? | ? | ❌ | ? | ? | ? | ✔️ |
Datafold | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
DataHub | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ |
Google DC | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
Informatica | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
Magda | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ |
Marquez | OpenLineage | ✔️ | ❌ | ✔️ | ? | ❌ | ❌ | ❌ | ❌ |
Monte Carlo | ❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
Stemma | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
Talend | ❌ | ✔️ | ? | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
Definitions:
- Specification-based - uses an open standard for collecting metadata, which shortens time-to-discovery and makes it possible to federate data catalogs.
- Search-based - lets users search for data assets.
- Lineage-based - provides lineage for all entities the solution operates on.
- Network-based - provides rich context about data asset ownership.
- Federation - the ability to map multiple data catalogs into a single UI, avoiding repeated data collection.
- End-to-end lineage - data lineage that includes all data assets used in the organization across all its data catalogs and ML tools.
- ML 1st citizen - treats ML entities as first-class assets, so you can use them like any other data asset.
- Data Quality - includes mature data quality assurance tools.
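To make the Federation and End-to-end lineage definitions more concrete, here is a minimal sketch. It assumes two hypothetical catalogs that each hold only part of the lineage graph; the asset names and the `federate` and `all_upstream` helpers are invented for illustration and are not the API of any tool listed here.

```python
from collections import defaultdict

# Hypothetical lineage edges (upstream -> downstream) held by two separate catalogs.
catalog_a = [("raw.orders", "staging.orders"), ("staging.orders", "mart.revenue")]
catalog_b = [("mart.revenue", "ml.features"), ("ml.features", "ml.model")]


def federate(*catalogs):
    """Merge the edges of several catalogs into one index: asset -> direct upstreams."""
    upstream = defaultdict(set)
    for edges in catalogs:
        for source, target in edges:
            upstream[target].add(source)
    return upstream


def all_upstream(asset, upstream):
    """Return every asset that `asset` depends on, directly or transitively."""
    seen = set()
    stack = [asset]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Lineage seen by one catalog alone stops at that catalog's boundary.
partial = all_upstream("ml.model", federate(catalog_b))

# Lineage over the federated view reaches back to the raw source.
federated = federate(catalog_a, catalog_b)
end_to_end = all_upstream("ml.model", federated)

print(sorted(partial))
print(sorted(end_to_end))
```

In this sketch, the single-catalog view of `ml.model` ends at `mart.revenue`, while the federated view continues back to `raw.orders`. That gap is what the End-to-end Lineage column is meant to capture.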
Open-Source Data Catalogs
Amundsen
A popular open-source data catalog for metadata management and data discovery that originated at Lyft.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ |
More features
- Strategy: Push
- UX personalization: No
- AI autowiring: No
- Rich data profiling: No
- Recommendations: Yes
- Schemas, Description: Yes
- Complex schemas: No
- Data preview: Yes
- Column statistics: Yes
- Data owner: Yes
- Top data users: Yes
- Change notifications: No
- Change feed: No
- Deployment:
- Supported data sources: Hive, Redshift, Druid, RDBMS, Presto, Snowflake
DataHub
DataHub is an open-source data catalog that originated at LinkedIn, featuring data discovery, data governance, and metadata management.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ |
More features
- Strategy: Push, Pull
- UX personalization: No
- AI autowiring: No
- Rich data profiling: No
- Recommendations: ?
- Schemas, Description: Yes
- Complex schemas: No
- Data preview: ?
- Column statistics: No
- Data owner: Yes
- Top data users: ?
- Change notifications: No
- Change feed: No
- Deployment:
- Supported data sources: Hive, Kafka, RDBMS
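The "Strategy" entry in each feature list says whether metadata is pushed to the catalog by the data source, pulled from the source by the catalog, or both. The sketch below illustrates the difference under simplifying assumptions; `Catalog` and `Warehouse` are hypothetical classes, not the API of DataHub or any other tool in this list.

```python
class Warehouse:
    """Hypothetical data source that can describe its own tables."""

    def __init__(self, tables):
        self.tables = tables

    def list_tables(self):
        # Return a copy of {table name: metadata}.
        return dict(self.tables)


class Catalog:
    """Hypothetical in-memory catalog that supports both strategies."""

    def __init__(self):
        self.entries = {}

    def receive(self, name, metadata):
        # Push: the source (or its pipeline) calls the catalog when metadata changes.
        self.entries[name] = metadata

    def crawl(self, source):
        # Pull: the catalog connects to the source and reads its metadata itself.
        for name, metadata in source.list_tables().items():
            self.entries[name] = metadata


catalog = Catalog()

# Push: a pipeline reports the table it just wrote.
catalog.receive("events", {"columns": ["id", "ts"]})

# Pull: the catalog scans a warehouse, typically on a schedule.
catalog.crawl(Warehouse({"orders": {"columns": ["id", "amount"]}}))

print(sorted(catalog.entries))
```

Push tends to give fresher metadata but requires instrumenting each producer; pull needs no changes on the source side but only sees what exists at crawl time.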
Marquez
Marquez is an open-source data catalog that originated at WeWork, for the collection, aggregation, and visualization of a data ecosystem's metadata.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
OpenLineage | ✔️ | ❌ | ✔️ | ? | ❌ | ❌ | ❌ | ❌ |
More features
- Strategy: Push
- UX personalization: No
- AI autowiring: No
- Rich data profiling: No
- Recommendations: No
- Schemas, Description: Yes
- Complex schemas: No
- Data preview: Yes
- Column statistics: No
- Data owner: Yes
- Top data users: ?
- Change notifications: No
- Change feed: No
- Deployment:
- Supported data sources: S3, Kafka
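Because Marquez is built on OpenLineage and uses a push strategy, producers describe each job run as an OpenLineage run event. The sketch below only builds such an event as a plain dictionary and does not send it anywhere. The namespaces, job name, dataset names, and producer URI are hypothetical, and the `schemaURL` value is illustrative, so check the OpenLineage specification for the version you target.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, event_type="COMPLETE"):
    """Build a minimal OpenLineage-style run event as a plain dict.

    `inputs` and `outputs` are lists of (namespace, name) pairs.
    """
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example-namespace", "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Hypothetical producer URI identifying the code that emitted the event.
        "producer": "https://example.com/hypothetical-producer",
        # Illustrative schema reference; the version segment is an assumption.
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    }


event = build_run_event(
    job_name="load_orders",
    inputs=[("kafka://example-broker:9092", "orders-topic")],
    outputs=[("s3://example-bucket", "warehouse/orders")],
)

# A producer would serialize the event and POST it to the catalog's lineage endpoint.
payload = json.dumps(event, indent=2)
print(payload)
```

The input and output datasets here mirror the two source types listed for Marquez above (Kafka and S3); any real deployment would use its own naming conventions.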
Atlas
Apache Atlas is an open-source data catalog for metadata collection, governance, and data democratization.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ |
More features
- Strategy: Push
- UX personalization: No
- AI autowiring: No
- Rich data profiling: No
- Recommendations: No
- Schemas, Description: Yes
- Complex schemas: No
- Data preview: No
- Column statistics: No
- Data owner: No
- Top data users: ?
- Change notifications: Yes
- Change feed: No
- Deployment:
- Supported data sources: HBase, Hive, Sqoop, Kafka, Storm
CKAN
CKAN is an open-source data catalog for data management, powering data portals for governments and enterprises.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ |
More features
- Strategy: Push
- UX personalization: No
- AI autowiring: No
- Rich data profiling: No
- Recommendations: ?
- Schemas, Description: ?
- Complex schemas: ?
- Data preview: ?
- Column statistics: ?
- Data owner: ?
- Top data users: ?
- Change notifications: ?
- Change feed: ?
- Deployment:
- Supported data sources:
Magda
Magda is an open-source data catalog that features data discovery, metadata enrichment, and federation, focused on geodata.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ |
More features
- Strategy: Push via UI
- UX personalization: No
- AI autowiring: No
- Rich data profiling: No
- Recommendations: No
- Schemas, Description: Yes
- Complex schemas: No
- Data preview: Yes
- Column statistics: No
- Data owner: Yes
- Top data users: ?
- Change notifications: No
- Change feed: No
- Deployment:
- Supported data sources: Mostly geodata
Proprietary Data Catalogs
Collibra
Collibra is an enterprise data catalog that helps users discover and understand the data that matters and derive insights from it.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ? | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
More features
- Strategy: Push
- UX personalization: Yes
- AI autowiring: ?
- Network-based: No
- Rich data profiling: ?
- Supported data sources:
Informatica
Informatica is an enterprise data catalog that provides an AI-powered data discovery engine to scan and catalog data assets.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
More features
- Strategy: Push
- UX personalization: ?
- AI autowiring: ?
- Network-based: Yes
- Rich data profiling: Yes
- Supported data sources:
Alation
Alation is a collaborative data catalog that helps companies to drive value and business impact from their data.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
More features
- Strategy: Push
- UX personalization: Yes
- AI autowiring: No
- Network-based: No
- Rich data profiling: No
- Supported data sources:
Atlan
Atlan is a modern data catalog offering data discovery, data profiling, data quality, data lineage and data governance.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
More features
- Strategy: Pull
- UX personalization: ?
- AI autowiring: ?
- Network-based: No
- Rich data profiling: ?
- Supported data sources: Presto, Deequ, Atlas, Airflow, Hudi
DataGalaxy
DataGalaxy is a modern data catalog offering data discovery, data profiling, data quality, data lineage and data governance.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ✔️ | ✔️ |
More features
- Strategy: Pull & Push
- UX personalization: Yes
- AI autowiring: Yes
- Network-based: Yes
- Rich data profiling: Yes
- Supported data sources:
Stemma
Stemma is a fully managed data catalog powered by the open-source data catalog Amundsen that helps data teams have total trust in their data.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
More features
- Strategy: Push
- UX personalization: No
- AI autowiring: No
- Network-based: No
- Rich data profiling: No
- Supported data sources:
Talend
Talend is a data catalog that helps enterprises power critical business decisions with trusted data.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ? | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
More features
- Strategy: Push
- UX personalization: Yes
- AI autowiring: ?
- Network-based: ?
- Rich data profiling: Yes
- Supported data sources:
Monocloud Data Catalogs
Google Cloud Data Catalog
Google Cloud Data Catalog is a fully managed, scalable metadata management service in Google Cloud's Data Analytics family of products.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
More features
- Strategy: Pull
- UX personalization: ?
- AI autowiring: ?
- Network-based: No
- Rich data profiling: No
- Supported data sources:
Azure Data Catalog
Azure Data Catalog is a fully managed, enterprise-wide metadata catalog that makes data asset discovery straightforward.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ? | ✔️ | ❌ | ❌ | ? | ❌ | ❌ |
More features
- Strategy: Pull
- UX personalization: ?
- AI autowiring: ?
- Network-based: ?
- Rich data profiling: ?
- Supported data sources:
Data Observability Platforms
Monte Carlo
Monte Carlo is a data observability tool that helps to increase trust in data by eliminating or preventing data downtime.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
More features
- Strategy: Pull
- UX personalization: ?
- AI autowiring: ?
- Network-based: ?
- Rich data profiling: ?
- Supported data sources: Snowflake, Hive, Kafka, Looker, Redshift, Tableau, BigQuery, Airflow, Fivetran, Presto, Mode, Periscope, Databricks, Glue, dbt, Chartio, Spark, AWS, S3, data.world, Google Cloud Platform
Databand
Databand is an observability platform that helps data engineers identify and troubleshoot pipeline issues and data quality problems.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ? | ? | ? | ❌ | ? | ? | ? | ✔️ |
More features
- Strategy: Push
- UX personalization: ?
- AI autowiring: ?
- Network-based: ?
- Rich data profiling: ?
- Supported data sources:
Datafold
Datafold is a data monitoring and observability platform that gives you confidence in your data quality through diffs, profiling, and anomaly detection.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ✔️ |
More features
- Strategy: Push
- UX personalization: ?
- AI autowiring: ?
- Network-based: ?
- Rich data profiling: ?
- Supported data sources:
Ataccama
Ataccama is an enterprise data catalog and observability tool featuring data profiling and data quality management, designed for data professionals.
Based on Open Standard | Search-based | Network-based | Lineage-based | Federation | ML 1st Citizen | Data Quality | End-to-end Lineage | Observability |
---|---|---|---|---|---|---|---|---|
❌ | ✔️ | ❌ | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ |
More features
- Strategy: Pull
- UX personalization: Yes
- AI autowiring: No
- Network-based: No
- Rich data profiling: Yes
- Supported data sources: