feat: transform awesome-data-engineering into definitive 2024-2025 resource
Major improvements:
README Transformation:
- Reorganized by data lifecycle (ingestion → storage → transformation → orchestration → processing → quality → governance → activation → visualization)
- Fixed all broken Markdown links (removed stray spaces in the link syntax)
- Added modern data stack tools (2020-2025):
  - Data Ingestion: Airbyte, Meltano, dlt, Redpanda
  - Data Transformation: dbt, SQLMesh, Polars
  - Orchestration: Dagster, Prefect, Kestra, Mage
  - Data Lakes: Apache Iceberg, Delta Lake, Apache Hudi, XTable
  - Lakehouse: Unity Catalog, Apache Polaris, Nessie
  - Data Quality: Great Expectations, Soda, elementary-data
  - Data Observability: Monte Carlo, OpenMetadata
  - Data Catalogs: DataHub, OpenMetadata, Amundsen
  - Reverse ETL: Census, Hightouch, Grouparoo
  - Semantic Layer: Cube, dbt Semantic Layer
  - Embedded Analytics: DuckDB, MotherDuck
- Added new critical categories:
  - Data Quality & Observability
  - Data Discovery & Governance
  - Reverse ETL
  - Cloud Data Warehouses (separated from general storage)
  - Data Lakes & Lakehouses (with table formats)
  - Semantic Layer / Metrics Layer
- Enhanced all descriptions to be action-oriented and clear
- Improved visual hierarchy with proper heading structure
- Updated cloud data warehouses section (Snowflake, BigQuery, Databricks SQL, etc.)
- Added modern serialization formats (Arrow, MessagePack, FlatBuffers)
- Expanded time-series databases (TimescaleDB, QuestDB, VictoriaMetrics)
- Updated streaming section with modern tools (RisingWave, ksqlDB, Materialize)
- Added dashboarding frameworks (Streamlit, Dash, Gradio, Panel)
- Refreshed infrastructure section with modern IaC and monitoring tools
- Added table of contents with proper anchor links
- Removed outdated or deprecated tools
- Added "Last updated" timestamp
Contributing Guidelines Enhancement:
- Established a clear philosophy of curation over comprehensiveness
- Defined quality standards for tool inclusion
- Added format requirements with good/bad examples
- Created detailed submission guidelines
- Specified what to include vs. what to exclude
- Outlined PR process and quality review criteria
- Added guidance on updating existing entries
Impact: This transforms the list from a dated collection into the definitive, well-curated resource for data engineers in 2024-2025. Every tool is production-ready, actively maintained, and represents current best practices.
Hi @duyet, thank you for the contribution. I like the direction you want to take this repo, but it does feel like a large jump. There are also conflicts that need to be resolved before I can merge. If you could explain more about your intentions and desired end result, I think we can get to a point where this makes the repo better. If this MR was auto-generated and you don't have a stake in it, then I will likely take some of these ideas and implement them manually.
@igorbarinov Do you have any opinions here?