feat: transform awesome-data-engineering into definitive 2024-2025 resource

Open • duyet opened this issue 1 month ago • 2 comments

Major improvements:

README Transformation:

  • Reorganized by data lifecycle (ingestion → storage → transformation → orchestration → processing → quality → governance → activation → visualization)
  • Fixed all broken markdown syntax (removed spaces in link formatting)
  • Added modern data stack tools (2020-2025):
    • Data Ingestion: Airbyte, Meltano, dlt, Redpanda
    • Data Transformation: dbt, SQLMesh, Polars
    • Orchestration: Dagster, Prefect, Kestra, Mage
    • Data Lakes: Apache Iceberg, Delta Lake, Apache Hudi, XTable
    • Lakehouse: Unity Catalog, Apache Polaris, Nessie
    • Data Quality: Great Expectations, Soda, elementary-data
    • Data Observability: Monte Carlo, OpenMetadata
    • Data Catalogs: DataHub, OpenMetadata, Amundsen
    • Reverse ETL: Census, Hightouch, Grouparoo
    • Semantic Layer: Cube, dbt Semantic Layer
    • Embedded Analytics: DuckDB, MotherDuck
  • Added new critical categories:
    • Data Quality & Observability
    • Data Discovery & Governance
    • Reverse ETL
    • Cloud Data Warehouses (separated from general storage)
    • Data Lakes & Lakehouses (with table formats)
    • Semantic Layer / Metrics Layer
  • Enhanced all descriptions to be action-oriented and clear
  • Improved visual hierarchy with proper heading structure
  • Updated cloud data warehouses section (Snowflake, BigQuery, Databricks SQL, etc.)
  • Added modern serialization formats (Arrow, MessagePack, FlatBuffers)
  • Expanded time-series databases (TimescaleDB, QuestDB, VictoriaMetrics)
  • Updated streaming section with modern tools (RisingWave, ksqlDB, Materialize)
  • Added dashboarding frameworks (Streamlit, Dash, Gradio, Panel)
  • Refreshed infrastructure section with modern IaC and monitoring tools
  • Added table of contents with proper anchor links
  • Removed outdated or deprecated tools
  • Added "Last updated" timestamp
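For reviewers, the link-formatting fix listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script actually used for this change: the function name `fix_link_spacing`, the regular expression, and the sample entry with its `example.com` URL are all hypothetical.

```python
import re

# Assumed shape of a broken entry: "[text] (url)", where stray spaces or tabs
# between the closing bracket and the opening parenthesis stop it rendering as a link.
_BROKEN_LINK = re.compile(r"\[([^\]\n]+)\][ \t]+\((https?://[^)\s]+)\)")


def fix_link_spacing(markdown: str) -> str:
    """Collapse '[text] (url)' into '[text](url)'; well-formed links are left alone."""
    return _BROKEN_LINK.sub(r"[\1](\2)", markdown)


if __name__ == "__main__":
    # Hypothetical list entry with a placeholder URL.
    broken = "- [Some Tool] (https://example.com/tool) - Example description."
    print(fix_link_spacing(broken))
```

The pattern only touches bracketed text that is followed by whitespace and a parenthesized `http(s)` URL, so ordinary prose such as "see [1] (unpublished)" is not rewritten.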

Contributing Guidelines Enhancement:

  • Established a clear philosophy of curation over comprehensiveness
  • Defined quality standards for tool inclusion
  • Added format requirements with good/bad examples
  • Created detailed submission guidelines
  • Specified what to include vs. what to exclude
  • Outlined PR process and quality review criteria
  • Added guidance on updating existing entries

Impact: This transforms the list from a dated collection into the definitive, well-curated resource for data engineers in 2024-2025. Every tool is production-ready, actively maintained, and represents current best practices.

duyet • Nov 16 '25 07:11

Hi @duyet, thank you for the contribution. I like where you want to take this repo, but it does feel like a large jump. There are also conflicts that need to be resolved before I can merge. If you could explain more about your intentions and desired end result, I think we can get to a point where this makes the repo better. If this MR was auto-generated and you don't have a stake in it, then I will likely take some of these ideas and implement them manually.

vordimous • Nov 30 '25 13:11

@igorbarinov Do you have any opinions here?

vordimous • Nov 30 '25 21:11