The Data Engineering Cookbook


What is this Book?    How to Contribute    YouTube    Twitter    Amazon Shop


If You Like This Book & Need More Help

Check out my Data Engineering Academy and personal Coaching at LearnDataEngineering.com

Visit learndataengineering.com: Click Here

  • Learn Data Engineering with our online Academy
  • Perfect for becoming a Data Engineer or adding Data Engineering to your skill set
  • Proven process based on years of experience and hundreds of hours of personal coaching
  • Prepared courses on the most important fundamentals, tools and platforms, plus our Associate Data Engineer Certification
  • Private Slack workgroup with over 500 members

Support This Book For Free!

  • Amazon: Click Here and buy whatever you like from Amazon using this link* (also check out my complete podcast gear and books)

Contents:

  • Introduction
  • Basic Engineering Skills
  • Advanced Engineering Skills
  • Hands On Course
  • Case Studies
  • Best Practices Cloud Platforms
  • 130+ Free Data Sources For Data Science
  • 1001 Interview Questions
  • Recommended Books and Courses
  • How To Contribute
  • Support What You Like
  • Important Links

Full Table Of Contents:

Introduction

  • What is this Cookbook
  • Data Engineer vs Data Scientist
    • Data Engineer
    • Data Scientist
    • Machine Learning Workflow
    • Machine Learning Model and Data
  • My Data Science Platform Blueprint
    • Connect
    • Buffer
    • Processing Framework
    • Store
    • Visualize
  • Who Companies Need

Basic Engineering Skills

  • Learn To Code
  • Get Familiar With Git
  • Agile Development
    • Why is agile so important?
    • Agile rules I learned over the years
    • Agile Frameworks
      • Scrum
      • OKR
  • Software Engineering Culture
  • Learn how a Computer Works
  • Data Network Transmission
  • Security and Privacy
    • SSL Public and Private Key Certificates
    • JSON Web Tokens
    • GDPR regulations
  • Linux
    • OS Basics
    • Shell scripting
    • Cron Jobs
    • Package Management
  • Docker
    • What is Docker and How it Works
      • Don't Mess Up Your System
      • Preconfigured Images
      • Take it With You
      • Kubernetes Container Deployment
      • How to Create Start and Stop a Container
      • Docker Micro Services
      • Kubernetes
      • Why and How To Do Docker Container Orchestration
      • Useful Docker Commands
  • The Cloud
    • IaaS vs PaaS vs SaaS
    • AWS Azure IBM Google
    • Cloud vs On-Premises
    • Security
    • Hybrid Clouds
  • Security Zone Design
    • How to secure a multi layered application
    • Cluster security with Kerberos

Advanced Engineering Skills

  • Data Science Platform
    • Why a Good Data Platform Is Important
    • Big Data vs Data Science and Analytics
    • The 4 Vs of Big Data
    • Why Big Data
      • Planning is Everything
      • The Problem with ETL
      • Scaling Up
      • Scaling Out
      • When not to Do Big Data
  • Hadoop Platforms
    • What is Hadoop
    • What makes Hadoop so popular
    • Hadoop Ecosystem Components
    • Hadoop is Everywhere?
    • Should You Learn Hadoop?
    • How to Select Hadoop Cluster Hardware
  • Connect
    • REST APIs
      • API Design
      • Implementation Frameworks
      • Security
    • Apache Nifi
    • Logstash
  • Buffer
    • Apache Kafka
      • Why a Message Queue Tool?
      • Kafka Architecture
      • Kafka Topics
      • Kafka and Zookeeper
      • How to Produce and Consume Messages
      • Kafka Commands
    • Redis Pub-Sub
    • AWS Kinesis
    • Google Cloud PubSub
  • Processing Frameworks
    • Lambda and Kappa Architecture
    • Batch Processing
    • Stream Processing
      • Three Methods of Streaming
      • At Least Once
      • At Most Once
      • Exactly Once
      • Check The Tools
    • Should You do Stream or Batch Processing
    • Is ETL still relevant for Analytics?
    • MapReduce
      • How Does MapReduce Work
      • MapReduce
      • MapReduce Example
      • MapReduce Limitations
    • Apache Spark
      • What is the Difference to MapReduce?
      • How Spark Fits to Hadoop
      • Spark vs Hadoop
      • Spark and Hadoop a Perfect Fit
      • Spark on YARN
      • My Simple Rule of Thumb
      • Available Languages
      • Spark Driver Executor and SparkContext
      • Spark Batch vs Stream processing
      • How Spark uses Data From Hadoop
      • What are RDDs and How to Use Them
      • SparkSQL How and Why to Use It
      • What are Dataframes and How to Use Them
      • Machine Learning on Spark (TensorFlow)
      • MLlib
      • Spark Setup
      • Spark Resource Management
    • AWS Lambda
    • Apache Flink
    • Elasticsearch
    • Apache Drill
    • StreamSets
  • Store
    • Data Warehouse vs Data Lake
    • SQL Databases
      • PostgreSQL DB
      • Database Design
      • SQL Queries
      • Stored Procedures
      • ODBC/JDBC Server Connections
    • NoSQL Stores
      • HBase KeyValue Store
      • HDFS Document Store
      • MongoDB Document Store
      • Elasticsearch Document Store
      • Hive Warehouse
      • Impala
      • Kudu
      • Apache Druid
      • InfluxDB Time Series Database
      • Greenplum MPP Database
  • Visualize
    • Android and iOS
    • API Design for Mobile Apps
    • Dashboards
      • Grafana
      • Kibana
    • Webservers
      • Tomcat
      • Jetty
      • NodeRED
      • React
    • Business Intelligence Tools
      • Tableau
      • Power BI
      • Qlik Sense
    • Identity & Device Management
      • What Is A Digital Twin
      • Active Directory
  • Machine Learning
    • How to do Machine Learning in production
    • Why machine learning in production is harder than you think
    • Models Do Not Work Forever
    • Where are The Platforms That Support Machine Learning
    • Training Parameter Management
    • How to Convince People That Machine Learning Works
    • No Rules No Physical Models
    • You Have The Data. Use It!
    • Data is Stronger Than Opinions
    • AWS SageMaker

Hands On Course

  • What We Want To Do
  • Thoughts On Choosing A Development Environment
  • A Look Into the Twitter API
  • Ingesting Tweets with Apache Nifi
  • Writing from Nifi to Apache Kafka
  • Apache Zeppelin Data Processing
    • Install and Ingest Kafka Topic
    • Processing Messages with Spark & SparkSQL
    • Visualizing Data
  • Switch Processing from Zeppelin to Spark

Case Studies

  • Data Science @Airbnb
  • Data Science @Amazon
  • Data Science @Baidu
  • Data Science @Blackrock
  • Data Science @BMW
  • Data Science @Booking.com
  • Data Science @CERN
  • Data Science @Disney
  • Data Science @DLR
  • Data Science @Drivetribe
  • Data Science @Dropbox
  • Data Science @eBay
  • Data Science @Expedia
  • Data Science @Facebook
  • Data Science @Google
  • Data Science @Grammarly
  • Data Science @ING Fraud
  • Data Science @Instagram
  • Data Science @LinkedIn
  • Data Science @Lyft
  • Data Science @NASA
  • Data Science @Netflix
  • Data Science @OLX
  • Data Science @OTTO
  • Data Science @PayPal
  • Data Science @Pinterest
  • Data Science @Salesforce
  • Data Science @Siemens Mindsphere
  • Data Science @Slack
  • Data Science @Spotify
  • Data Science @Symantec
  • Data Science @Tinder
  • Data Science @Twitter
  • Data Science @Uber
  • Data Science @Upwork
  • Data Science @Woot
  • Data Science @Zalando

Best Practices Cloud Platforms

  • Amazon Web Services (AWS)
    • Connect
    • Buffer
    • Processing
    • Store
    • Visualize
    • Containerization
    • Best Practices
    • More Details
  • Microsoft Azure
    • Connect
    • Buffer
    • Processing
    • Store
    • Visualize
    • Containerization
    • Best Practices
  • Google Cloud Platform (GCP)
    • Connect
    • Buffer
    • Processing
    • Store
    • Visualize
    • Containerization
    • Best Practices

130+ Free Data Sources For Data Science

  • General And Academic
  • Content Marketing
  • Crime
  • Drugs
  • Education
  • Entertainment
  • Environmental And Weather Data
  • Financial And Economic Data
  • Government And World
  • Health
  • Human Rights
  • Labor And Employment Data
  • Politics
  • Retail
  • Social
  • Travel And Transportation
  • Various Portals
  • Source Articles and Blog Posts
  • Free Data Sources Data Science

1001 Interview Questions

  • Interview Questions

Recommended Books and Courses

  • About Books and Courses
  • Books
    • Languages
      • Java
      • Python
      • Scala
      • Swift
    • Data Science Tools
      • Apache Spark
      • Apache Kafka
      • Apache Hadoop
      • Apache HBase
    • Business
      • The Lean Startup
      • Zero to One
      • The Innovator's Dilemma
      • Crossing the Chasm
      • Crush It!
    • Community Recommendations
      • Designing Data-Intensive Applications
  • Online Courses
    • Machine Learning Stanford
    • Computer Networking
    • Spring Framework
    • iOS App Development Specialization

How To Contribute

If you have some cool links or topics for the cookbook, please become a contributor.

Simply pull the repo, add your ideas and create a pull request. You can also open an issue and put your thoughts there.

Please use the "Issues" function for comments.

Support

Everything is free, but please support what you like! Join my Patreon and become a plumber yourself: Link to my Patreon

Or support me through Paypal.me and send a message that I will read on the next livestream: Link to my Paypal.me/feedthestream

Important Links

Subscribe to my Plumbers of Data Science YouTube channel for regular updates: Link to YouTube

Check out my blog and get updates by email by joining my mailing list: andreaskretz.com

I have a Medium publication where you can publish your data engineering articles to reach more people: Medium publication


*(As an Amazon Associate I earn from qualifying purchases from Amazon. This is free of charge for you, but super helpful for supporting this channel.)