data-engineer-handbook
This is a repo with links to everything you'd ever want to learn about data engineering
This commit corrects the homework markdown, which misnames the fields used as the primary key of the actor_films table: the document references actor_id and film_id instead of actorid and filmid.
The server.py file initializes Statsig with `statsig.initialize(API_KEY)`. If the `API_KEY` is invalid or Statsig initialization fails for any reason, it will crash the application. It needs exception handling.
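One possible shape for that exception handling is sketched below. It assumes the app can keep running without Statsig; `safe_initialize` and the logger name are hypothetical and not part of the repo. The initializer is passed in as a callable so the sketch does not depend on the Statsig SDK being installed.

```python
import logging

logger = logging.getLogger("server")  # hypothetical logger name


def safe_initialize(init_fn, api_key):
    """Call init_fn(api_key); return True on success, False on any failure."""
    try:
        init_fn(api_key)
    except Exception:
        # Record the full traceback, but do not let the failure take the app down.
        logger.exception("Statsig initialization failed; continuing without it")
        return False
    return True
```

In `server.py` this might be called as `statsig_ready = safe_initialize(statsig.initialize, API_KEY)`, with routes checking `statsig_ready` before they touch Statsig. Whether to continue or to fail fast with a clear message is a design choice the issue leaves open.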
In the `team_vertex_job.py` file, the `main` function incorrectly attempts to write the output DataFrame into a table named `"players_scd"`, which is likely meant for the players SCD job. This will...
The `start_job.py` file creates a Kafka sink named `process_events_kafka`, while `aggregation_job.py` creates a Kafka source with the same name `process_events_kafka`. This can cause confusion and potentially lead to the aggregation...
Add Data Engineering Whitepapers: https://www.ssp.sh/brain/data-engineering-whitepapers/
## Issue Wanted to call out that this says Scholar Spark instead of Scala Spark 
## Tables Not loaded using Docker 1. Copy your .dump file into the container `docker cp .\data.dump my-postgres-container:/tmp/data.dump` 2. Run pg_restore inside the container `docker exec -it my-postgres-container pg_restore -U...
The map function call `filtered_tasks = ''.join(map(lambda a: ...)` in the `/tasks` route of `server.py` is creating a string by joining a list of strings, which is correct, but it...
In the `server.py` file, the signup and task routes generate user IDs using `hash(hash_string)`. The hash function is not guaranteed to produce the same hash value across different Python processes...
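A minimal sketch of a process-independent alternative, assuming the ID only needs to be a stable string derived from `hash_string`. `stable_user_id` is a hypothetical helper, not a function in the repo. The built-in `hash()` is salted per process for `str` values unless `PYTHONHASHSEED` is fixed, while a `hashlib` digest is not.

```python
import hashlib


def stable_user_id(hash_string: str) -> str:
    """Return a hex digest that is identical in every Python process."""
    # SHA-256 depends only on the input bytes, never on per-process hash salting.
    return hashlib.sha256(hash_string.encode("utf-8")).hexdigest()
```

If the routes expect an integer ID, `int(stable_user_id(s), 16)` yields one, though the value is far larger than what `hash()` returns.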