bdit_data-sources
Refactor weather pipelines
We should have a `weather` folder in this repo and a corresponding `weather` schema to hold our weather tables.
Python package for pulling env canada data https://pypi.org/project/env-canada/#description
Historical Daily Weather script here: https://github.com/Toronto-Big-Data-Innovation-Team/activeto/blob/jasonlee/weekend_closures/scripts/import_weather.py
Things to add/change:
- [x] change destination table to a table in `weather` schema
- [x] change table name to `historical_daily`
- [x] change the script to run daily and not monthly
- [ ] create a DAG that runs daily with separate tasks for 1) pulling the data and 2) inserting the data into our database, plus a Slack error alert as the failure callback
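A rough sketch of the pull/insert split with a failure callback, in plain Python rather than Airflow so it runs on its own. Everything here is hypothetical: `pull_weather`, `insert_weather`, `run_daily`, the placeholder values, and the list standing in for the database table. In the real DAG each step would be its own task and `on_failure` would post to Slack.

```python
from datetime import date


def pull_weather(day):
    """Hypothetical pull step; a real task would query Environment Canada."""
    # Placeholder values for illustration only, not real observations.
    return {"dt": day.isoformat(), "temp_max": 0.0, "temp_min": 0.0}


def insert_weather(row, table):
    """Hypothetical insert step; a list stands in for the database table."""
    table.append(row)


def run_daily(day, table, pull=pull_weather, on_failure=print):
    """Run pull then insert; report any error through on_failure and re-raise."""
    try:
        row = pull(day)             # step 1: pull the data
        insert_weather(row, table)  # step 2: insert the data
    except Exception as exc:
        # In the DAG this is where the Slack alert would be sent.
        on_failure(f"weather pipeline failed for {day}: {exc}")
        raise
```

Keeping the two steps separate means a failed insert can be retried without pulling again.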
Currently the historical table has the following columns:
weather_uid, climate_id, dt, temp_max, temp_min, temp_mean, total_precip_mm
@tankedman mentioned that condition (e.g. Partly Cloudy) and wind speed are also available, so we are adding those to the table as well.
Wondering if the `weather_uid` and `climate_id` columns are necessary 🤔
Can you also add a unique constraint for `dt` on the table so we don't insert duplicated data? As well as adding an index on `dt`. Thanks!!
Just fyi: Weather_uid and Climate_id are references to the Environment Canada database. Yes, will add new columns and impose UNIQUE on dt.
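A minimal sketch of what the UNIQUE constraint on `dt` does, using an in-memory SQLite database as a stand-in for our actual database (an assumption, as are the constraint name and the placeholder values; SQLite has no `weather` schema here, so the table name is unqualified). The column list is trimmed to what the thread mentions. In both SQLite and PostgreSQL a UNIQUE constraint is backed by a unique index, so the constraint also covers the index request.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE historical_daily (
        dt TEXT NOT NULL,
        temp_max REAL,
        temp_min REAL,
        temp_mean REAL,
        total_precip_mm REAL,
        CONSTRAINT historical_daily_dt_key UNIQUE (dt)
    )
    """
)

# First insert for a date succeeds (placeholder values, not real data).
conn.execute("INSERT INTO historical_daily (dt, temp_max) VALUES ('2021-06-01', 0.0)")

# A second insert for the same date violates the constraint.
try:
    conn.execute("INSERT INTO historical_daily (dt, temp_max) VALUES ('2021-06-01', 1.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The constraint is enforced through an automatically created unique index.
indexes = conn.execute("PRAGMA index_list('historical_daily')").fetchall()
row_count = conn.execute("SELECT COUNT(*) FROM historical_daily").fetchone()[0]
```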
Created two tables in `weather` schema:
- `historical_daily`: tracks weather on a daily basis, will be pulled at end of day by script
- `prediction_daily`: tracks weather prediction on a daily basis, based on the prediction from the previous day
ahh I see, yea I think we can exclude the weather_uid and climate_id
Added a `weather_bot` for the DAG, connection added on Airflow.
- Unable to access historical weather classes `ECHistorical` and `ECHistoricalRange` in the `env_canada` python package, so currently only able to pull the current day's weather. Likely have to scrape Environment Canada manually.
- `env_canada` package only able to get next 5 days of forecast. Will modify `prediction_import.py` to pull 5 days at a time, overwriting previous dates.
As per discussion with @tahaislam @tankedman, modifications needed :meow_salute: :
Prediction script:
- [x] change from pulling 1 day of data to pull for all 5 available days
- [x] add a column for `date_pull` (date) in the prediction table
- [x] insert 5 days of data with `upsert` script, overwriting data with the same date and updating the column `date_pull`
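The upsert behaviour described above could look roughly like this. It is a sketch on an in-memory SQLite database, not the actual `prediction_import.py`: the `dt` and `temp_max` columns, the `upsert_forecast` helper, and the assumption that the 5 forecast days start the day after the pull are all hypothetical. PostgreSQL accepts the same `ON CONFLICT ... DO UPDATE` form.

```python
import sqlite3
from datetime import timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE prediction_daily (
        dt TEXT PRIMARY KEY,      -- forecast date, one row per date
        temp_max REAL,
        date_pull TEXT NOT NULL   -- date the forecast was pulled
    )
    """
)

UPSERT = """
INSERT INTO prediction_daily (dt, temp_max, date_pull)
VALUES (?, ?, ?)
ON CONFLICT (dt) DO UPDATE SET
    temp_max = excluded.temp_max,
    date_pull = excluded.date_pull
"""


def upsert_forecast(conn, pull_day, temps):
    """Insert one row per forecast day, overwriting rows that share a date."""
    rows = [
        ((pull_day + timedelta(days=i + 1)).isoformat(), temp, pull_day.isoformat())
        for i, temp in enumerate(temps)
    ]
    conn.executemany(UPSERT, rows)
    conn.commit()
```

Running the pull on two consecutive days leaves one row per forecast date: the overlapping dates carry the newer values and the newer `date_pull`, while the oldest date keeps its original pull date.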