bdit_data-sources
bdit_data-sources copied to clipboard
WYS Sheets: Some wards not being inserted
Wards: 5,8,14,20 are missing data.
I suspect this is because of the primary key constraint on (installation_date, new_sign_number)
. Unfortunately nobody ever checked to see the data was actually successfully inserted.
This is the log output for Ward 14
[2020-12-06 15:00:25,395] {logging_mixin.py:112} INFO - [2020-12-06 15:00:25,395] {wys_google_sheet.py:84} INFO - ward_14 does not have any row with different date format
[2020-12-06 15:00:25,396] {logging_mixin.py:112} INFO - [2020-12-06 15:00:25,396] {wys_google_sheet.py:91} INFO - Uploaded 232 rows to PostgreSQL for ward_14
[2020-12-06 15:00:25,404] {logging_mixin.py:112} INFO - [2020-12-06 15:00:25,404] {wys_google_sheet.py:100} ERROR - Duplicates of primary keys in ward_14 !!!
[2020-12-06 15:00:25,405] {logging_mixin.py:112} INFO - [2020-12-06 15:00:25,404] {wys_google_sheet.py:101} ERROR - ON CONFLICT DO UPDATE command cannot affect row a second time
HINT: Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
- [x] the successful message should happen after data is successfully inserted
- [x]
Duplicates of primary keys in ward_14 !!!
is hard coded for any error in insertion, which is not very helpful.
Testing on Ward 14.
ALTER TABLE wys.ward_14
DROP CONSTRAINT ward_14_pkey;
--INSERT script needs to have some constraint on the table.
ALTER TABLE wys.ward_14
ADD CONSTRAINT ward_14_pkey PRIMARY KEY (location, from_street, to_street, direction, installation_date, removal_date, new_sign_number)
Had to modify dev
script to change the insert condition (I should've just removed both and then insert would have failed on the other tables but oh well... insert still fails on the other tables).
insert = sql.SQL('''INSERT INTO {}.{} (ward_no, location, from_street, to_street, direction, installation_date, removal_date,
new_sign_number, comments, confirmed) VALUES %s
ON CONFLICT (location, from_street, to_street, direction, installation_date, removal_date, new_sign_number)
DO UPDATE SET
removal_date=EXCLUDED.removal_date, new_sign_number=EXCLUDED.new_sign_number, comments=EXCLUDED.comments
''').format(sql.Identifier(schema_name), sql.Identifier(table_name))
Running DAG manually.
Finding all duplicates with
SELECT ward_no, location, from_street, to_street, direction, installation_date, removal_date, new_sign_number, comments
FROM wys.mobile_sign_installations
NATURAL JOIN (SELECT ward_no, new_sign_number, installation_date
FROM wys.mobile_sign_installations
GROUP BY ward_no, new_sign_number, installation_date
HAVING COUNT(*)> 1) dupes
ORDER BY ward_no, installation_date;
Sent to team to update.
Ran the following to clean up the existing data
-- BEGIN;
-- UPDATE wys.mobile_sign_installations
-- SET --ward_no=?, location=?, from_street=?, to_street=?, direction=?, installation_date=?, removal_date=?,
-- new_sign_number='3'
-- WHERE (ward_no, location, from_street, to_street, direction, installation_date, removal_date, new_sign_number)=(5,'Ralph St','Macdonald Ave','Lawrence Ave W','SB','2019-01-10','2019-02-01','2');
-- COMMIT;
-- BEGIN;
-- UPDATE wys.mobile_sign_installations
-- SET --ward_no=?, location=?, from_street=?, to_street=?, direction=?, installation_date=?, removal_date=?,
-- new_sign_number='6'
-- WHERE (ward_no, location, from_street, to_street, direction, installation_date, removal_date, new_sign_number)=
-- (8,'Fairlawn Ave','Greer Rd','Elm Rd','EB','2018-12-06','2019-01-04','1');
-- COMMIT;
-- BEGIN;
-- DELETE FROM wys.mobile_sign_installations
-- WHERE (ward_no, location, from_street, to_street, direction, installation_date, removal_date, new_sign_number)=
-- (8,'Playfair Ave','Dufferin St','Haven Rd','WB','2019-01-04','2019-01-24','1');
-- COMMIT;
-- BEGIN;
-- UPDATE wys.mobile_sign_installations
-- SET --ward_no=?, location=?, from_street=?, to_street=?, direction=?, installation_date=?, removal_date=?,
-- new_sign_number='6'
-- WHERE (ward_no, location, from_street, to_street, direction, installation_date, removal_date, new_sign_number)=
-- (8,'Fairlawn Ave','Greer Rd','Elm Rd','WB','2019-01-04','2019-01-25','1'
-- );
-- COMMIT;
-- BEGIN;
-- UPDATE wys.mobile_sign_installations
-- SET --ward_no=?, location=?, from_street=?, to_street=?, direction=?, installation_date=?, removal_date=?,
-- new_sign_number='2'
-- WHERE (ward_no, location, from_street, to_street, direction, installation_date, removal_date, new_sign_number)=
-- (14,'Floyd Ave','Pape','Donlands Ave. ','EB','2019-01-09','2019-02-01','1'
-- );
-- COMMIT;
-- BEGIN;
-- UPDATE wys.mobile_sign_installations
-- SET --ward_no=?, location=?, from_street=?, to_street=?, direction=?, installation_date=?, removal_date=?,
-- new_sign_number='2'
-- WHERE (ward_no, location, from_street, to_street, direction, installation_date, removal_date, new_sign_number)=
-- (14,'Floyd Ave','Pape','Donlands Ave. ','WB','2019-02-01','2019-02-20','1'
-- );
-- COMMIT;
-- BEGIN;
-- DELETE FROM wys.mobile_sign_installations
-- WHERE (ward_no, location, from_street, to_street, direction, installation_date, removal_date, new_sign_number)=
-- (20,'Kennedy Rd','Transway Cres','Merrian Rd','SB','2019-07-03','2019-07-25','4'
-- );
-- COMMIT;
-- BEGIN;
-- DELETE FROM wys.mobile_sign_installations
-- WHERE (ward_no, location, from_street, to_street, direction, installation_date, removal_date, new_sign_number)=
-- (20,'Aylesworth Ave','Highview Ave','Lillington Ave','NB','2020-01-06','2020-01-06','3'
-- );
-- COMMIT;
https://github.com/CityofToronto/bdit_data-sources/commit/b23abd8b770a97186bc99d9e0220885f19372eac
The above commit creates a separate table to rows that fail the duplicate check (signs for a given ward that have the same installation date).
@KatiRG it would be nice if there was a separate airflow task in this dag that did a SELECT
from that table and sent a Slack message if there are rows that are in that table so we can email whoever is responsible for maintaining the spreadsheet about it.
In order to ensure data was aggregated up to OpenData
Identified months that were missing mobile data for individual wards
WITH summary AS(
SELECT ward_no, date_trunc('month', removal_date)::DATE mon, COUNT(1)
FROM wys.mobile_summary
GROUP BY ward_no, mon
)
,installs AS(
SELECT ward_no, date_trunc('month', removal_date)::DATE mon, COUNT(1)
FROM wys.mobile_sign_installations
GROUP BY ward_no, mon
)
SELECT ward_no, mon, installs.count, summary.count
FROM installs
LEFT OUTER JOIN summary USING(ward_no, mon)
WHERE summary.count IS NULL AND mon < '2021-04-01'
ORDER BY mon
Ran a version of the wys.mobile_summary_for_month
function after refreshing the mobile_api_id
view using the following WHERE
clauses instead of running the function by month.
WHERE removal_date < '2020-12-01' AND ward_no IN (5,8,14,20)
-- some other wards that were missing months of data
WHERE removal_date >= '2020-04-01' AND removal_date < '2020-12-01' AND ward_no IN (2,10)
SELECT COUNT(1) = 0 FROM wys.mobile_sign_installations_dupes;
returns true
so I tried using SQLCheckOperator which is supposed to raise an error message if False
.
But this dag returns an error message even though the query returns true:
task1 = SQLCheckOperator(
task_id='check_wys_mobile_dupes',
sql='''SELECT COUNT(1) = 0 FROM wys.mobile_sign_installations_dupes''',
postgres_conn_id='wys_bot',
#use_legacy_sql=False,
dag=DUPES_DAG
)