bdit_data-sources
bdit_data-sources copied to clipboard
WYS Aggregation fails when inserting NULL as a volume
[2022-10-05 15:39:04,339] {wys_api.py:363} CRITICAL - Error aggregating data to 1-hour bins [2022-10-05 15:39:04,339] {wys_api.py:364} CRITICAL - null value in column "volume" violates not-null constraint DETAIL: Failing row contains (179811788, 1750, 2022-10-04 00:00:00, 1, null). CONTEXT: SQL statement "WITH insert_data AS ( --Aggregated into speed bins and 1 hour bin INSERT INTO wys.speed_counts_agg_5kph (api_id, datetime_bin, speed_id, volume) SELECT api_id, date_trunc('hour', datetime_bin) dt, speed_id, sum(count) AS volume FROM wys.raw_data INNER JOIN wys.speed_bins_old ON speed <@ speed_bin WHERE speed_count_uid IS NULL GROUP BY api_id, dt, speed_id RETURNING speed_counts_agg_5kph_id, api_id, datetime_bin, speed_id )
UPDATE wys.raw_data A SET speed_count_uid=B.speed_counts_agg_5kph_id FROM insert_data B WHERE A.speed_count_uid IS NULL AND A.api_id=B.api_id AND A.datetime_bin >= B.datetime_bin AND A.datetime_bin < B.datetime_bin + INTERVAL '1 hour'" PL/pgSQL function wys.aggregate_speed_counts_one_hour_5kph() line 6 at SQL statement
Few issues:
- [x] Based on some analysis of
raw_data
, there is no data being inserted after Oct 4th. Try re-pulling. - [x] The CTE filter
WHERE speed_count_uid IS NULL
should include the execution date - [x]
date_trunc('hour', datetime_bin) dt
labelling a date-hour asdt
is confusing,dt
implies "date" - [x] add another filter to the
WHERE
clause in the CTE to"count" IS NOT NULL
. When we revisit this to better deal with signs being out we can investigate using these NULL values somehow https://github.com/CityofToronto/bdit_data-sources/blob/aa21a3129ee18efbe6550a422ecdd5a9ddf96b4b/wys/api/sql/function-aggregate_speed_counts_one_hour_5kph.sql#L23 - [x] the error message for the aggregation query should specify that the inserted data is uncommitted to make it clear that the first insert gets rolled back on error https://github.com/CityofToronto/bdit_data-sources/blob/aa21a3129ee18efbe6550a422ecdd5a9ddf96b4b/wys/api/python/wys_api.py#L355-L362
Was able to query the API directly in the investgate_settings_pull.ipynb
It appears the API is now returning values like:
{'LocInfo': {'Location': {'name': 'Seneca Hill Dr SB 16101198',
'address': '669 Seneca Hill Drive',
'state': 'Ontario',
'country': 'Canada',
'zip': 'M2J 4X9',
'geocode': '(43.792496297754695,-79.35067848878938)'},
'Parameters': {'date': '2022-10-04',
'from': '00:00',
'to': '00:59',
'speed_type': 'km/h'},
'raw_records': [{'datetime': '2022-10-04 00:04:00',
'counter': [{'speed': '0', 'count': ''}]},
{'datetime': '2022-10-04 00:09:00',
'counter': [{'speed': '0', 'count': ''}]},
{'datetime': '2022-10-04 00:14:00',
'counter': [{'speed': '47', 'count': '1'},
{'speed': '37', 'count': '1'},
{'speed': '33', 'count': '2'}]},
{'datetime': '2022-10-04 00:19:00',
'counter': [{'speed': '0', 'count': ''}]},
{'datetime': '2022-10-04 00:24:00',
'counter': [{'speed': '0', 'count': ''}]},
{'datetime': '2022-10-04 00:29:00',
'counter': [{'speed': '0', 'count': ''}]},
{'datetime': '2022-10-04 00:34:00',
'counter': [{'speed': '0', 'count': ''}]},
{'datetime': '2022-10-04 00:39:00',
'counter': [{'speed': '38', 'count': '1'}]},
{'datetime': '2022-10-04 00:44:00',
'counter': [{'speed': '37', 'count': '1'}]},
{'datetime': '2022-10-04 00:49:00',
'counter': [{'speed': '0', 'count': ''}]},
{'datetime': '2022-10-04 00:54:00',
'counter': [{'speed': '0', 'count': ''}]},
{'datetime': '2022-10-04 00:59:00',
'counter': [{'speed': '39', 'count': '1'}]}]}}
Where before it would return NULL
for speed as well
SELECT api_id, datetime_bin, speed, count, speed_count_uid
FROM wys.raw_data
WHERE datetime_bin >= '2022-10-03' AND "count" is NULL
With the old API the INNER JOIN
in the aggregation would be empty so therefore SUM
wouldn't be trying to SUM(NULL)
which results in NULL
On top of the enhancement listed in the comment above:
There are a couple of functions under wys.aggregate_speed_counts_one_hour_5kph()
, one without param, one with _mon
and one with _start_date
and _end_date
. I believe only wys.aggregate_speed_counts_one_hour_5kph()
is used in the wys_api.py, can I get rid of the other ones? Mainly the _mon
one so I can add execution date as a param @radumas
Good point.
There's 2 different sql files committed to master at tthe moment. https://github.com/CityofToronto/bdit_data-sources/blob/master/wys/api/sql/function-aggregate_speed_counts_one_hour_5kph.sql contains the function that gets used and that _mon
one.
I feel like there should primarily be a _start_date
and _end_date
function which the airflow DAG could use, with the caveat that it's not updating the raw_data with what rows have been inserted.
Why without the updating raw_data part?
The function should do that, IDK why that one doesn't
ahh ok I will create one with _start_date and _end_date with the updating raw_data part and use that in the wys_api.py for da DAG