bdit_data-sources icon indicating copy to clipboard operation
bdit_data-sources copied to clipboard

WYS Aggregation fails when inserting NULL as a volume

Open radumas opened this issue 1 year ago • 6 comments

[2022-10-05 15:39:04,339] {wys_api.py:363} CRITICAL - Error aggregating data to 1-hour bins [2022-10-05 15:39:04,339] {wys_api.py:364} CRITICAL - null value in column "volume" violates not-null constraint DETAIL: Failing row contains (179811788, 1750, 2022-10-04 00:00:00, 1, null). CONTEXT: SQL statement "WITH insert_data AS ( --Aggregated into speed bins and 1 hour bin INSERT INTO wys.speed_counts_agg_5kph (api_id, datetime_bin, speed_id, volume) SELECT api_id, date_trunc('hour', datetime_bin) dt, speed_id, sum(count) AS volume FROM wys.raw_data INNER JOIN wys.speed_bins_old ON speed <@ speed_bin WHERE speed_count_uid IS NULL GROUP BY api_id, dt, speed_id RETURNING speed_counts_agg_5kph_id, api_id, datetime_bin, speed_id )

UPDATE wys.raw_data A SET speed_count_uid=B.speed_counts_agg_5kph_id FROM insert_data B WHERE A.speed_count_uid IS NULL AND A.api_id=B.api_id AND A.datetime_bin >= B.datetime_bin AND A.datetime_bin < B.datetime_bin + INTERVAL '1 hour'" PL/pgSQL function wys.aggregate_speed_counts_one_hour_5kph() line 6 at SQL statement

Few issues:

  • [x] Based on some analysis of raw_data, there is no data being inserted after Oct 4th. Try re-pulling.
  • [x] The CTE filter WHERE speed_count_uid IS NULL should include the execution date
  • [x] date_trunc('hour', datetime_bin) dt labelling a date-hour as dt is confusing, dt implies "date"
  • [x] add another filter to the WHERE clause in the CTE to "count" IS NOT NULL. When we revisit this to better deal with signs being out we can investigate using these NULL values somehow https://github.com/CityofToronto/bdit_data-sources/blob/aa21a3129ee18efbe6550a422ecdd5a9ddf96b4b/wys/api/sql/function-aggregate_speed_counts_one_hour_5kph.sql#L23
  • [x] the error message for the aggregation query should specify that the inserted data is uncommitted to make it clear that the first insert gets rolled back on error https://github.com/CityofToronto/bdit_data-sources/blob/aa21a3129ee18efbe6550a422ecdd5a9ddf96b4b/wys/api/python/wys_api.py#L355-L362

radumas avatar Oct 12 '22 14:10 radumas

Was able to query the API directly in the investgate_settings_pull.ipynb

It appears the API is now returning values like:

{'LocInfo': {'Location': {'name': 'Seneca Hill Dr SB 16101198',
   'address': '669 Seneca Hill Drive',
   'state': 'Ontario',
   'country': 'Canada',
   'zip': 'M2J 4X9',
   'geocode': '(43.792496297754695,-79.35067848878938)'},
  'Parameters': {'date': '2022-10-04',
   'from': '00:00',
   'to': '00:59',
   'speed_type': 'km/h'},
  'raw_records': [{'datetime': '2022-10-04 00:04:00',
    'counter': [{'speed': '0', 'count': ''}]},
   {'datetime': '2022-10-04 00:09:00',
    'counter': [{'speed': '0', 'count': ''}]},
   {'datetime': '2022-10-04 00:14:00',
    'counter': [{'speed': '47', 'count': '1'},
     {'speed': '37', 'count': '1'},
     {'speed': '33', 'count': '2'}]},
   {'datetime': '2022-10-04 00:19:00',
    'counter': [{'speed': '0', 'count': ''}]},
   {'datetime': '2022-10-04 00:24:00',
    'counter': [{'speed': '0', 'count': ''}]},
   {'datetime': '2022-10-04 00:29:00',
    'counter': [{'speed': '0', 'count': ''}]},
   {'datetime': '2022-10-04 00:34:00',
    'counter': [{'speed': '0', 'count': ''}]},
   {'datetime': '2022-10-04 00:39:00',
    'counter': [{'speed': '38', 'count': '1'}]},
   {'datetime': '2022-10-04 00:44:00',
    'counter': [{'speed': '37', 'count': '1'}]},
   {'datetime': '2022-10-04 00:49:00',
    'counter': [{'speed': '0', 'count': ''}]},
   {'datetime': '2022-10-04 00:54:00',
    'counter': [{'speed': '0', 'count': ''}]},
   {'datetime': '2022-10-04 00:59:00',
    'counter': [{'speed': '39', 'count': '1'}]}]}}

Where before it would return NULL for speed as well

SELECT api_id, datetime_bin, speed, count, speed_count_uid
	FROM wys.raw_data
    WHERE datetime_bin >= '2022-10-03' AND "count" is NULL

image

With the old API the INNER JOIN in the aggregation would be empty so therefore SUM wouldn't be trying to SUM(NULL) which results in NULL

On top of the enhancement listed in the comment above:

radumas avatar Oct 13 '22 16:10 radumas

There are a couple of functions under wys.aggregate_speed_counts_one_hour_5kph(), one without param, one with _mon and one with _start_date and _end_date. I believe only wys.aggregate_speed_counts_one_hour_5kph() is used in the wys_api.py, can I get rid of the other ones? Mainly the _mon one so I can add execution date as a param @radumas

chmnata avatar Oct 17 '22 17:10 chmnata

Good point.

There's 2 different sql files committed to master at tthe moment. https://github.com/CityofToronto/bdit_data-sources/blob/master/wys/api/sql/function-aggregate_speed_counts_one_hour_5kph.sql contains the function that gets used and that _mon one.

I feel like there should primarily be a _start_date and _end_date function which the airflow DAG could use, with the caveat that it's not updating the raw_data with what rows have been inserted.

radumas avatar Oct 17 '22 17:10 radumas

Why without the updating raw_data part?

chmnata avatar Oct 17 '22 18:10 chmnata

The function should do that, IDK why that one doesn't

radumas avatar Oct 17 '22 18:10 radumas

ahh ok I will create one with _start_date and _end_date with the updating raw_data part and use that in the wys_api.py for da DAG

chmnata avatar Oct 17 '22 18:10 chmnata