bdit_data-sources
Make the GCC puller work in Airflow
It will run on the On-Prem Dev server and send data to both the on-prem database and our RDS.
The current state of the code is at https://github.com/CityofToronto/bdit_data-sources/tree/gcc_view/gis/gccview
Layers that were already being pulled should continue to be pulled, and the pulls should probably be automated with an audit table.
Only the layers to be used for geoprocessing VFH data should be sent to the On-Prem database.
As part of this we will want to version the centreline and intersections datasets: a parent table of all the centreline records, child tables for each version of the centreline (where we add a pulled_date for each version), and then a view that will also show the most current version.
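A minimal sketch of that versioning scheme, assuming list-partitioning on pulled_date (all table, column, and view names below are placeholders, not the final design):

```sql
-- Sketch only: names and column types are assumptions.
-- Parent table holding every pulled version of the centreline.
CREATE TABLE IF NOT EXISTS gis_core.centreline (
    centreline_id bigint,
    fcode_desc text,
    geom geometry,          -- assumes PostGIS
    pulled_date date NOT NULL
) PARTITION BY LIST (pulled_date);

-- One child table per pulled version.
CREATE TABLE IF NOT EXISTS gis_core.centreline_20230101
    PARTITION OF gis_core.centreline
    FOR VALUES IN ('2023-01-01');

-- View exposing only the most recent version.
CREATE OR REPLACE VIEW gis_core.centreline_latest AS
SELECT *
FROM gis_core.centreline
WHERE pulled_date = (SELECT max(pulled_date) FROM gis_core.centreline);
```

Partitioning by pulled_date keeps each version queryable on its own while the view gives downstream users a stable "current" name.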
The query to the centreline API should filter out things we know are not streets in the fcode_desc
column:
'Creek/Tributary', 'Ferry Route','Geostatistical line', 'Hydro Line', 'Major Railway', 'Major Shoreline', 'Minor Shoreline (Land locked)', 'River'
It would probably be good to alert when new fcode_desc values appear that we do not currently have in the database.
I'm not sure what to propose for a similar filter for intersections.
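The filter-plus-alert idea for centreline could be sketched like this; the known non-street values come from the list above, while the known street values here are illustrative assumptions, not the API's full vocabulary:

```python
# Sketch: split incoming fcode_desc values into streets, known non-streets,
# and unrecognized values that should trigger an alert.
KNOWN_NON_STREET = {
    'Creek/Tributary', 'Ferry Route', 'Geostatistical line', 'Hydro Line',
    'Major Railway', 'Major Shoreline', 'Minor Shoreline (Land locked)', 'River',
}
# Assumed street values for illustration only.
KNOWN_STREET = {'Major Arterial', 'Minor Arterial', 'Collector', 'Local', 'Laneway'}

def classify(fcode_descs):
    """Return (streets, skipped, unknown) lists for a batch of fcode_desc values."""
    streets, skipped, unknown = [], [], []
    for desc in fcode_descs:
        if desc in KNOWN_NON_STREET:
            skipped.append(desc)      # filtered out of the pull
        elif desc in KNOWN_STREET:
            streets.append(desc)      # kept
        else:
            unknown.append(desc)      # new value: candidate for an alert
    return streets, skipped, unknown
```

Anything landing in `unknown` could be logged or raised as an Airflow task failure so new fcode_desc values get reviewed before they are silently dropped or kept.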
Current layers being pulled:
'0_0': ward, '0_2': centreline, '2_2': bike, '2_3': traffic_camera, '2_9': traffic_signal, '2_11': permit_parking, '2_35': prai_transit_shelter, '2_37': tmms_service_request, '2_38': bylaw_pt, '2_39': bylaw_line, '20_1': ev_charging_station, '22_1': day_care, '22_2': middle_child, '23_1': bia, '23_13': proposed_bia, '23_9': film_permit, '23_10': film_parking, '23_12': hotel, '26_1': convenience_store, '26_4': supermarket, '26_3': worship, '26_6': ymca, '26_7': census_tract, '26_11': neighbourhood_impro, '26_13': priority_neigh, '26_16': neigh_demo, '26_45': aborginal, '26_46': attraction, '26_47': dropin, '26_48': early_year, '26_49': family_resource, '26_50': food_bank, '26_53': long_term_care, '26_54': parenting_family_lit, '26_58': retirement, '26_59': senior_housing, '26_61': shelter, '26_62': social_housing, '27_13': private_road, '28_17': school, '28_28': library
Following a discussion with Sarah, I think we should have a `gis_core` schema for our core GIS layers; the rest of the mess can stay in `gis`.
gis_core
'0_0': ward, '0_2': centreline, '12_41': intersection, '26_7': census_tract, '26_11': neighbourhood_impro, '26_13': priority_neigh, '26_16': neigh_demo,
ptc
'0_0': ward,
'0_2': centreline,
'12_41': intersection,
'11_23': ibms_district,
'11_25': ibms_grid
gis
'2_2': bike, '2_3': traffic_camera, '2_11': permit_parking, '2_35': prai_transit_shelter, '2_37': tmms_service_request, '2_38': bylaw_pt, '2_39': bylaw_line, '2_46': loop_detector (new), '20_1': ev_charging_station, '22_1': day_care, '22_2': middle_child, '23_1': bia, '23_13': proposed_bia, '23_9': film_permit, '23_10': film_parking, '23_12': hotel, '26_1': convenience_store, '26_4': supermarket, '26_3': worship, '26_6': ymca, '26_45': aborginal, '26_46': attraction, '26_47': dropin, '26_48': early_year, '26_49': family_resource, '26_50': food_bank, '26_53': long_term_care, '26_54': parenting_family_lit, '26_58': retirement, '26_59': senior_housing, '26_61': shelter, '26_62': social_housing, '27_13': private_road, '28_17': school, '28_28': library
Remove:
~~'2_9': traffic_signal~~
This line needs to be updated with an `IF NOT EXISTS` clause:
https://github.com/CityofToronto/bdit_data-sources/blob/5f8197fba0966d28b3b7a74884e164c433adc358/gis/gccview/get_layer_gccview.py#L130
The preceding underscore can be removed. Also, this is a SQL injection vector.
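To close the injection vector, identifiers should be validated or quoted rather than interpolated directly into the statement. psycopg2's `sql.Identifier` is the proper tool for this; the helper below is only a minimal stdlib illustration of the idea, with the `IF NOT EXISTS` clause added:

```python
import re

def quote_ident(name: str) -> str:
    # Reject anything that is not a plain identifier, then double-quote it.
    # This is a sketch; psycopg2's sql.Identifier handles the general case.
    if not re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*', name):
        raise ValueError(f'suspicious identifier: {name!r}')
    return f'"{name}"'

def create_table_stmt(schema: str, table: str, columns_sql: str) -> str:
    # IF NOT EXISTS makes the pull re-runnable without dropping the table first.
    return (f'CREATE TABLE IF NOT EXISTS '
            f'{quote_ident(schema)}.{quote_ident(table)} ({columns_sql})')
```

A crafted table name like `ward; DROP TABLE x` is rejected instead of being spliced into the DDL.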
Added Loop Detector to the `gis` list.
After reviewing all the layers, I have noticed that '26_16': neigh_demo and '2_37': tmms_service_request can no longer be found on the REST API. Several other layers have had their layer_id changed, which I will update in the code.
Regarding 26_16 (neighbourhood demographics): apparently it is no longer going to be available, but we can request a CSV with the census parameters we want from someone in Community & Social Services.
The municipality layer is going to have to come from the Ontario government :meow_meh:
Update on progress:
- Audit tables and partition tables are now created, the audit table and partition table functions work, and the function has been tested on the `bqu` schema
Next Steps:
- [x] Create `gis_bot` for puller layers in the bigdata database
- [x] Add VFH layers to the `gis_core` schema in the bigdata database
- [x] Manually pull the municipality layer to the `gis_core` schema for now
- [x] Modify the function to include click options
- [x] Include error catching
- [x] Update the DAG's task creation loop
CodeFactor says that I need to set `verify` to `True` when requesting information from the REST API:
`r = requests.get(url, verify=True)`
But when I tried setting `verify` to `True`, it didn't work and raised SSLError: HTTPSConnectionPool(host='insideto-gis.toronto.ca', port=443): Max retries exceeded with url: /arcgis/rest/services/cot_geospatial/MapServer/layers?f=json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))
Is there any way to make the verification work, or would the call be to ignore CodeFactor on this issue? @radumas
Good catch.
Unfortunately, because the site is internal, the SSL certificate doesn't seem to be properly signed/recognized, so I think we can ignore the error.
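If we ever do want verification to pass, one common option is to export the internal CA chain to a PEM file and pass its path as `verify` (requests accepts a CA bundle path there). A sketch, where the `GCC_CA_BUNDLE` environment variable name is an assumption:

```python
import os

def verify_setting(env_var: str = 'GCC_CA_BUNDLE'):
    """Return a value suitable for requests' `verify` argument.

    If the internal CA chain has been exported to a PEM file whose path is
    stored in the (hypothetical) GCC_CA_BUNDLE environment variable, return
    that path so the internal certificate can actually be verified.
    Otherwise fall back to False, i.e. the current behaviour.
    """
    bundle = os.environ.get(env_var)
    if bundle and os.path.isfile(bundle):
        return bundle  # path to the PEM file with the internal CA chain
    return False

# usage: r = requests.get(url, verify=verify_setting())
```

This keeps the DAG running on machines without the bundle while letting configured hosts verify properly.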
@qu-xiang Can you update the table name for municipalities to `gis_core.municipalities`?
There seem to be five types of errors reported in the Morbius Airflow logs.
- UndefinedTable: missing FROM-clause entry for table "gis_core"
- InvalidColumnReference: there is no unique or exclusion constraint matching the ON CONFLICT specification
- CardinalityViolation: more than one row returned by a subquery used as an expression
- InvalidTableDefinition: multiple primary keys for table "_loop_detector" are not allowed
- SyntaxError: INSERT has more expressions than target columns
The detailed documentation of the type of error for each layer is listed in this Spreadsheet
Type 1
Suspected reason:
When executing `select {schema}.audit_table({schema}.{tablename})`, SQL thinks the schema name (e.g. gis_core) is a table name
- The query is trying to add the table to the audit table list
- The query works if I just run it in pgAdmin
- Not all layers that are supposed to be audited have this issue; it is unclear why only some raise this error
- Simply removing the `{schema}` identifier within `audit_table()` also doesn't work, because there may be duplicate table names in the server
Type 1 error
The below query composition isn't quite correct:
https://github.com/CityofToronto/bdit_data-sources/blob/e8d935b9496c39d968ee1165dd6561f8770f8b3b/gis/gccview/gcc_layer_refresh.py#L565-L566
Following some testing in pgAdmin, it looks like the function is expecting `'schema_name.table_name'` as a single string, e.g.: `select gis_core.audit_table('gis_core.municipalities')`
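The shape of the call the function seems to expect can be sketched with plain string building (in the real code this should be composed with psycopg2's `sql.Identifier` for the schema and `sql.Literal` for the string argument, never an f-string against user input):

```python
def audit_table_call(schema: str, table: str) -> str:
    """Sketch of the statement audit_table() appears to expect: the qualified
    table name passed as a single quoted string literal, not as an identifier."""
    qualified = f"{schema}.{table}"
    return f"select {schema}.audit_table('{qualified}')"
```

The key point is that the argument lands inside single quotes, matching what worked in pgAdmin.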
There might be an error on
https://github.com/CityofToronto/bdit_data-sources/blob/e8d935b9496c39d968ee1165dd6561f8770f8b3b/gis/gccview/gcc_layer_refresh.py#L563
From the 2nd example in the documentation, if you're trying to combine a schema and a table to pass, this should be handled in one `sql.Identifier()` constructor, e.g.:

```python
>>> query = sql.SQL("select {} from {}").format(sql.Identifier("table", "field"),
...                                             sql.Identifier("schema", "table"))
>>> print(query.as_string(conn))
select "table"."field" from "schema"."table"
```