bdit_data-sources icon indicating copy to clipboard operation
bdit_data-sources copied to clipboard

Make the GCC puller work in Airflow

Open radumas opened this issue 2 years ago • 11 comments

It will run on the On-Prem Dev server and send data to both the on-prem database and our RDS.

The state of the code is currently in https://github.com/CityofToronto/bdit_data-sources/tree/gcc_view/gis/gccview

Layers that were already being pulled should continue to be pulled and probably automated to be pulled with an audit table.

Only the layers to be used for geoprocessing VFH data should be sent to the On-Prem database.

As part of this we will want to version the centreline and intersections datasets with a parent table of allllll the centreline, child table of versions of centreline (where we add a pulled_date for each version), and then create a view that will also show the most current version.

The query to the centreline API should filter out things we know are not streets in the fcode_desc column: 'Creek/Tributary', 'Ferry Route','Geostatistical line', 'Hydro Line', 'Major Railway', 'Major Shoreline', 'Minor Shoreline (Land locked)', 'River' It would probably be good to alert when there are new fcode_desc that we do not currently have in the database.

I'm not sure what to propose for a similar filter for intersections.

radumas avatar Jul 22 '22 20:07 radumas

Current layers being pulled:

'0_0': ward, '0_2': centreline, '2_2': bike, '2_3': traffic_camera, '2_9': traffic_signal, '2_11': permit_parking, '2_35': prai_transit_shelter, '2_37': tmms_service_request, '2_38': bylaw_pt, '2_39': bylaw_line, '20_1': ev_charging_station, '22_1': day_care, '22_2': middle_child, '23_1': bia, '23_13': proposed_bia, '23_9': film_permit, '23_10': film_parking, '23_12': hotel, '26_1': convenience_store, '26_4': supermarket, '26_3': worship, '26_6': ymca, '26_7': census_tract, '26_11': neighbourhood_impro, '26_13': priority_neigh, '26_16': neigh_demo, '26_45': aborginal, '26_46': attraction, '26_47': dropin, '26_48': early_year, '26_49': family_resource, '26_50': food_bank, '26_53': long_term_care, '26_54': parenting_family_lit, '26_58': retirement, '26_59': senior_housing, '26_61': shelter, '26_62': social_housing, '27_13': private_road, '28_17': school, '28_28': library

qu-xiang avatar Aug 09 '22 18:08 qu-xiang

Following a discussions with Sarah I think we should have a gis_core schema for our core GIS layers and then the rest of the mess can stay in gis.

gis_core

'0_0': ward, '0_2': centreline, '12_41': intersection, '26_7': census_tract, '26_11': neighbourhood_impro, '26_13': priority_neigh, '26_16': neigh_demo,

ptc

'0_0': ward, '0_2': centreline, '12_41': intersection,
'11_23': ibms_district, '11_25': ibms_grid

gis

'2_2': bike, '2_3': traffic_camera, '2_11': permit_parking, '2_35': prai_transit_shelter, '2_37': tmms_service_request, '2_38': bylaw_pt, '2_39': bylaw_line, '2_46'; Loop Detector, --new '20_1': ev_charging_station, '22_1': day_care, '22_2': middle_child, '23_1': bia, '23_13': proposed_bia, '23_9': film_permit, '23_10': film_parking, '23_12': hotel, '26_1': convenience_store, '26_4': supermarket, '26_3': worship, '26_6': ymca, '26_45': aborginal, '26_46': attraction, '26_47': dropin, '26_48': early_year, '26_49': family_resource, '26_50': food_bank, '26_53': long_term_care, '26_54': parenting_family_lit, '26_58': retirement, '26_59': senior_housing, '26_61': shelter, '26_62': social_housing, '27_13': private_road, '28_17': school, '28_28': library

~remove~

'2_9': traffic_signal,

radumas avatar Aug 15 '22 22:08 radumas

This line needs to be updated with an if not exists https://github.com/CityofToronto/bdit_data-sources/blob/5f8197fba0966d28b3b7a74884e164c433adc358/gis/gccview/get_layer_gccview.py#L130

The preceding underscore can be removed.

Also this is a sql injection vector.

radumas avatar Aug 26 '22 15:08 radumas

Added Loop Detector to the list of gis

radumas avatar Aug 30 '22 18:08 radumas

After reviewing all the layers, I have noticed that '26_16': neigh_demo and '2_37': tmms_service_request can no longer be found on REST API. Several other layers have their layer_id changed, which I will update in the code.

qu-xiang avatar Sep 02 '22 15:09 qu-xiang

Regarding 26_16: neighbourhood demographic, apparently its no longer gonna be there but we can request a csv with parameters we want from the census from someone in Community & Social Services

chmnata avatar Sep 02 '22 15:09 chmnata

The municipality layer is going to have to come from the Ontario government :meow_meh:

radumas avatar Sep 21 '22 21:09 radumas

Update on progress:

  • Audit tables and partition tables are now created, audit tables and partition tables functions works, the function has been tested on the bqu schema

Next Steps:

  • [x] Create gis_bot for puller layers in the bigdata database

  • [x] Add vfh layers to gis_core schema in bigdata database

  • [x] manually pull municipality layer to gis_core schema for now

  • [x] Modify function to include click options

  • [x] Include Error catching

  • [x] Update the DAG's Task creation loop

chmnata avatar Oct 13 '22 21:10 chmnata

CodeFactor says that I need to have 'verify' to be true when requesting information from rest API. r = requests.get(url, verify = True) But when I tried to set 'verify' to True, it didn't work, and raised SSLError: HTTPSConnectionPool(host='insideto-gis.toronto.ca', port=443): Max retries exceeded with url: /arcgis/rest/services/cot_geospatial/MapServer/layers?f=json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))

Is there any way to make the verification work, or would the call be to ignore CodeFactor on this issue? @radumas

qu-xiang avatar Oct 20 '22 22:10 qu-xiang

Good catch.

Unfortunately because the site is internal it doesn't seem the SSL certificate is signed properly/recognized so, I think we can ignore the error.

radumas avatar Oct 20 '22 22:10 radumas

@qu-xiang Can you update the table name for municipalities to gis_core.municipalities ?

chmnata avatar Oct 21 '22 18:10 chmnata

There seems to be five types of errors reported in Morbius Airflow logs.

  1. UndefinedTable: missing FROM-clause entry for table "gis_core"
  2. InvalidColumnReference: there is no unique or exclusion constraint matching the ON CONFLICT specification
  3. CardinalityViolation: more than one row returned by a subquery used as an expression
  4. InvalidTableDefinition: multiple primary keys for table "_loop_detector" are not allowed
  5. SyntaxError: INSERT has more expressions than target columns

The detailed documentation of the type of error for each layer is listed in this Spreadsheet

qu-xiang avatar Dec 14 '22 04:12 qu-xiang

Type 1 Suspected reason: When executing select {schema}.audit_table({schema}.{tablename}), SQL thinks the schema name (e.g. gis_core) is a table name

  • The query is trying to add the table to audit table list
  • The query works if I just use it in pgAdmin
  • Not all layers that are supposed to be audited have that issue, it is unclear why some have this error
  • Simply removing the {schema} identifier within audit_table() also doesn't work, because there may be duplicate table names in the server

qu-xiang avatar Dec 14 '22 15:12 qu-xiang

Type 1 error

The below query composition isn't quite correct

https://github.com/CityofToronto/bdit_data-sources/blob/e8d935b9496c39d968ee1165dd6561f8770f8b3b/gis/gccview/gcc_layer_refresh.py#L565-L566

Following some testing in pgAdmin, it looks like the function is expecting 'schema_name.table_name' to be in a string, e.g.: select gis_core.audit_table('gis_core.municipalities')

There might be an error on

https://github.com/CityofToronto/bdit_data-sources/blob/e8d935b9496c39d968ee1165dd6561f8770f8b3b/gis/gccview/gcc_layer_refresh.py#L563

From the 2nd example in the documentation if you're trying to combine a schema and a table to pass, this should be handled in one sql.Identifier() constructor, e.g.:

>>> query = sql.SQL("select {} from {}").format(sql.Identifier("table", "field"),
                                                sql.Identifier("schema", "table"))
>>> print(query.as_string(conn))
select "table"."field" from "schema"."table"

radumas avatar Jan 03 '23 19:01 radumas