Changes to Catalog of Life data causing iNaturalist failure
Airflow log link
Note: Airflow is currently only accessible to maintainers & those given access. If you would like access to Airflow, please reach out to a member of @WordPress/openverse-maintainers.
https://airflow.openverse.org/dags/inaturalist_workflow/grid?dag_run_id=scheduled__2024-08-02T00%3A00%3A00%2B00%3A00&task_id=ingest_data.preingestion_tasks.load_catalog_of_life_names&base_date=2024-08-02T00%3A00%3A00%2B0000&tab=logs
Description
We're seeing the following issue with the iNaturalist DAG:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 762, in _execute_task
result = _execute_callable(context=context, **execute_callable_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 733, in _execute_callable
return ExecutionCallableRunner(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/operator_helpers.py", line 252, in run
return self.func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py", line 406, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/operators/python.py", line 238, in execute
return_value = self.execute_callable()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/operators/python.py", line 256, in execute_callable
return runner.run(*self.op_args, **self.op_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/operator_helpers.py", line 252, in run
return self.func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/airflow/catalog/dags/providers/provider_api_scripts/inaturalist.py", line 276, in load_catalog_of_life_names
pg.copy_expert(COPY_SQL.format("col_vernacular"), OUTPUT_DIR / vernacular_file)
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/postgres/hooks/postgres.py", line 197, in copy_expert
cur.copy_expert(sql, file)
psycopg2.errors.BadCopyFileFormat: extra data after last expected column
CONTEXT: COPY col_vernacular, line 2: "49VSW Longnose cusk-eel Longnose cusk-eel eng US "
Identifying and addressing this can follow the same process as #4711. Since the failure occurs on line 2 of the file (the first data row, immediately after the header), the schema of the source file has likely changed.
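One way to confirm a schema change is to compare the header row of the downloaded vernacular names file against the columns the col_vernacular table expects. The following is a minimal sketch, not the DAG's actual code: the function name `diff_header`, the sample data, and the column names are all hypothetical and would need to be replaced with the real file contents and the real table definition.

```python
import csv
import io


def diff_header(tsv_text, expected_columns):
    """Compare a TSV header row to the expected columns.

    Returns (extra, missing): columns present in the file but not expected,
    and columns expected but absent from the file.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    extra = [col for col in header if col not in expected_columns]
    missing = [col for col in expected_columns if col not in header]
    return extra, missing


# Hypothetical sample: the column names here are placeholders, not the
# real Catalog of Life schema.
sample = (
    "taxonID\tname\tlanguage\tcountry\tnewColumn\n"
    "49VSW\tLongnose cusk-eel\teng\tUS\t\n"
)
expected = ["taxonID", "name", "language", "country"]

extra, missing = diff_header(sample, expected)
print("extra columns:", extra)
print("missing columns:", missing)
```

Any column reported as extra would plausibly explain the "extra data after last expected column" error, since COPY receives more fields per row than the table defines.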
Reproduction
I'm able to reproduce this locally by running the inaturalist_workflow.
DAG status
Since this is a terminal failure, I've paused the DAG until this can be addressed.