Changes to Catalog of Life data causing iNaturalist failure

Status: Open • AetherUnbound opened this issue 5 months ago • 0 comments

Airflow log link

Note: Airflow is currently only accessible to maintainers & those given access. If you would like access to Airflow, please reach out to a member of @WordPress/openverse-maintainers.

https://airflow.openverse.org/dags/inaturalist_workflow/grid?dag_run_id=scheduled__2024-08-02T00%3A00%3A00%2B00%3A00&task_id=ingest_data.preingestion_tasks.load_catalog_of_life_names&base_date=2024-08-02T00%3A00%3A00%2B0000&tab=logs

Description

We're seeing the following issue with the iNaturalist DAG:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 762, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 733, in _execute_callable
    return ExecutionCallableRunner(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py", line 406, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/operators/python.py", line 238, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/operators/python.py", line 256, in execute_callable
    return runner.run(*self.op_args, **self.op_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/inaturalist.py", line 276, in load_catalog_of_life_names
    pg.copy_expert(COPY_SQL.format("col_vernacular"), OUTPUT_DIR / vernacular_file)
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/postgres/hooks/postgres.py", line 197, in copy_expert
    cur.copy_expert(sql, file)
psycopg2.errors.BadCopyFileFormat: extra data after last expected column
CONTEXT:  COPY col_vernacular, line 2: "49VSW		Longnose cusk-eel	Longnose cusk-eel	eng		US						"

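Identifying and addressing this can follow the same process as #4711. The error means that rows in the vernacular names file have more columns than the col_vernacular table expects. Since the failure occurs on line 2 (the first data row, immediately after the header), there has likely been a schema change in the Catalog of Life data rather than a single malformed record.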
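One way to confirm a schema change is to compare the header of the downloaded vernacular names file with the columns the table expects. The following is a minimal sketch, not the approach used in #4711: the helper `compare_header` and the column names in the usage example are hypothetical placeholders, and the real column list would have to come from the col_vernacular table definition.

```python
import csv
import io


def compare_header(tsv_text, expected_columns):
    """Compare the header of a tab-separated file with an expected column list.

    Hypothetical helper for illustration; it is not part of the Openverse catalog.
    """
    # QUOTE_NONE treats quote characters as ordinary data, so fields are
    # split on tabs only.
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    header = next(reader)
    first_row = next(reader, None)  # None if the file has only a header
    expected = list(expected_columns)
    return {
        # Columns present in the file but not in the table definition.
        "extra": [col for col in header if col not in expected],
        # Columns in the table definition that the file no longer has.
        "missing": [col for col in expected if col not in header],
        "header_width": len(header),
        "first_row_width": len(first_row) if first_row is not None else None,
    }


# Usage with made-up column names (assumptions, not the real schema):
sample = "id\tname\tlang\tnew_col\n1\tfoo\teng\tx\n"
result = compare_header(sample, ["id", "name", "lang"])
print(result["extra"], result["missing"])
```

Any entry under "extra" would point to a column added upstream, which matches the "extra data after last expected column" error. In practice the text passed in would be the first few lines of the actual file.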

Reproduction

I'm able to reproduce this locally by running the inaturalist_workflow.

DAG status

Since this is a terminal failure, I've paused the DAG until this can be addressed.

AetherUnbound • Sep 05 '24 18:09