openverse
Stocksnap 403 forbidden error during ingestion
Airflow log link
Note: Airflow is currently only accessible to maintainers & those given access. If you would like access to Airflow, please reach out to a member of @WordPress/openverse-maintainers.
https://airflow.openverse.org/log?execution_date=2024-03-01T00%3A00%3A00%2B00%3A00&task_id=ingest_data.pull_image_data&dag_id=stocksnap_workflow&map_index=-1
Description
The Stocksnap DAG encountered an error during ingestion:
[2024-04-01, 23:09:02 UTC] {requester.py:85} ERROR - Error with the request for URL: https://cdn.stocksnap.io/img-thumbs/960w/LZITOLMWL6.jpg
[2024-04-01, 23:09:02 UTC] {requester.py:86} INFO - HTTPError: 403 Client Error: Forbidden for url: https://cdn.stocksnap.io/img-thumbs/960w/LZITOLMWL6.jpg
[2024-04-01, 23:09:02 UTC] {requester.py:89} INFO - Using headers {'User-Agent': 'Openverse/0.1 (https://openverse.org; [email protected])', 'Accept': 'application/json'}
[2024-04-01, 23:09:02 UTC] {media.py:233} INFO - Writing 68 lines from buffer to disk.
[2024-04-01, 23:09:02 UTC] {provider_data_ingester.py:513} INFO - Committed 31168 records
[2024-04-01, 23:09:02 UTC] {taskinstance.py:2728} ERROR - Task failed with exception
providers.provider_api_scripts.provider_data_ingester.IngestionError: 403 Client Error: Forbidden for url: https://cdn.stocksnap.io/img-thumbs/960w/LZITOLMWL6.jpg
query_params: {}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 200, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 217, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/catalog/dags/providers/factory_utils.py", line 55, in pull_media_wrapper
    data = ingester.ingest_records()
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 276, in ingest_records
    raise error from ingestion_error
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 241, in ingest_records
    self.record_count += self.process_batch(batch)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 455, in process_batch
    if not (record_data := self.get_record_data(data)):
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/stocksnap.py", line 92, in get_record_data
    filesize = self._get_filesize(url)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/stocksnap.py", line 152, in _get_filesize
    resp = self.delayed_requester.head(image_url)
  File "/opt/airflow/catalog/dags/common/requester.py", line 114, in head
    return self._make_request(self.session.head, url, **kwargs)
  File "/opt/airflow/catalog/dags/common/requester.py", line 70, in _make_request
    response.raise_for_status()
  File "/home/airflow/.local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://cdn.stocksnap.io/img-thumbs/960w/LZITOLMWL6.jpg
On top of this, Stocksnap uses a page counter instead of normal query params, so it's difficult to determine which page it failed on:
https://github.com/WordPress/openverse/blob/852e6b7d852728a7213f860d0a8657f06c584e00/catalog/dags/providers/provider_api_scripts/stocksnap.py#L45
In addition to resolving this issue, we should try to alter the DAGs that don't normally use query parameters so that they still report something useful (such as the current page) when they fail, rather than the empty query_params: {} seen in the log above.
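One way this could look, as a rough sketch only: a loop over a page counter wraps each page in a try/except and attaches the current page to the error it raises. The names PageIngestionError, ingest_pages, fetch_page and process_record are hypothetical and are not part of the catalog's ProviderDataIngester; the real change would need to fit into the existing ingest_records flow.

```python
from typing import Callable, Iterable


class PageIngestionError(Exception):
    """Hypothetical error type that records which page was being processed."""

    def __init__(self, cause: Exception, context: dict):
        # Mirror the "query_params: ..." line that the existing log prints.
        super().__init__(f"{cause}\nquery_params: {context}")
        self.context = context


def ingest_pages(
    fetch_page: Callable[[int], Iterable[dict]],
    process_record: Callable[[dict], None],
    start_page: int = 1,
    max_pages: int = 1000,
) -> int:
    """Walk a page-counter API, tagging any failure with the current page.

    Returns the number of records processed before the API ran out of pages.
    """
    count = 0
    for page in range(start_page, start_page + max_pages):
        try:
            records = list(fetch_page(page))
            if not records:
                break  # an empty page is treated as the end of the data
            for record in records:
                process_record(record)
                count += 1
        except Exception as err:
            # Even though the page is part of the URL rather than a query
            # parameter, report it so the run can be resumed from this point.
            raise PageIngestionError(err, {"page": page}) from err
    return count
```

With something along these lines, a failure on a given page would surface that page number in the task log, which could then be fed back in as the starting point for a rerun.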
DAG status
Unchanged for now since this is a monthly DAG
I've opened #4102 to help us reproduce this. Once that's merged, we should run the DAG again and see if it fails in the same place. If it does, we can continue to troubleshoot. If it doesn't, we can close this and reopen if it comes up again.
Confirmed (now that initial_query_params works!) that this fails locally when starting with the params {"page": 780}. By the time I tested locally, the error was actually occurring on page 781, possibly because more records had been added to Stocksnap in the meantime.
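To illustrate why the failing page could drift from 780 to 781, here is a small arithmetic sketch. It assumes results are listed newest-first and uses a page size of 40; both are assumptions for illustration and have not been confirmed against the Stocksnap API. The helper page_of is hypothetical.

```python
PAGE_SIZE = 40  # assumed page size, for illustration only


def page_of(index: int, page_size: int = PAGE_SIZE) -> int:
    """Return the 1-based page holding the record at 0-based position `index`."""
    return index // page_size + 1


# Suppose the problematic record sits in the last slot of page 780.
index_before = 780 * PAGE_SIZE - 1

# If five newer records are added ahead of it, every older record moves
# five positions further down the listing.
index_after = index_before + 5

print(page_of(index_before), page_of(index_after))  # prints "780 781"
```

Under these assumptions, only a handful of new uploads would be enough to push a record near the end of one page onto the next, so rerunning from a page or two earlier than the reported one may be the safer way to reproduce.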
I have emailed Stocksnap about this issue.