RDF job never ends if some dataset raises exception in gather stage

Open pduchesne opened this issue 5 years ago • 0 comments

What happens: a DCAT RDF feed is harvested, and the job fails with

[ckanext.harvest.model] Error when processsing dataset: KeyError('title',) / Traceback (most recent call last):
  File "/home/ckan/ckan/sources/ckanext-dcat/ckanext/dcat/harvesters/rdf.py", line 211, in gather_stage
     dataset['name'] = self._gen_new_name(dataset['title'])
  KeyError: 'title'
[ckanext.harvest.queue] No harvest objects to fetch

obviously because one of the datasets is missing a title, which the code does not expect. The bigger problem is that the job is never marked as finished and stays pending.
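To illustrate the failure, here is a minimal sketch of a gather loop that reports a dataset without a title instead of raising `KeyError`. It is not the ckanext-dcat code: `gen_name` and `gather` are hypothetical stand-ins, and datasets are assumed to be plain dicts.

```python
import re


def gen_name(title):
    """Hypothetical stand-in for the harvester's name generator."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def gather(datasets):
    """Return (names, errors); a dataset without a title is reported, not fatal."""
    names, errors = [], []
    for index, dataset in enumerate(datasets):
        # dataset['title'] would raise KeyError here; .get() lets us handle it.
        title = dataset.get("title")
        if not title:
            errors.append("dataset #%d has no title" % index)
            continue
        names.append(gen_name(title))
    return names, errors


names, errors = gather([{"title": "Air Quality 2018"}, {"identifier": "x"}])
print(names)   # ['air-quality-2018']
print(errors)  # ['dataset #1 has no title']
```

With this shape, one bad dataset is skipped and logged while the remaining datasets are still gathered.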

Possible explanation: looking at https://github.com/ckan/ckanext-dcat/blob/db7ab41e77ccd1724025fed4f30c9485ad007a4f/ckanext/dcat/harvesters/rdf.py#L233-L241, we see that any dataset error causes an empty array to be returned. However, other HarvestObject instances may already have been created and saved before the error happens.

Is it possible that these HarvestObject instances are never marked as being in error, are left in limbo, and cause 'harvest job run' to consider the failed job still running? That's my impression when looking at this:

https://github.com/ckan/ckanext-harvest/blob/5aad13c2f9aba738a82eeca8bb7a859e584f483b/ckanext/harvest/logic/action/update.py#L522-L534
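The suspected mechanism can be sketched as a toy model. This is only an illustration of the hypothesis, not the ckanext-harvest implementation: `ToyObject`, `ToyJob`, `toy_gather`, `toy_job_run`, and the state names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyObject:
    guid: str
    state: str = "WAITING"  # assumed initial state of a saved object


@dataclass
class ToyJob:
    objects: list = field(default_factory=list)
    status: str = "Running"


def toy_gather(job, datasets):
    """Save one object per dataset; on any error, return an empty list."""
    ids = []
    try:
        for dataset in datasets:
            title = dataset["title"]  # raises KeyError if the title is missing
            job.objects.append(ToyObject(guid=title))  # saved before any later error
            ids.append(title)
    except KeyError:
        return []  # objects saved so far stay behind in WAITING
    return ids


def toy_job_run(job):
    """Mark the job finished only when no object is still pending."""
    if all(obj.state in ("COMPLETE", "ERROR") for obj in job.objects):
        job.status = "Finished"
    return job.status


job = ToyJob()
ids = toy_gather(job, [{"title": "a"}, {}])
status_before = toy_job_run(job)  # "Running": one object is stuck in WAITING

# One possible remedy in this model: flag the leftovers as errored.
for obj in job.objects:
    if obj.state == "WAITING":
        obj.state = "ERROR"
status_after = toy_job_run(job)   # "Finished"
```

In this model, nothing is ever sent to the fetch stage because the gather step returns an empty list, so the object saved before the error never leaves its initial state and the job never finishes.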

pduchesne, Mar 01 '19 12:03