ckanext-dcat
RDF job never ends if some dataset raises exception in gather stage
What happens: a DCAT RDF feed is harvested, and the gather stage fails with:
[ckanext.harvest.model] Error when processsing dataset: KeyError('title',) / Traceback (most recent call last):
File "/home/ckan/ckan/sources/ckanext-dcat/ckanext/dcat/harvesters/rdf.py", line 211, in gather_stage
dataset['name'] = self._gen_new_name(dataset['title'])
KeyError: 'title'
[ckanext.harvest.queue] No harvest objects to fetch
This obviously happens because one of the datasets is missing a title, which the code does not expect. The real problem, however, is that the job is never marked as finished and stays pending.
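To illustrate the first half of the problem, here is a minimal sketch of a gather loop that tolerates a missing title instead of raising. It is not the harvester's actual code: `gen_name` is a hypothetical stand-in for `_gen_new_name`, and `assign_names` and its error messages are made up for this example.

```python
import re


def gen_name(title):
    """Hypothetical stand-in for the harvester's _gen_new_name()."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def assign_names(datasets):
    """Name each dataset, collecting errors instead of aborting the whole run."""
    named, errors = [], []
    for index, dataset in enumerate(datasets):
        title = dataset.get("title")  # .get() avoids the KeyError seen in the log
        if not title:
            # Record the problem and move on to the next dataset.
            errors.append("Dataset #%d has no title, skipping" % index)
            continue
        named.append(dict(dataset, name=gen_name(title)))
    return named, errors


named, errors = assign_names([
    {"title": "Air Quality 2015"},
    {"description": "no title here"},
])
```

With this approach a single bad dataset would be reported as an error while the rest of the feed is still harvested. Whether skipping or falling back to another identifier is the right behaviour is a separate design question.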
Possible explanation: looking at https://github.com/ckan/ckanext-dcat/blob/db7ab41e77ccd1724025fed4f30c9485ad007a4f/ckanext/dcat/harvesters/rdf.py#L233-L241 we see that any dataset error results in an empty list being returned. However, other HarvestObject instances may already have been created and saved before the error happens.
Is it possible that these HarvestObject instances are never marked as errored, are left in limbo, and cause 'harvest job run' to consider the failed job as still running? That is my impression when looking at this:
https://github.com/ckan/ckanext-harvest/blob/5aad13c2f9aba738a82eeca8bb7a859e584f483b/ckanext/harvest/logic/action/update.py#L522-L534
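The suspected mechanism can be modelled in a few lines. This is only a sketch of my reading of the two linked snippets, not the real ckanext-harvest code: the classes are plain dataclasses rather than the actual models, the state names `WAITING`, `COMPLETE` and `ERROR` are assumptions, and `fail_pending_objects` is a hypothetical remedy.

```python
from dataclasses import dataclass, field

# Assumed terminal states; anything else counts as "still in progress".
TERMINAL_STATES = {"COMPLETE", "ERROR"}


@dataclass
class HarvestObject:
    guid: str
    state: str = "WAITING"  # assumed default state of a freshly saved object


@dataclass
class HarvestJob:
    objects: list = field(default_factory=list)


def job_is_finished(job):
    """A job counts as finished only when no object is in a non-terminal state."""
    return all(obj.state in TERMINAL_STATES for obj in job.objects)


def gather(job, datasets):
    """Mimics the suspected flow: objects are saved one by one, then an error returns []."""
    ids = []
    try:
        for dataset in datasets:
            guid = dataset["title"].lower()  # KeyError if the title is missing
            job.objects.append(HarvestObject(guid=guid))  # "saved" before the failure
            ids.append(guid)
    except KeyError:
        return []  # earlier objects stay attached to the job, never queued
    return ids


def fail_pending_objects(job):
    """Hypothetical remedy: mark orphaned objects as errored so the job can close."""
    for obj in job.objects:
        if obj.state not in TERMINAL_STATES:
            obj.state = "ERROR"


job = HarvestJob()
ids = gather(job, [{"title": "A"}, {}])  # second dataset has no title
stuck = not job_is_finished(job)         # one WAITING object keeps the job open
fail_pending_objects(job)
```

If this model matches what really happens, the object created for the first dataset is never sent to the fetch queue, never leaves its initial state, and therefore keeps the job from being marked as finished.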