Possible limit in xloader submit all: only processes the first 1000 datasets?
Description
When I run:

```
ckan xloader submit all
```
It seems to only submit xloader jobs for the first 1000 datasets in the CKAN instance. Any datasets beyond that number don’t appear to be processed.
Expected behavior (unless this is intentional)
I expected this command to submit jobs for all datasets, but it currently seems limited to the first 1000.
What might be happening
It looks like this behavior comes from using the package_search API without pagination: a single call to that API returns at most 1000 results, so anything past the first 1000 is silently skipped.
Here’s the line that seems to be causing it:
https://github.com/ckan/ckanext-xloader/blob/c1670106b0774a1340b8b976ba14c0509fc871a3/ckanext/xloader/command.py#L50
Since there’s no loop or pagination logic around this, only the first 1000 datasets are processed.
Suggestion (if not intentional)
If this limit is by design, maybe it would be helpful to document it.
Otherwise, a possible fix would be to add a loop that paginates through all datasets in batches of 1000 using the `start` and `rows` parameters in the API call.
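For illustration, here is a minimal sketch of what such a pagination loop could look like. This is only an assumption about the shape of the fix, not the actual patch: `iter_all_dataset_ids` and `BATCH_SIZE` are hypothetical names, and the sketch assumes it runs inside a CKAN process with access to the action API.

```python
import ckan.plugins.toolkit as toolkit

# Hypothetical sketch, not the actual patch: page through package_search
# with `start`/`rows` until a batch comes back empty, since a single call
# returns at most 1000 rows.
BATCH_SIZE = 1000


def iter_all_dataset_ids():
    start = 0
    while True:
        result = toolkit.get_action('package_search')(
            {'ignore_auth': True},
            {'q': '*:*', 'start': start, 'rows': BATCH_SIZE},
        )
        batch = result['results']
        if not batch:
            break
        for dataset in batch:
            yield dataset['id']
        start += BATCH_SIZE
```

A caller could then iterate over `iter_all_dataset_ids()` and submit each dataset exactly as the current code does for its single batch.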
Steps to reproduce
- Set up a CKAN instance with more than 1000 datasets (the snippet after this list shows a quick way to check the count).
- Run `ckan xloader submit all`.
- Only 1000 datasets will have jobs submitted.
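As a quick sanity check for the first step, the total dataset count can be read from `package_search` with `rows=0`, which returns only the count and no results. A sketch using `requests`; the URL is a placeholder for your instance:

```python
import requests

# Ask package_search for zero rows: the response still includes the total
# `count`, which tells us how many datasets the instance has.
CKAN_URL = 'http://localhost:5000'  # placeholder; point at your instance

resp = requests.get(f'{CKAN_URL}/api/3/action/package_search', params={'rows': 0})
resp.raise_for_status()
count = resp.json()['result']['count']
print(f'Total datasets: {count}')  # needs to be > 1000 to reproduce
```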
Additional notes
Just wanted to raise this in case it's unintentional — happy to help if needed!
That doesn't sound like intentional design, thanks for picking it up.

> happy to help if needed!

If you want to submit a pull request, I'll certainly take a look.
Looking into this, the issue only affects the CLI Click commands:
```python
@xloader.command()
@click.argument(u'dataset-spec')
@click.option('-y', is_flag=True, default=False, help='Always answer yes to questions')
@click.option('--dry-run', is_flag=True, default=False, help='Don\'t actually submit any resources')
@click.option('--queue', help='Queue name for asynchronous processing, unused if executing immediately')
@click.option('--sync', is_flag=True, default=False,
              help='Execute immediately instead of enqueueing for asynchronous processing')
def submit(dataset_spec, y, dry_run, queue, sync):
    """
    xloader submit [options] <dataset-spec>
    """
    cmd = XloaderCmd(dry_run)
    if dataset_spec == 'all':
        cmd._setup_xloader_logger()
        cmd._submit_all(sync=sync, queue=queue)
    elif dataset_spec == 'all-existing':
        _confirm_or_abort(y, dry_run)
        cmd._setup_xloader_logger()
        cmd._submit_all_existing(sync=sync, queue=queue)
    ...
```
It does not affect individual resource creates/updates, as those already have the full resource set from the primary processing.
Partial workaround:
If you already have datasets in the system and want to refresh all of them, use the `all-existing` option instead of `all`, i.e. `ckan xloader submit all-existing`.
I created a pull request to fix the issue: https://github.com/ckan/ckanext-xloader/pull/252