
Possible limit in xloader submit all: only processes the first 1000 datasets?

dsanchezmatilla opened this issue 7 months ago • 3 comments

Description

When I run: ckan xloader submit all

It seems to only submit xloader jobs for the first 1000 datasets in the CKAN instance. Any datasets beyond that number don’t appear to be processed.

Expected behavior (unless this is intentional)

I expected this command to submit jobs for all datasets, but it currently seems limited to the first 1000.

What might be happening

It looks like this behavior comes from the use of the package_search API without pagination. By default, that API only returns up to 1000 results.

Here’s the line that seems to be causing it:

https://github.com/ckan/ckanext-xloader/blob/c1670106b0774a1340b8b976ba14c0509fc871a3/ckanext/xloader/command.py#L50

Since there’s no loop or pagination logic around this, only the first 1000 datasets are processed.

Suggestion (if not intentional)

If this limit is by design, maybe it would be helpful to document it.
Otherwise, a possible fix would be to add a loop that paginates through all datasets in batches of 1000 using the start and rows parameters in the API call.
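A minimal sketch of what such a pagination loop could look like. This is illustrative only: `search` stands in for however the command invokes the `package_search` action, and the helper name is hypothetical, not part of the extension.

```python
def iter_all_datasets(search, batch_size=1000):
    """Yield every dataset by repeatedly calling a package_search-style
    function, advancing `start` by the page size until all `count`
    results have been seen."""
    start = 0
    while True:
        # `search` is assumed to return a dict with 'count' and 'results',
        # mirroring the shape of CKAN's package_search response.
        page = search(q='*:*', start=start, rows=batch_size)
        results = page['results']
        if not results:
            break
        for dataset in results:
            yield dataset
        start += len(results)
        if start >= page['count']:
            break
```

The submit logic would then iterate over `iter_all_datasets(...)` instead of a single `package_search` response, so datasets beyond the first 1000 are also enqueued.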

Steps to reproduce

  1. Set up a CKAN instance with more than 1000 datasets.
  2. Run ckan xloader submit all.
  3. Only 1000 datasets will have jobs submitted.

Additional notes

Just wanted to raise this in case it's unintentional — happy to help if needed!

dsanchezmatilla avatar May 28 '25 16:05 dsanchezmatilla

That doesn't sound like intentional design, thanks for picking it up.

happy to help if needed!

If you want to submit a pull request, I'll certainly take a look.

ThrawnCA avatar May 28 '25 23:05 ThrawnCA

Looking into this, it only affects the CLI Click commands.

@xloader.command()
@click.argument(u'dataset-spec')
@click.option('-y', is_flag=True, default=False, help='Always answer yes to questions')
@click.option('--dry-run', is_flag=True, default=False, help='Don\'t actually submit any resources')
@click.option('--queue', help='Queue name for asynchronous processing, unused if executing immediately')
@click.option('--sync', is_flag=True, default=False,
              help='Execute immediately instead of enqueueing for asynchronous processing')
def submit(dataset_spec, y, dry_run, queue, sync):
    """
        xloader submit [options] <dataset-spec>
    """
    cmd = XloaderCmd(dry_run)

    if dataset_spec == 'all':
        cmd._setup_xloader_logger()
        cmd._submit_all(sync=sync, queue=queue)
    elif dataset_spec == 'all-existing':
        _confirm_or_abort(y, dry_run)
        cmd._setup_xloader_logger()
        cmd._submit_all_existing(sync=sync, queue=queue)
...

It does not affect individual resource creates/updates, since those already have the full resource set from the primary processing.

Partial workaround:

If you already have datasets in the system and want to refresh all of them, use the 'all-existing' option instead of 'all'.

duttonw avatar May 29 '25 00:05 duttonw

I created a pull request to fix the issue: https://github.com/ckan/ckanext-xloader/pull/252

dsanchezmatilla avatar May 29 '25 14:05 dsanchezmatilla