
How do we set `limit`?

bray opened this issue 2 years ago · 4 comments

When `collection` returns an `ActiveRecord::Relation`, how do we set the limit for each "batch"? job-iteration's Best Practices says it defaults to 100, which is too small for expensive queries and/or large datasets.

It looks like that's passed in here.

But for maintenance_tasks, we're not explicitly building an enumerator in the `collection` method, so how can we pass in or set the batch size? I've tried some guesses like `@batch_size`, `@of`, `def batch_size`, and `def of` inside the `MaintenanceTasks::Task`.

I also tried returning a few different classes from the `collection` method, but none of them worked (rough sketch of the attempts below):

`#find_each(batch_size: 2000)` gets us what we want, but it returns an `Enumerator`, and `#collection` can't be an `Enumerator` (it raises "#collection must be either an Active Record Relation, ActiveRecord::Batches::BatchEnumerator, Array, or CSV.").

`#in_batches(batch_size: 2000)` works (it returns an `ActiveRecord::Batches::BatchEnumerator`), but only if you want `#process` to receive whole batches rather than individual records.

`#in_batches(batch_size: 2000).each_record` also returns an `Enumerator`.
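
For reference, roughly what those attempts looked like inside the task (`Post` and the body of `#process` are just placeholders):

```ruby
class SomeBackfillTask < MaintenanceTasks::Task
  def collection
    # Post.all.find_each(batch_size: 2000)          # Enumerator -- rejected by #collection
    # Post.in_batches(batch_size: 2000)             # BatchEnumerator -- #process gets whole batches
    # Post.in_batches(batch_size: 2000).each_record # Enumerator -- rejected again
    Post.all # what we're left with, iterated 100 records at a time
  end

  def process(post)
    post.touch # placeholder work
  end
end
```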

bray · Sep 14 '23 18:09

I'd say this is purely a missing API, as there's no reason not to let tasks redefine the batch size used under the hood. The batch size will need to be passed to the `active_record_on_records` enumerator via the `batch_size:` option: https://github.com/Shopify/maintenance_tasks/blob/e1765da66ff0c870891c8608a7260f8a8292ac0e/app/jobs/concerns/maintenance_tasks/task_job_concern.rb#L42

So I'll try to put it on our roadmap, or PRs are always welcome! Unless someone else from the maintainers remembers or has concerns about why this feature shouldn't exist.

The only requirement for this feature is that tasks shouldn't be forced to always specify the batch size; it should still default to 100 (or to nil, which means we'd use the default from job-iteration).
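
A minimal sketch of how that could be wired up in `TaskJobConcern#build_enumerator`, assuming a hypothetical `batch_size` reader on the task (the gem's real method handles more collection types than shown here):

```ruby
# Sketch only: forward an optional, task-defined batch size to job-iteration's enumerator.
def build_enumerator(_run, cursor:)
  collection = @task.collection

  case collection
  when ActiveRecord::Relation
    enumerator_builder.active_record_on_records(
      collection,
      cursor: cursor,
      # `@task.batch_size` is a hypothetical reader; when the task doesn't
      # define it, we fall back to job-iteration's default of 100.
      batch_size: @task.batch_size || 100,
    )
  else
    raise ArgumentError, "this sketch only handles ActiveRecord::Relation collections"
  end
end
```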

nvasilevski · Sep 14 '23 19:09

> 100 is too small for expensive queries and/or large datasets.

I'm curious whether you have any examples where a query would benefit from a larger batch size, or whether it's just about reducing the total number of queries the task performs. It's not relevant to whether this API is needed; it's just that we often hear the opposite, where people try to lower the batch size so the database searches for fewer records per query and returns results quicker.

I also just realized that we need to be careful and make sure people don't use overly large batch sizes. For example, if a task loads records in batches of 1000 but each `process` call takes 1s to finish, the task will never get through a whole batch before being interrupted, since the default interruption is configured to happen at 120s. The task would essentially throw away most of the loaded records while only processing ~120 of the 1000.

UPD: Not sure why I assumed that job-iteration sets `max_job_runtime` to 2 minutes by default, but the point still stands: `batch_size` should be aligned with `max_job_runtime` to avoid loading records that won't be processed in a given run.
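
Purely as an illustration of that alignment (the runtime and the per-record cost below are made-up numbers, not defaults from this gem or from job-iteration):

```ruby
# config/initializers/job_iteration.rb -- illustrative values only
JobIteration.max_job_runtime = 5.minutes # app-level cap on a single uninterrupted run

# If each #process call takes roughly 1 second, one run gets through about
# 5 * 60 = 300 records before being interrupted. With batch_size: 1000,
# ~700 already-loaded records per batch would be discarded at each interruption,
# so a batch size at or below ~300 keeps every loaded batch fully usable.
```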

nvasilevski · Sep 14 '23 20:09

Ok, thanks - good to know I wasn't missing something. I'll try to put up a PR!

An example: say you have to process 50 million records. If you fetch 100 per query, that's 500,000 queries, as opposed to 50,000 or 10,000 if you fetch 1k or 5k per query. If the query takes only negligibly longer at the larger batch size, it's probably better to use the larger size and greatly reduce the number of queries in a potentially small time window. This presumes each `process` call is cheap, e.g. it only updates the record, so it takes at most tens of ms.

That's a good point re: `max_job_runtime`. To be honest, I've set it to infinity to disable the periodic interruptions. I understand the tasks should be interruptible, but I don't really see the benefit of forcing interrupts every 5 minutes. The reasoning given here is to preserve worker capacity, but we'll be running these tasks in their own Sidekiq queue with dedicated workers, so that's not a concern for us.

So in our case, the concern about big batch sizes doesn't apply. But I understand it has to be kept in mind for changes to the gem.

bray · Sep 14 '23 22:09

This issue has been marked as stale because it has not been commented on in two months. Please reply in order to keep the issue open. Otherwise, it will close in 14 days. Thank you for contributing!

github-actions[bot] · Jan 25 '24 01:01

Maple and I hacked a bit on this during RailsConf Hack day and came up with a simple proof of concept: https://github.com/Shopify/maintenance_tasks/tree/custom-relation-limit-size

Thoughts on the API, @nvasilevski @bray? I opted for a task-level DSL, `collection_limit`.
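
For anyone skimming, a hypothetical example of how the prototype's DSL might be used in a task (the exact API lives on the linked branch and may differ; the model is made up):

```ruby
# Hypothetical usage of the proof-of-concept DSL; see the linked branch for
# the real implementation. `Post` is a placeholder model.
class BackfillPostsTask < MaintenanceTasks::Task
  collection_limit 2000 # proposed task-level DSL (naming discussed below)

  def collection
    Post.where(backfilled: false)
  end

  def process(post)
    post.update!(backfilled: true)
  end
end
```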

adrianna-chang-shopify · May 13 '24 20:05

I'm a little worried that any name with _limit_ in it may mislead users into thinking it's the maximum number of records the task will iterate over. Personally, I would prefer having _batch_ in the name.

But the prototype looks clean!

nvasilevski · May 13 '24 21:05

I agree with Nikita. Looks good, but s/limit/batch_size/g :)

bray · May 13 '24 22:05

Naming is hard 😅 I was trying to steer clear of "batch" in the API initially because I was worried it would cause confusion with the batch relation enumerator. But `batch_size` is the kwarg job-iteration uses, so it's probably clearer than `limit` 👍 Okay, I'll open up a PR with some tests. Thanks for the feedback, y'all!

adrianna-chang-shopify · May 14 '24 12:05