
Concurrent task iteration support

Open joshbeckman opened this issue 3 years ago • 10 comments

Over at https://github.com/shopify/flow we have been trying to adopt the maintenance task framework and have enjoyed the benefits for our small data migrations, but our main hangup is the long runtime of tasks that need to operate on large datasets (e.g. all records in one table: tens of thousands now, and many more in the future). When we tried running a recent data migration via a maintenance task, the total time to execute would have been months.

As such, our main desire with this library would be declarative concurrency support. Is https://github.com/Shopify/maintenance_tasks/issues/325#issuecomment-776012605 still the recommendation for concurrency in the future of this library?

No immediate need for action on this - we just wanted to provide feedback on our adoption!

joshbeckman avatar May 26 '21 21:05 joshbeckman

Do you think batch enumerators could help? (see #409) Depending on your tasks, being able to update 100 or 1000 records at a time could substantially speed them up.
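As a rough illustration of why batching can help, here is a self-contained plain-Ruby sketch (no Rails involved). `FakeTable` and its round-trip counter are hypothetical stand-ins for a database table, not part of maintenance_tasks; the sketch only compares one simulated write per record against one write per batch of 1000.

```ruby
# Hypothetical in-memory stand-in for a table. It counts simulated
# database round trips so the two strategies can be compared.
class FakeTable
  attr_reader :rows, :round_trips

  def initialize(size)
    @rows = Array.new(size) { |i| { id: i + 1, migrated: false } }
    @round_trips = 0
  end

  # Simulates a single UPDATE statement touching the given rows.
  def update(rows)
    @round_trips += 1
    rows.each { |row| row[:migrated] = true }
  end
end

# One write per record.
per_record = FakeTable.new(10_000)
per_record.rows.each { |row| per_record.update([row]) }

# One write per batch of 1000 records.
batched = FakeTable.new(10_000)
batched.rows.each_slice(1_000) { |batch| batched.update(batch) }

puts "per-record round trips: #{per_record.round_trips}"
puts "batched round trips:    #{batched.round_trips}"
```

In a real task the equivalent would presumably be a `collection` that returns a batch relation (for example something like `Post.in_batches`) with `process` receiving one batch at a time, but check the gem's README for the supported form.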

Regarding actual parallelism when running tasks, it's something that we're thinking about, but we haven't made any formal plans, so we can't make any promises. We can keep this issue open to continue thinking about it, start fleshing out an API and behaviour, figure out the edge cases (e.g. it will require special handling for custom enumerators, which may not be able to start from an arbitrary cursor and can only give out one item at a time), etc.

etiennebarrie avatar May 27 '21 14:05 etiennebarrie

Batches could help with some of our task types, yes!

But we have other types of tasks that require, for example, calling an external API with an individual record and then saving the returned value to our database. For those, batching would remove some of the overhead of the job queue itself but wouldn't give us the speedup that we would get from concurrency.
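To make the I/O-bound case concrete, here is a minimal sketch of what concurrent iteration could look like in plain Ruby, using only the standard `Thread`, `Queue`, and `SizedQueue` classes. It is not an API of maintenance_tasks: `fetch_remote_value` and `process_concurrently` are hypothetical names, and the `sleep` stands in for a slow external API call. A single producer walks the enumerator in order, so this also works for enumerators that can only hand out one item at a time.

```ruby
# Hypothetical stand-in for a slow, I/O-bound external API call.
def fetch_remote_value(record_id)
  sleep(0.01) # simulated network latency
  record_id * 2
end

# Feed items from any enumerator to a fixed pool of worker threads.
# Returns a hash of item => fetched value.
def process_concurrently(enumerator, workers: 4)
  queue = SizedQueue.new(workers * 2) # bounded, so the producer cannot run far ahead
  results = Queue.new

  threads = Array.new(workers) do
    Thread.new do
      # pop returns nil once the queue has been closed and drained.
      while (id = queue.pop)
        results << [id, fetch_remote_value(id)]
      end
    end
  end

  # Single producer: items are pulled from the enumerator sequentially.
  enumerator.each { |id| queue << id }
  queue.close
  threads.each(&:join)

  Array.new(results.size) { results.pop }.to_h
end

values = process_concurrently((1..20).each, workers: 4)
puts "processed #{values.size} records"
```

Because the workers spend most of their time waiting on I/O, threads are enough to overlap the waits. The sketch ignores the harder parts a real feature would need, such as pausing, resuming from a cursor after interruption, and error handling.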

joshbeckman avatar May 27 '21 15:05 joshbeckman

I recently ran a migration on flow that mainly involved making GraphQL requests to core. Processing 874k rows would take about 7 days to complete. I think allowing parallelism would really help in these cases.
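For a sense of scale, the arithmetic behind those numbers can be sketched as below. The worker counts are illustrative assumptions, and the estimate assumes ideal linear scaling, which rate limits and database contention would reduce in practice. `ideal_hours` is a hypothetical helper name.

```ruby
ROWS = 874_000
TOTAL_SECONDS = 7 * 24 * 60 * 60 # 7 days of sequential processing

# Average sequential cost per row, roughly 0.69 seconds.
seconds_per_row = TOTAL_SECONDS.to_f / ROWS

# Best-case wall-clock time in hours if the work splits evenly across workers.
def ideal_hours(total_seconds, workers)
  total_seconds / workers.to_f / 3600
end

puts format("%.2f seconds per row", seconds_per_row)
[1, 4, 16].each do |workers|
  puts format("%2d workers: %.1f hours", workers, ideal_hours(TOTAL_SECONDS, workers))
end
```

Under those assumptions, 4 workers would bring the run from 168 hours down to about 42, and 16 workers to about 10.5.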

sle-c avatar Mar 29 '23 18:03 sle-c

This issue has been marked as stale because it has not been commented on in two months. Please reply in order to keep the issue open. Otherwise, it will close in 14 days. Thank you for contributing!

github-actions[bot] avatar Jan 27 '24 01:01 github-actions[bot]

We would still like this!

joshbeckman avatar Jan 29 '24 16:01 joshbeckman

This issue has been marked as stale because it has not been commented on in two months. Please reply in order to keep the issue open. Otherwise, it will close in 14 days. Thank you for contributing!

github-actions[bot] avatar Mar 31 '24 01:03 github-actions[bot]

We would still really like this.

joshbeckman avatar Apr 01 '24 13:04 joshbeckman

This would be incredibly useful.

segiddins avatar Apr 16 '24 17:04 segiddins