
Provide support for enumerable collections beyond Array and ActiveRecord::Relation in TaskJob

Open adrianna-chang-shopify opened this issue 4 years ago • 2 comments

Will be partially solved when CSV functionality is implemented, after which we will decide whether further collections should be supported.

adrianna-chang-shopify avatar Nov 30 '20 14:11 adrianna-chang-shopify

👋 I was going to open a new issue, but this seemed like a more appropriate place to chime in.


TL;DR: Lack of support for custom enumerators prevents certain use cases, such as processing external resources.


I was investigating adopting this gem for our app (Shopify/athena), and support being limited to ActiveRecord::Relation & Array (and CSVs, once that is added) would block us from adopting it for many of our use cases.

Athena manages many resources in Twilio. One of the things we do is run a monthly job which iterates over certain resources which are over a month old and deletes them. For this, we use Shopify/job-iteration, and a custom enumerator which "streams" in all the results for our search, and iterates over them until they're all gone. In some other jobs which don't delete the resources, we use their start_time as an increasing cursor.

This works because JobIteration supports custom Enumerators. Given the current implementation, we would have to pre-load all the resources (or IDs) into an Array before starting to process them, or some other workaround.
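
To make the idea concrete, here is a minimal sketch in plain Ruby of the kind of "streaming" enumerator described above. It fetches one page at a time and yields `[item, cursor]` pairs, which is the shape job-iteration expects from custom enumerators as far as I understand it. `FakeResourceAPI`, its `list(after:, limit:)` method, and `build_resource_enumerator` are hypothetical stand-ins for illustration only; they are not Twilio's client or part of either gem.

```ruby
# Hypothetical stand-in for a paginated external API (not a real client).
class FakeResourceAPI
  attr_reader :requests

  def initialize(ids)
    @ids = ids.sort
    @requests = 0
  end

  # Returns up to `limit` ids strictly greater than `after`, in ascending order.
  def list(after: nil, limit: 2)
    @requests += 1
    remaining = after.nil? ? @ids : @ids.select { |id| id > after }
    remaining.first(limit)
  end
end

# Builds a lazy enumerator yielding [item, cursor] pairs.
# Pages are only requested as the consumer asks for more items,
# so the full collection is never loaded into an Array up front.
def build_resource_enumerator(api, cursor: nil)
  Enumerator.new do |yielder|
    last_seen = cursor
    loop do
      page = api.list(after: last_seen)
      break if page.empty?

      page.each do |id|
        yielder << [id, id] # here the item's id doubles as its cursor
        last_seen = id
      end
    end
  end
end

api = FakeResourceAPI.new([1, 2, 3, 4, 5])
all_pairs = build_resource_enumerator(api).to_a
resumed_pairs = build_resource_enumerator(api, cursor: 3).to_a
```

Passing `cursor: 3` starts the enumeration after the item whose cursor was 3, which is what allows an interrupted job to pick up where it left off.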

It would be great if MaintenanceTasks supported custom enumerators, similarly to JobIteration!

sambostock avatar Jan 16 '21 22:01 sambostock

As per the discussion in https://github.com/Shopify/maintenance_tasks/pull/307#discussion_r561420323, I'm going to expand on some of our use cases, with respect to processing API resources in Twilio.

Most of these tasks run on a schedule, though we have had one-off tasks come and go.

  • Synchronization tasks: We have tasks which iterate through resources in Twilio, typically using a timestamp field (created_at, updated_at, event_date, etc.) as a cursor, and upsert them into our SQL "mirror" database.
  • Deletion tasks: We delete various resources as they exit our retention period. Some of these are "mirrored" in our database, so we enumerate over the records, deleting each resource and its record one by one. For resources which are not mirrored, we must iterate via the API, again with some time-based cursor, but this time with extra parameters to narrow the time range.
  • One-off "backfill" tasks:
    • After adding code to automatically move call recordings from Twilio into GCS once ready, we had to iterate through existing recordings and move them over.
    • After adding some code to populate a worker attribute after worker creation, we had to backfill the attribute for all existing workers.

In these examples, we want to iterate over a set of resources

  • not available as an ActiveRecord::Relation
  • potentially large enough to want to avoid eagerly building an Array
  • potentially expanding over the duration of the enumeration (i.e. as we process resources, more may be created)
  • potentially large enough to warrant cursor-based resumption after interruption
  • requiring a custom cursor
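
The properties listed above can be sketched with a timestamp cursor in plain Ruby. This is only an illustration under stated assumptions: `Resource`, `FakeRemoteStore`, its `list(created_after:, limit:)` method, and `enumerate_by_created_at` are hypothetical names, not Twilio's API or anything provided by MaintenanceTasks or JobIteration. The example interrupts a run, lets the collection grow, and then resumes from the saved cursor.

```ruby
Resource = Struct.new(:sid, :created_at)

# Hypothetical in-memory store standing in for a remote API
# that can filter resources by creation time.
class FakeRemoteStore
  def initialize
    @resources = []
  end

  def create(sid, created_at)
    @resources << Resource.new(sid, created_at)
  end

  # Returns up to `limit` resources created strictly after `created_after`, oldest first.
  def list(created_after: nil, limit: 2)
    matches = @resources.sort_by(&:created_at)
    matches = matches.select { |r| r.created_at > created_after } if created_after
    matches.first(limit)
  end
end

# Lazily yields [resource, cursor] pairs, using created_at as an increasing cursor.
def enumerate_by_created_at(store, cursor: nil)
  Enumerator.new do |yielder|
    last_seen = cursor
    loop do
      page = store.list(created_after: last_seen)
      break if page.empty?

      page.each do |resource|
        last_seen = resource.created_at
        yielder << [resource, last_seen]
      end
    end
  end
end

store = FakeRemoteStore.new
store.create("RE1", Time.at(100).utc)
store.create("RE2", Time.at(200).utc)
store.create("RE3", Time.at(300).utc)

# First run: process two resources, then simulate an interruption.
processed = []
saved_cursor = nil
enumerate_by_created_at(store).each do |resource, cursor|
  processed << resource.sid
  saved_cursor = cursor
  break if processed.size == 2
end

# A new resource is created while the task is paused.
store.create("RE4", Time.at(400).utc)

# Second run: resume from the saved cursor; the new resource is picked up too.
enumerate_by_created_at(store, cursor: saved_cursor).each do |resource, _cursor|
  processed << resource.sid
end
```

Note that a strictly-greater-than timestamp cursor like this one assumes timestamps are unique; if two resources can share a timestamp, the cursor would need a tiebreaker (for example, the resource identifier) to avoid skipping items.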

This strikes me as a good use case for building a custom enumerator, based on:

  • MaintenanceTasks having no knowledge of the resource type, or how to query it
  • MaintenanceTasks having no knowledge of how to construct a suitable cursor
  • eagerly dumping the entire collection into an Array being expensive

sambostock avatar Jan 21 '21 02:01 sambostock

@sambostock @rafaelfranca I see a few open PRs that would add support for enumerators. I think https://github.com/Shopify/maintenance_tasks/pull/326 is especially close to what I would need, but I also notice they haven't seen activity in over 2 years.

Is there still an appetite to tackle this?

lavoiesl avatar Aug 11 '23 21:08 lavoiesl