maintenance_tasks Add support for dynamic collections

Context

Currently, the framework supports two/three types of collections:

Instances of Array (delegating to enumerator_builder.build_array_enumerator(collection, cursor: cursor))
Instances of ActiveRecord::Relation (delegating to enumerator_builder.active_record_on_records(collection, cursor: cursor))
CSVs (WIP as far as I can tell) (probably using JobIteration::CSVEnumerator)

This presents problems for use cases such as:

Tasks which need to process external resources (requiring custom enumeration)
Tasks which need to process active records in batches

Proposed Solution

This PR adds support for the collection method to return an object responding to .call(cursor:), which when called should return a JobIteration style enumerator (successively yielding a pair containing an item to process, and a cursor identifying it).

This alternative was chosen, as it is non-breaking, and provides a "we'll handle the simple cases, you handle the advanced ones" approach for consumers. Of course, feedback is welcome!

Alternatives Considered

Return an `Enumerator` from `Task#collection` in special cases

While we could check if the result of collection is an Enumerator already, and forward it to just return it from TaskJob#build_enumerator, we have no way of injecting the cursor, meaning the enumerator would have no way to resume from a certain spot.

We could do something like optionally pass cursor: to collection, but that would require checking its parameters and would result in a weird API.

Require `collection` to always return an `Enumerator`

This would be a breaking change to the API, and would complicate the simple use cases (relations & arrays), since currently they don't need to know anything about cursors.

Check another method first

We could first check if CustomTask#enumerator exists, and if so, call it with cursor:, falling back to the current collection call otherwise, but this would mean the methods defined by Tasks would differ from each other. That said, perhaps this makes sense if we introduce the concept of a StaticTask and a DynamicTask, and move the enumerator creation into the task itself. 🤔

For Reviewers

I have split the PR into three commits with detailed commit messages. In short though:

Add Maintenance::DynamicTask dummy task for tests: stands alone to isolate unrelated changes
Add support for dynamic collection: this is where the feature is actually added
Add dynamic collection parameter introspection: this adds some more rigorous checking of the arguments accepted by the "callable", so we can produce a better error message if it doesn't accept a cursor: keyword argument. I'm not sure this is required, so I left it on its own so it can easily be dropped depending on reviewer feedback 👌

⚠️ Before Merging

[ ] Rebase on latest main
[ ] Squash fixup commits (and maybe others)
[ ] Document feature in README
[ ] Address TODO comments

Related to #207

Jan 20 '21 22:01 sambostock

The name is not working for me, what is more dynamic about this collection than others? It's just a task based on enumerator/cursor rather than the simple case of collection. Actually now that I think of it, that's what bothers me about this, we shouldn't make it about a collection being dynamic, but more about a task that's based on an enumerator instead of a collection.

So from the alternatives mentioned, I think "Check another method first" is the one that makes the most sense to me.

The current solution feels kind of hijacking the collection method, to actually return a callable instead of a collection, and then we have a some amount of code to handle that case: does it respond to call? is it directly a method/proc or just an object? and what are his parameters?

I don't think we should be doing all of this. Using another message would solve all of this: does the task respond to enumerator, if it does we're good, we just send it the cursor, and pass that back to job iteration.

There's another argument that Rafael makes as to whether we want this entirely or not. Most of the time users shouldn't have to deal with a cursor, and we can solve specific features like batches by providing an API. But I think it makes sense for other things like APIs where you have a since, after or before cursor, that can't work without passing a cursor to the task.

Jan 21 '21 16:01 etiennebarrie

Thanks for the reviews @adrianna-chang-shopify & @etiennebarrie!

Could we also add a section to the README for this feature? I would document it at the bottom in its own section so we can leave the focus on the simple use cases like Rafael mentioned.

💯 I've added it to my "Before Merging" checklist, but I'll hold off on writing it until we settle on the implementation/API.

I'm not a huge fan of the parameter introspection,

Yeah, I'm not either. 😅 However, the reason I didn't want to rescue ArgumentError and re-raise with a custom error message is that erroneously replacing an unrelated ArgumentError with a misleading error message would be even worse than an unhelpful one.

Removing the check or switching to the def enumerator(cursor:) would address the complexity increase, so I'll do one of those two. 👍

does the task respond to enumerator, if it does we're good, we just send it the cursor, and pass that back to job iteration.

This approach does come with one trade off: the API is slightly blurred. Consumers are now expected to implement exactly one of two different API methods. How do we make this clear to the consumer and provide useful feedback if they implement it incorrectly?

For example, we document the collection method

# Placeholder method to raise in case a subclass fails to implement the
# expected instance method.
#
# @raise [NotImplementedError] with a message advising subclasses to
#   implement an override for this method.
def collection
  raise NotImplementedError,
    "#{self.class.name} must implement `collection`."
end

if we document Task#enumerator in the same way it breaks task.respond_to?(:enumerator). If we don't document it, then we haven't documented it. We also don't really have a way to make them mutually exclusive, though I don't know if that's an issue.

One possible alternative would be to move the enumerator building from TaskJob into this new enumerator method

class TaskJob
  attr_reader :cursor
  def build_enumerator(_run, cursor:)
    @cursor = cursor || @run.cursor
    @task.enumerator(job: self) # delegate, instead of building enumerator here
  end
end

class Task
  # Documentation for special case consumers would go here
  def enumerator(job:)
    collection = self.collection
    
    case collection
    when ActiveRecord::Relation
      job.enumerator_builder.active_record_on_records(collection, cursor: job.cursor)
    when Array
      job.enumerator_builder.build_array_enumerator(collection, cursor: job.cursor)
    else
      raise ArgumentError, "#{self.class.name}#collection must be either "\
        'an Active Record Relation, or an Array'
    end
  end
  
  # collection method unchanged
end

# Backwards compatible with existing API
class MySimpleTask < Task
  def collection() = Model.all
  def count() = Model.count
  def process(model) = puts "Processed #{model}"
end

# Supports customization
class MyCustomTask < Task
  def enumerator(job:) = SomeCustomEnumerator.new('whatever', cursor: job.cursor)
  def process(item) = puts "Processed custom enumerated #{item}"
end

This is slightly different to the discussion started in https://github.com/Shopify/maintenance_tasks/pull/307#discussion_r561364421, but @rafaelfranca's comments about revealing implementation details still may still apply. That said, it

simplifies the method calls (no need for .respond_to?)
is entirely backwards compatible
reveals no complexity to consumers with simple use cases

Most of the time users shouldn't have to deal with a cursor, and we can solve specific features like batches by providing an API.

Yeah, I agree with you and @rafaelfranca that we can provide batch functionality via a simple API, instead of requiring the consumer to build the custom enumerator. 👍 I'll consider that out of scope for this PR.

name is not working for me

Yeah, names are hard... 😅 I chose the name because it was the best I could come up with to describe collection returning a "callable" instead of directly returning the collection. Maybe that alone is a giveaway that the API isn't ideal.

Glad to rename to something else, especially if we change the approach.

Jan 21 '21 19:01 sambostock

@adrianna-chang-shopify another alternative I considered was to simply have two different classes one could inherit from, which have different APIs: one has enumerator "not implemented", and the other implements it and introduces "collection". This might be awkward though, since we'd be inserting a class into the existing hierarchy, and might have to do weird stuff to accommodate DescendantsTracker. Nonetheless, we could do something like this

class EnumeratorTask
  def enumerator(context:) = raise NotImplementedError
  def count = nil
  def process(item) = raise NotImplementedError
end

class CollectionTask < EnumeratorTask
  def enumerator(context:)
    # ...build enumerator around collection
  end
  
  def collection = raise NotImplementedError
end
Task = CollectionTask

Alternatively, we could stick to the two single methods, and implement Task.method_added to track the addition of one method and raise if the other is added too (similarly to how JobIteration prevents perform from being redefined).

Jan 22 '21 15:01 sambostock

Actually @sambostock, what do you think of having custom enumerator functionality coming in the form of a Module? This is what we're doing for Tasks that need to process CSVs (we opted not to use inheritance for the exact reason of avoiding extra complexity with DescendantsTracker).

(See https://github.com/Shopify/maintenance_tasks/blob/7131371ad6e7cf0604ad54f27b6a01b7914dea56/app/tasks/maintenance_tasks/csv_task.rb)

Something like this:

module EnumeratorTask
  def enumerator(cursor:) = raise NotImplementedError # Assuming we're able to drop use of context and just use cursor
  def collection = nil
end

class MyCustomTask < Task
 include EnumeratorTask
  def enumerator(cursor:) = SomeCustomEnumerator.new('whatever', cursor: cursor)
  def process(item) = puts "Processed custom enumerated #{item}"
end

Jan 22 '21 16:01 adrianna-chang-shopify

@adrianna-chang-shopify how do you see that being used compared to a standard Task within the framework? Would we keep the logic in TaskJob?

def build_enumerator(_run, cursor:)
  cursor ||= @run.cursor
  return @task.enumerator(cursor: cursor) if @task.is_a?(EnumeratorTask)

  collection = @task.collection
  case collection
  # ...
end

I worry that this feels a little like respond_to?(:enumerator), but with extra steps... 😅

Jan 22 '21 16:01 sambostock

@sambostock I am very much in favour of your suggestion to move enumerator (and the logic to build enumerators using JobIteration's API) to Task!

So my proposition of using a mixin wouldn't really change anything with your most recently proposed solution, it would just be a way to define the API more clearly for Enumerator Tasks, and a way to ensure that these Tasks are adhering to defining enumerator properly.

# task_job.rb (No changes)
def build_enumerator(_run, cursor:)
  @task.enumerator(cursor: cursor)
end

# task.rb (No changes)
def enumerator(cursor:)
    collection = self.collection

    case collection
    when ActiveRecord::Relation
      JobIteration.enumerator_builder.active_record_on_records(
        collection,
        cursor: cursor,
      )
    when Array
      JobIteration.enumerator_builder.build_array_enumerator(
        collection,
        cursor: cursor,
      )
    when CSV
      JobIteration::CsvEnumerator.new(collection).rows(cursor: cursor)
    else
      raise ArgumentError, "#{self.class.name}#collection must be either "\
        'an Active Record Relation, an Array, or a CSV.'
    end
end

# enumerator_task.rb
module EnumeratorTask
  def enumerator(cursor:)
   # Tasks that include this Module must define their own #enumerator
    raise NotImplementedError, "#{self.class.name} must implement `enumerator(cursor:)`." 
  end
  def collection
    nil
  end
end

# my_custom_task.rb
class MyCustomTask < Task
 include EnumeratorTask
  def enumerator(cursor:)
    SomeCustomEnumerator.new('whatever', cursor: cursor)
  end

  def process(item)
    puts "Processed custom enumerated #{item}"
  end
end

And then:

#another_custom_task.rb
class AnotherCustomTask < Task
 include EnumeratorTask
  def process(item)
    puts "Processed custom enumerated #{item}"
  end
end

AnotherCustomTask.new.enumerator(cursor: 1)
> NotImplementedError: MaintenanceTasks::MyCustomTask must implement `enumerator(cursor:)`.

So, any Tasks that includes EnumeratorTask would need to provide their own definition for enumerator, but any other type of Task that doesn't include the mixin would just delegate enumerator to the method in Task.

What do you think?

Jan 22 '21 17:01 adrianna-chang-shopify

Seems like a stale version of https://github.com/Shopify/maintenance_tasks/pull/944 now, so I am going to close this for now.

Jan 19 '24 20:01 gmcgibbon

Add support for dynamic collections

Context

Proposed Solution

Alternatives Considered

Return an Enumerator from Task#collection in special cases

Require collection to always return an Enumerator

Check another method first

For Reviewers

⚠️ Before Merging

Return an `Enumerator` from `Task#collection` in special cases

Require `collection` to always return an `Enumerator`