good_job icon indicating copy to clipboard operation
good_job copied to clipboard

Support for Batches?

Open mollerhoj opened this issue 2 years ago • 21 comments

Considering migrationg from Sidekiq Pro: Does this gem support some equivalent of sidekiq batches?

Thats is, can I set a job to run once a particular set of other jobs have finished?

mollerhoj avatar Oct 10 '21 13:10 mollerhoj

@mollerhoj thanks for asking! There is not currently built-in batching functionality in GoodJob.

Can you give some examples of how you're using batching? It's likely possible to orchestrate it outside of GoodJob in your application. And if there is broad interest/application, I'd consider adding the functionality to GoodJob.

Also, I'm curious generally about why you're considering migrating from Sidekiq Pro.

bensheldon avatar Oct 10 '21 13:10 bensheldon

Sure. We're extracting data from a REST API where every resource in a collection (of say 20.000 items) has it's own endpoint. The pipeline is as follows:

    1. Fetch a list of items
    1. Send a request to each items endpoint.
    1. Register that the job is done.

This is a common and embarrassingly parallel task thats completely IO bound, and thus much faster to do with many jobs rather than a single one.

Our main issue with sidekiq is the reliance on redis - It's much simpler to keep everything in postgres rather than having to sync the knowledge (of what job has run/is running/is connected to a particular model etc) of our job system db and our CRUD db manually.

mollerhoj avatar Oct 10 '21 16:10 mollerhoj

That's helpful. I'm wondering if this sort of interface would be sufficient?

class FanOutJob < ApplicationJob
  include GoodJob::ActiveJobExtensions::Batch

  after_batch do |id:, key:|
    id  # => a UUID; all the jobs enqueued in the same batch will have the same UUID
    key # => a String that is entirely within your control to set data, serialize/etc. 
        #    to pass some data/state to the completion of the batched set of jobs

    CrudModel.find_by(key).do_something_after_the_batch_finished
  end

  def perform(*args)
    # do the work
  end
end

batch_job_args = [args_for_job_1, args_for_job_2, args_for_job_3]
FanOutJob.perform_later_in_batch(batch_job_args, key: the_crud_model.id) # => returns the batch_id, maybe?

In the implementation I'm imagining, the only state that would be passed forward to the batch completion would be an arbitrary key (the String), and a UUID (which would be used by the backend to trigger the after_batch callback once all the jobs with the same UUID had completed).

Some questions I have:

  • How would you handle errors? e.g. if one of the jobs errors, should that job/error be passed to the after_batch handler? Should successful jobs be passed too? Is the ActiveJob instance objects enough (they won't have a record of the instance objects), or would you want the GoodJob::Execution itself (which is kinda-sorta a private API).
  • Is a single "after every job in the batch finishes/errors" enough? Are there other lifecycle events that would be required?

Implementation: I'm imagining the most naive implementation in which batch_id and batch_key are added as columns to the good_jobs table of Executions. If there is a batch_id set, after the job finishes (e.g. in the same thread as the job), the extension would check whether all the other jobs had finished, and if so, trigger the callback. e.g. the very last job in the batch would be the one that then triggered the execution of the after_batch callback.

bensheldon avatar Oct 11 '21 01:10 bensheldon

Above looks like a fine start. I suggest we see if this could be implemented without monkey patching (avoiding perform_later_in_batch) above.

Batches is a very general feature in the sense that jobs can either run sequentially or parallel. A batch feature would allow use to create complex workflow to switch between those two modes. For an introduction to the topic, I'd read: https://github.com/mperham/sidekiq/wiki/Batches, https://github.com/mperham/sidekiq/wiki/Complex-Job-Workflows-with-Batches and https://github.com/mperham/sidekiq/wiki/Really-Complex-Workflows-with-Batches. For a guide to what problems can arise when working with batches.

In my experience, Sidekiq Batches are not the easiest to work with. We should give the API some thought before implementation. In particular, I would like a solution with a:

Workflow API

A nice, natural api for defining workflows (in sidekiq it requires a lot of boilerplate). A lot of extensions try to solve this (https://github.com/mperham/sidekiq/wiki/Related-Projects#execution-ordering) but an official solution would be a lot better.

Workflows are DAGs. In the python community the library prefect (a successor to Airflow) seems to a have nice solution. They allow both specifying the DAG in a functional and an imperative manner: https://docs.prefect.io/core/concepts/flows.html#functional-api.

Support for Testing

A solution for testing full workflows. While individual jobs can be tested, it's close to impossible to test that everything is wired up correctly. (https://github.com/mperham/sidekiq/issues/2700)

Proposed API

I think your "naive" approach above looks good. I'd suggest we implement manually it without any actual extensions to goodjob. Then we can see after what we might want to add to GoodJob for convenience / performance reasons.

With regards to exception handling, I think that logic can be handled by the user. And I don't think more life cycle events are needed, they will only complicate things.

I'm not very familiar with how GoodJobs store jobs in postgres, so I might have guessed wrong below:

Here is a proposed implementation that requires no extensions to GoodJob (we can build from there?)

class FanOutJob < ApplicationJob
  def perform(batch_id)
     @batch_id = batch_id
  
     # ... Perform work here, we can even add more jobs to the batch dynamically with:
     # `ChildJob.perform_later(batch_id: @batch_id)`
     
     finalize_batch if final_job?
  end
  
  def final_job?
    batch_jobs.size == 1
  end
  
  def batch_jobs
     FanOutJob.where(serialized_params: { batch_id: @batch_id})
  end
  
  def finalize_batch
     if batch_jobs.where(failed: true).present?
        finalize_failed_batch if defined?(finalize_failed_batch)
     else
        finalize_successful_batch if defined?(finalize_successful_batch)
     end
  end
end

Batch.new do 
  ChildJob.perform_later
  ChildJob.perform_later
  ChildJob.perform_later
end

The methods in the Job above could be extracted to a include GoodJob::ActiveJobExtensions::Batch module, where perform was also overriden to wrap the setting of batch_id and finalize_batch call. Thus you could do something like:

class FanOutJob < ApplicationJob
  include GoodJob::Batch

  def perform
     # ... Perform work here, we can even add more jobs to the batch dynamically with:
     # `ChildJob.perform_later(batch_id: @batch_id)`
  end
  
  def finalize_successful_batch
    puts "success!"
  end
end

My dream would then be to also have some sort of Workflow DSL to specify complex workflow:

Workflow.new do
  run JobA
  run JobB
  run JobC
  run JobD, after: [JobA, JobB]
  after [JobC, JobB] do
     run JobE
  end
end

An a way of testing the workflow

Workflow.run
expect(JobA).to have_run.before(JobD)
expect(JobB).to have_run.before(JobD)
expect(JobD).to have_run.after([JobA, JobB])

expect(JobA).to have_run.before(JobE)
expect(JobB).to have_run.before(JobE)
expect(JobC).to have_run.before(JobE)
expect(JobD).to have_run.before(JobE)
expect(JobE).to have_run.after([JobA, JobB, JobC, JobD])

I'd be happy to make a PR for something like this if you approve of the API

mollerhoj avatar Oct 11 '21 08:10 mollerhoj

@mollerhoj Thanks for collaborating on this! I am excited to get something working here.

There's a lot of stuff to cover, so I want to focus on what would make this feature a good fit for GoodJob, and what would need to change if GoodJob were to support the functionality:

  • GoodJob enhancement: Transactional batch enqueue. All the job records need to hit the database at the same time. Otherwise, one job might be enqueued, worked, and the batch finalized before the other jobs that are intended to be part of the batch hit the database.
  • Architectural question: where is the "there" that inter-batch steps take place (i.e. detect that all the jobs in a batch step have finished, and trigger the next step)? From my perspective as GoodJob maintainer, those can't be jobs themselves for a feature of GoodJob (jobs live in application-land). Also It's going to be racy operation, which is why I think it will be necessary to hook into GoodJob more deeply (e.g. it likely will need to be wrapped with an advisory lock in a similar way to the Concurrency extension).
  • Philosophy. Extending ActiveJob. GoodJob is an ActiveJob backend. I would prefer not to introduce other objects into application-land. The GoodJob::ActiveJobExtensions::Concurency is the template/mechanism I'd like to preserve for additional extensions. It's a conceptual constraint to think about how to squish it into ActiveJob and I think that's a good thing.
  • Product. I want to solve your problem asap, while not constraining future problem-solving. Hence something simple

bensheldon avatar Oct 11 '21 14:10 bensheldon

@mollerhoj Thanks for collaborating on this! I am excited to get something working here.

There's a lot of stuff to cover, so I want to focus on what would make this feature a good fit for GoodJob, and what would need to change if GoodJob were to support the functionality:

  • GoodJob enhancement: Transactional batch enqueue. All the job records need to hit the database at the same time. Otherwise, one job might be enqueued, worked, and the batch finalized before the other jobs that are intended to be part of the batch hit the database.

Right, I guess the best would be to have some kind of a staging area in the enqueue with jobs that are scheduled but do not get run + Some mechanism for moving jobs with a specific batch_id from staging and into the actual queue. ActiveJob already has the Job.set(wait_until: 10.years.from_now), so perhaps Job.set(batch: some_batch) is the way to go?

  • Architectural question: where is the "there" that inter-batch steps take place (i.e. detect that all the jobs in a batch step have finished, and trigger the next step)? From my perspective as GoodJob maintainer, those can't be jobs themselves for a feature of GoodJob (jobs live in application-land). Also It's going to be racy operation, which is why I think it will be necessary to hook into GoodJob more deeply (e.g. it likely will need to be wrapped with an advisory lock in a similar way to the Concurrency extension).

Ah yes, just had a look at that extension. Nice to see that we already have tools for locking etc. Reusing the lockable concern should make things easier.

  • Philosophy. Extending ActiveJob. GoodJob is an ActiveJob backend. I would prefer not to introduce other objects into application-land. The GoodJob::ActiveJobExtensions::Concurency is the template/mechanism I'd like to preserve for additional extensions. It's a conceptual constraint to think about how to squish it into ActiveJob and I think that's a good thing.

Hm, let's see how this turns out. I'd suggest we aim for an API that is nice enough the rails might consider upstreaming - but it's not clear to me if thats possible without adding an additional "Batch" class. (OOP tends to get pretty with more classes, not fewer) We'll see.

  • Product. I want to solve your problem asap, while not constraining future problem-solving. Hence something simple Don't worry, this is somewhat into the future - we use sidekiq for now...

mollerhoj avatar Oct 12 '21 07:10 mollerhoj

Oh an one more thing: https://github.com/breamware/sidekiq-batch is an open source implementation of batches - reading it might help us deal with the concurrency issues

mollerhoj avatar Oct 12 '21 07:10 mollerhoj

@mollerhoj Thanks for sharing those resources. I, unfortunately, am not available to do speculative interface building; GoodJob already has quite a roadmap. I also don't want to waste your time by having you architect something that is not a good fit for GoodJob.

If you would like to address the blocker that is preventing you from adopting GoodJob, I can make time to work on that. This is why I want to focus directly on your use case.

You described what I interpreted as a fan-out of jobs and a trigger when the jobs have finished. If that meets your needs, let's focus on that.

When the trigger happens, what data/arguments will the trigger need to pass through to the code that is called after the batch of jobs finishes:

  • what state/data will need to have been passed through from the initial batch construction? (i.e. "a batch of what?")
  • what status from the run jobs do you need? (e.g. counts, errors, the jobs themselves, etc.)

bensheldon avatar Oct 12 '21 15:10 bensheldon

I would just need some way of knowing what batch just finished, e.g. a batch_id. In my specific case, My customers can have a number of Shoppers, each using a FanOutJob to fetch items. I would need some way of figuring out which shopper just finished fetching items. I would also need which items failed to be fetched, and what exceptions caused the fetching to fail.

In general, I would need the ability for a job to spawn other jobs, and a trigger to know when all the spawned jobs have finished. This will allow me to build pipelines of jobs the leverages concurrency. I've linked to the resources above not to burned you, but to show you that this is a very general feature that most job managing systems eventually try to tackle.

mollerhoj avatar Oct 12 '21 17:10 mollerhoj

@mollerhoj Thanks! Just want to keep things moving forward.

I put together a proof-of-concept for an interface. Please let me know what you think. The code is also fully runnable too if you want to download the file ($ ruby batch_jobs.rb):

https://gist.github.com/bensheldon/5d3add207b7c5e689682c29ebbc506ab#file-batch_jobs-rb-L73-L97

The PoC objects is two jobs .

  • PrepareProduceJob which preps an individual kind of produce and has defined a :make_fruit_salad kind of batch
  • MakeSaladJob which runs after a batch of produce has been prepared with the :make_fruit_salad key

The implementation is nasty, ignore it. Really I'd love the feedback on the interface and whether it would meet your needs.

bensheldon avatar Oct 13 '21 03:10 bensheldon

Woah, that was fast, good going! Except for adding more params (a below), I believe this interface might solve the job I'm currently looking at. However, I need to implement more workflows (we pull data from quite a few places). Here are some things that this interface lack:

  • (a) The ability to have more than one param per job.
  • (b) The ability to mix job types in a single batch.
  • (c) The ability to add jobs to a batch batch on some dynamic logic (i.e. add jobs to a batch in different places in the code).
  • (d) The ability to add jobs to a batch while its running.

Having all the params for jobs in a list is also quite cumbersome. For our case, we'd have a list of say, 10000 lists of parameters. Adding parameters directly to jobs in the code is a more natural api. See my proposal for an API below. I don't think it requires many changes in terms of the actual implementation.

### --- PROPOSED INTERFACE BELOW -- ###

class PrepareProduceWithTool < ApplicationJob
  def perform(produce, tool)
    if produce.in? %w[apple banana grape]
      @current_batch << PrepareProduce.later(produce) # (d)
    else
      puts "Preparing #{produce} using #{tool}"
    end
  end
end

class PrepareProduce < ApplicationJob
  def perform(produce)
    if produce.in? %w[apple banana grape]
      puts "Preparing #{produce}"
    else
      raise "Ewww, not a #{produce}!"
    end
  end
end

class MakeSalad < ApplicationJob
  def perform(produces)
    puts "Making salad with #{produces.join(', ')}"
  end
end

# You seem to not want to add a new class. Here is a proposed interface that allows for the above without adding a class:
jobs = []
jobs << { klass: MakeSalad, params: ['apple']}
jobs << { klass: MakeSalad, params: ['shoe']}
jobs << { klass: MakeSaladWithTool, params: ['coconut', 'hammer']} # (a) (b)
jobs << { klass: MakeSalad, params: ['banana']} if rand(2) == 0 # (c)


ApplicationJob.perform_later_in_batch(jobs) do
  prepared_fruits = jobs.reject { |job| job.error_message.present? }.map { |job| job.arguments.first }
  MakeSalad.perform_later(prepared_fruits)

  discarded_fruits = jobs.select { |job| job.error_message.present? }.map { |job| job.arguments.first }
  puts "Discarding #{discarded_fruits.join(', ')}"
end

puts "Enqueued #{jobs.count} jobs with batch_id: #{jobs.first.batch_id}"
sleep 2 # wait for batch to finish in the background

# Here is an interface that uses a Batch class:

my_batch = GoodJob::Batch.new
my_batch.add_job(MakeSalad, ["apple"])
my_batch.add_job(MakeSalad, ["shoe"])
my_batch.add_job(MakeSaladWithTool, %w[coconut hammer]) # (a) (b)
my_batch.add_job(MakeSalad, ["banana"]) if rand(2) == 0 # (c)
my_batch.start do |jobs|
  prepared_fruits = jobs.reject { |job| job.error_message.present? }.map { |job| job.arguments.first }
  MakeSalad.perform_later(prepared_fruits)
  discarded_fruits = jobs.select { |job| job.error_message.present? }.map { |job| job.arguments.first }
  puts "Discarding #{discarded_fruits.join(', ')}"
end

mollerhoj avatar Oct 13 '21 07:10 mollerhoj

I updated the gist: https://gist.github.com/bensheldon/5d3add207b7c5e689682c29ebbc506ab#file-batch_jobs-rb-L98-L102

(a) The ability to have more than one param per job.

I updated the example to show this is possible.

(b) The ability to mix job types in a single batch.

I updated the signature of ActiveJob::Base.perform_later_in_batch to take args or ActiveJob instances.

(c) The ability to add jobs to a batch batch on some dynamic logic (i.e. add jobs to a batch in different places in the code).

Can you say more about this? Are you wanting to add jobs to a single batch from different processes/threads (e.g. they need to be stored in the database/interprocess but not started) or is passing around an in-memory Array and shoveling objects/args into it good enough?

(d) The ability to add jobs to a batch while its running.

What should happen if the batch has already finished? Should enqueue raise an exception, or is it ok if the after-batch handler is called multiple times?

This is my ideal imagining of what is possible, but I'm trying to validate that it would work for you. Figure 1 is how I'm imagining. Figure 2 is what I'd like to avoid (adding jobs to an existing batch; calling the after-batch handler multiple times). Figure 3 is a recognition that there will likely be a higher-level object eventually "Workflow" that can tie multiple batches together (but let's shove that off until the future).

Untitled drawing (1)

I'm really pushing here for:

  • a batch has to be started from in-memory objects (e.g. collect them up in an array, then enqueue)
  • a batch can't be added-to once the batch is enqueued
  • and therefore, the after-batch handler will only be called once

I am pushing for this because it most easily fits with the existing GoodJob architecture:

  • it doesn't require adding another kind of object/table in the database to track batch state (e.g. started/finished)
  • it doesn't require changing any of the job-scheduling logic to make a stored-but-unstartable job

Obviously if it doesn't work for you, let's not do it the most easy/simple way.

One more set of thoughts/questions:

I notice you leaning into this object of a Batch object. I'm a little confused by the lifecycle of this object, which is why I keep pushing back against it.

In my mind, the after-batch handler/block has to be a static set of Ruby instructions (e.g. attached to a class variable), because I'm imagining the code being triggered and run potentially hours/days later and in a different scope/thread/process/server.

But I notice that you want to have that handler dynamically live on an instance in the current context. Are you imagining that it would block the current execution until the batch finishes? Otherwise I don't see how that scope/code would be preserved.

bensheldon avatar Oct 13 '21 15:10 bensheldon

I updated the gist: https://gist.github.com/bensheldon/5d3add207b7c5e689682c29ebbc506ab#file-batch_jobs-rb-L98-L102

(a) The ability to have more than one param per job.

I updated the example to show this is possible.

(b) The ability to mix job types in a single batch.

I updated the signature of ActiveJob::Base.perform_later_in_batch to take args or ActiveJob instances.

I'd suggest simplifying the interface by only taking ActiveJob instances, not free flowing params.

Also, now that batch can mix jobs, it does not seem to make sense to define the after_batch callback inside a specific Job class, does it? (This is one of the reasons why a batch object makes sense*).

(c) The ability to add jobs to a batch batch on some dynamic logic (i.e. add jobs to a batch in different places in the code).

Can you say more about this? Are you wanting to add jobs to a single batch from different processes/threads (e.g. they need to be stored in the database/interprocess but not started) or is passing around an in-memory Array and shoveling objects/args into it good enough?

In-memory is good enough, so fine as it is.

(d) The ability to add jobs to a batch while its running.

What should happen if the batch has already finished? Should enqueue raise an exception, or is it ok if the after-batch handler is called multiple times?

As shown in my example, jobs can only be pushed into the currently running batch from a job in the batch. Thus the batch cannot have finished, as a job in it is still running. This is a very common thing to do, as you don't know how many jobs a batch might have to complete - Other job managers eventually run in to this (something called dynamic graph building).

This is my ideal imagining of what is possible, but I'm trying to validate that it would work for you. Figure 1 is how I'm imagining. Figure 2 is what I'd like to avoid (adding jobs to an existing batch; calling the after-batch handler multiple times). Figure 3 is a recognition that there will likely be a higher-level object eventually "Workflow" that can tie multiple batches together (but let's shove that off until the future).

Untitled drawing (1)

I'm really pushing here for:

  • a batch has to be started from in-memory objects (e.g. collect them up in an array, then enqueue)
  • a batch can't be added-to once the batch is enqueued
  • and therefore, the after-batch handler will only be called once

I am pushing for this because it most easily fits with the existing GoodJob architecture:

  • it doesn't require adding another kind of object/table in the database to track batch state (e.g. started/finished)
  • it doesn't require changing any of the job-scheduling logic to make a stored-but-unstartable job

Obviously if it doesn't work for you, let's not do it the most easy/simple way.

This all sounds reasonable. Though adding more jobs to a batch from a running job of that batch should be doable without adding any complex state etc.

One more set of thoughts/questions:

I notice you leaning into this object of a Batch object. I'm a little confused by the lifecycle of this object, which is why I keep pushing back against it.

The batch isn't persisted or has a specific lifecycle. It's simply a convenience object that replaces the array you a using. So the batch object is really just an Array of jobs. Thus:

batch_objects = [["apple", 4], ["shoe", 2]]
batch_objects << PrepareProduce.new( "banana")
batch_objects << PrepareProduceSomeOtherJob.new( "grape", 2)

jobs = PrepareProduce.perform_later_in_batch(:prepare_fruit_salad, batch_objects)

becomes

batch = GoodJob::Batch.new
batch << PrepareProduce.new( "apple", 4)
batch << PrepareProduce.new( "shoe", 2)
batch << PrepareProduce.new( "banana")
batch << PrepareProduceSomeOtherJob.new( "grape", 2)
jobs = PrepareProduce.perform_later_in_batch(:prepare_fruit_salad, batch)
  • Now, as mentioned above, it's strange to have the after_batch callback defined inside the PrepareProduce class, as the batch might have many different jobs running, why would the after_batch be defined inside one of them? It seems to me to be more sensible to define the callback on this new batch object:
class MyBatch < GoodJob::Batch
   after_batch do |jobs|
      #....
   end
end

batch = MyBatch.new
batch << PrepareProduce.new( "apple", 4)
batch << PrepareProduce.new( "shoe", 2)
batch << PrepareProduce.new( "banana")
batch << PrepareProduceSomeOtherJob.new( "grape", 2)
jobs = batch.start

That's the reasoning behind the batch object. Notice that the object is not persisted anywhere, it's simply a way to make the code cleaner.

In my mind, the after-batch handler/block has to be a static set of Ruby instructions (e.g. attached to a class variable), because I'm imagining the code being triggered and run potentially hours/days later and in a different scope/thread/process/server.

But I notice that you want to have that handler dynamically live on an instance in the current context. Are you imagining that it would block the current execution until the batch finishes? Otherwise I don't see how that scope/code would be preserved.

Good points, the callback should be defined in the Batch class, not inline.

mollerhoj avatar Oct 14 '21 11:10 mollerhoj

@mollerhoj I'm revisiting this to do some more scoping before I start development.

I was looking over the Sidekiq Batch API Documentation. I'm curious, with this example, where you define SomeClass in your project? (I'm assuming you're already using Sidekiq Batches)

class SomeClass
  def on_complete(status, options)
    puts "Uh oh, batch has failures" if status.failures != 0
  end
  def on_success(status, options)
    puts "#{options['uid']}'s batch succeeded.  Kudos!"
  end
end
batch = Sidekiq::Batch.new
# this will call "SomeClass.new.on_success"
batch.on(:success, SomeClass, 'uid' => current_user.id)
# You can also use Class#method notation which is like calling "AnotherClass.new.method"
batch.on(:complete, 'AnotherClass#method', 'uid' => current_user.id)

bensheldon avatar Feb 05 '22 01:02 bensheldon

@bensheldon I'm not sure I understand the question correctly. What do you mean by where? In what file? We currently use the concept of an app/models/manager because we need somewhere to store persist info about the jobs in our main db (not redis). They the managers holds the callbacks also.

This setup is probably not worth copying, as GoodJob actually uses postgres to store the jobs.

mollerhoj avatar Feb 05 '22 18:02 mollerhoj

Yep, where the file/class/directory lives. The thing I'm still working out, and trying to become comfortable with, is where the callback/success code lives. It has to be somewhere static/committed.

bensheldon avatar Feb 05 '22 18:02 bensheldon

I'd love to see batch support too, it would allow me to migrate away from Sidekiq Pro! I'd also like to help if I can.

is where the callback/success code lives

For Sidekiq, the idea is that the callback and success code can be on any object anywhere. Maybe it's a model that toggles a Report#generated boolean, or in another job that queues up another batch, or in a service that makes an HTTP call. It might even be a public method on the Job class itself (probably most common).

danielwestendorf avatar Feb 18 '22 16:02 danielwestendorf

Yep, where the file/class/directory lives. The thing I'm still working out, and trying to become comfortable with, is where the callback/success code lives. It has to be somewhere static/committed.

Is there any reason this should be hardcoded? You could just recommend something like app/batches/.. put there's reason to enforce such an api is there?

It might even be a public method on the Job class itself (probably most common).

^ Although this isn't very pretty when a batch contains different types of jobs.

mollerhoj avatar Feb 18 '22 20:02 mollerhoj

Turns out that PHPs laravel framework as a very well designed job queue built into the framework with a comprehensive set of features for batching. Worth checking out: https://laravel.com/docs/9.x/queues

mollerhoj avatar Jul 03 '22 10:07 mollerhoj

@mollerhoj thanks for sharing that! It looks like Laravel serializes the callback closures into the database, and then executes them later. I don't think Ruby has a safe equivalent of that: something like storing a lambda's code as a string in the database and then eval'ing it later.

From my perspective, how/where the callback code is defined/structured is 98% of this feature.

I took a look at sidekiq-batch, and realized that the callback methods are defined on the instance, rather than statically on the class. E.g.

# doing this...
batch.on(:success, SomeClass, 'uid' => current_user.id)

# ...ends up invoking this by Sidekiq
SomeClass.new.on_success(batch_status, 'uid' => 42)

I think defining them as instance methods would imply that these are single-purpose classes, as there is no way to define how the instance is initialized. Does that seem right? This is why I'm curious about how the supporting code/callbacks for batches are organized.

bensheldon avatar Jul 07 '22 00:07 bensheldon

You're welcome! Some thoughts:

  • Completely agree that serializing the callback is no good. Imagine you have a bug - now your bug is serialized to the database.. Or you make an update that renders the serialized code invalid. Tons of issues with that implementation.

  • I agree that the API is important, but maybe considering the callback placement 98% of the effort is to underestimate the Difficulties with batching. Consider this statement from Mike Perham, creator of Sidekiq: "Batches continue to be my favorite feature of all time and the one I am most proud of." (https://www.mikeperham.com/2020/01/08/faktory-enterprise/)

  • sidekiq-batch is supposed to be a drop-in replacement from sidekiq pro's batches, but I think sidekiq pro has since been redesigned as Mike learnt more about the issues that can arise with batches. (I'll elaborate on this below)

  • My experience with Sidekiq Pro has been that it's nice to have a system that I can trust has been shaped by many real world cases, so I get to stand on the shoulders of giants. However, I've had to learn a bunch of things from looking through Issues, the code of sidekiq and generally the github repo. The documentation in the Wiki leaves much to be desired. Sidekiq is sorta like git - you have to learn the underlaying structure to understand it, knowing the API is not enough.

  • One important thing to consider: Callbacks can fail, throw exceptions etc. I can't seem to find the issue in Sidekiq repo, but Mike mentions that this is why he redesigned Callbacks to be jobs themselves. An idea might be to lean into that implementation completely, by requiring a new ActiveJob class for the callback. That would be a really simple interface:

# doing this...
batch.on(:success, SomeJob, 'uid' => current_user.id)

# ...ends up invoking this: app/jobs/some_job.rb
SomeJob.new.perform(batch_status, 'uid' => 42)
  • Not to repeat myself, but it can make sense to have a batch with jobs of different classes:
batch = [
 AJob.new(a),
 BJob.new(b)
]

batch.on(:success, x, 'uid' => current_user.id) # should x be AJob or BJob? 

Thus I believe having the batches involve a "perform" on an entirely new Job Class is a clearer and more minimal API.

mollerhoj avatar Jul 07 '22 07:07 mollerhoj

I have an implementation of Batches I would like feedback on. The full PR is in #712. Here is a link to the proposed Readme documentation: https://github.com/bensheldon/good_job/blob/batches_tdd/README.md#batches

bensheldon avatar Oct 06 '22 15:10 bensheldon