safe-pg-migrations icon indicating copy to clipboard operation
safe-pg-migrations copied to clipboard

Deprecate BatchOver in favor of in_batches(use_ranges: true)

Open maximerety opened this issue 4 months ago • 3 comments

Starting from ActiveRecord 7.1, there's a built-in helper equivalent to what BatchOver does, let's use it instead of maintaining our own implementation forever.

We need to keep BatchOver as long as compatibility with ActiveRecord < 7.1 is maintained though.

See:

  • https://api.rubyonrails.org/classes/ActiveRecord/Batches.html#method-i-in_batches
  • https://github.com/rails/rails/blob/v7.1.0/activerecord/CHANGELOG.md
  • https://github.com/rails/rails/pull/45414
  • https://github.com/rails/rails/commit/620f24782977b8e53e06cf0e2c905a591936e990

If using ActiveRecord 7.1 or later, we would use the recommended built-in method in_batches with the use_ranges: true option, e.g.

User.in_batches(of: 100, use_ranges: true).each { |batch| ... }

Otherwise, we would still use BatchOver as a fallback:

SafePgMigrations::Helpers::BatchOver.new(User, of: 100).each_batch { |batch| ... }

Note that although both helpers are almost equivalent, there are small differences in the queries generated.

With the example code above, and assuming the users tables contains 250 records, we would have with BatchOver:

/* Get batch #1 */
SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT 1
SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT 1 OFFSET 100
SELECT "users".* FROM "users" WHERE "users"."id" >= 1 AND "users"."id" < 101 ORDER BY "users"."id" ASC
/* Do something with result */

/* Get batch #2 */
SELECT "users".* FROM "users" WHERE "users"."id" >= 101 ORDER BY "users"."id" ASC LIMIT 1
SELECT "users".* FROM "users" WHERE "users"."id" >= 101 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 100
SELECT "users".* FROM "users" WHERE "users"."id" >= 101 AND "users"."id" < 201 ORDER BY "users"."id" ASC
/* Do something with result */

/* Get batch #3 */
SELECT "users".* FROM "users" WHERE "users"."id" >= 201 ORDER BY "users"."id" ASC LIMIT 1
SELECT "users".* FROM "users" WHERE "users"."id" >= 201 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 100
SELECT "users".* FROM "users" WHERE "users"."id" >= 201 ORDER BY "users"."id" ASC
/* Do something with result */

/* No more batches */

Whereas in_batches(of: 100, use_ranges: true) would give:

/* Get batch #1 */
SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT 100
SELECT "users".* FROM "users" WHERE "users"."id" <= 100 
/* Do something with result */

/* Get batch #2 */
SELECT "users"."id" FROM "users" WHERE "users"."id" > 100 ORDER BY "users"."id" ASC LIMIT 100 
SELECT "users".* FROM "users" WHERE "users"."id" > 100 AND "users"."id" <= 200
/* Do something with result */

/* Get batch #3 */
SELECT "users"."id" FROM "users" WHERE "users"."id" > 200 ORDER BY "users"."id" ASC LIMIT 100
SELECT "users".* FROM "users" WHERE "users"."id" > 200 AND "users"."id" <= 250
/* Do something with result */

/* No more batches */

It would work exactly the same if passing any ActiveRecord::Relation object in place of the model User, e.g. User.where(condition: 'something'), with the additional condition appearing in the WHERE clause of each query.

For the selection of a range of record ids (id min / id max), BatchOver will generate two queries per batch, returns full records (and not only ids), but retrieves only 2 records. Conversely, in_batches(use_ranges: true) will generate a single query per batch, return only ids (and not full records), but would return the full list of ids instead of the min/max only.

I believe that trade-off is acceptable for our purpose, but that is debatable.

maximerety avatar Feb 22 '24 21:02 maximerety