data-migrate
data-migrate copied to clipboard
Does not work with Rails multiple databases
When I configure my Rails 6.1 app to use multiple databases (via built in functionality), data-migrate doesn't work.
bin/rake data:schema:load
my-project/db/secondary_db_data_schema.rb doesn't exist yet. Run `rake data:migrate` to create it, then try again.
but running rake data:migrate
does not create it.
Is it a known limitation? Do you plan to support multiple databases?
I am facing the same issue, did you find any workaround to make this work?
me too
@lewhit, i forgot, did you have experience with multi-databases in data-migrate? do you have any ideas on what's happening here?
@ilyakatz the gem is trying to update the data_migrations table of the secondary database. the data migration is for the primary database. so the gem seems to be unaware that multiple databases can exist.
We have this same issue. Has anyone found a workaround?
edit: We were able to get this working with an initializer:
require "data_migrate"
module DataMigrate::OnlyPrimaryDatabase
def each_current_configuration(environment, name = nil)
# Ignore secondary database(s)
super(environment, name || "primary")
end
end
DataMigrate::DatabaseTasks.extend DataMigrate::OnlyPrimaryDatabase
DataMigrate.configure do |config|
config.db_configuration = Rails.configuration.database_configuration[Rails.env]["primary"]
end
I have a branch that appears to properly run migrations across all configured connections, in the same way that schema migrations work. I don't have a lot of time right now to test and fix specs, but if anyone else wants to contribute we could get a pull request submitted once this is cleaned up https://github.com/chaunce/data-migrate/tree/multiple_connection_support
I have fixed the specs and created a pull request. Any testing and feedback anyone can offer would be helpful https://github.com/ilyakatz/data-migrate/pull/249
Thanks for working on this issue! I'm testing the version 10.0.0 of the data_migrate
gem, and have some questions.
I created a very simple Rails app for demonstration: https://github.com/martingregoire/data-migrate-multi-db
It has a primary database with a Post
model, and a secondary database with a Comment
model.
Now I'd like to add a data migration that is run on the Post
model, and another one on the Comment
model. So I do this:
$ rails g data_migration DoSomethingWithPosts --database primary
create db/data/20230608080311_do_something_with_posts.rb
$ rails g data_migration DoSomethingWithComments --database secondary
create db/data/20230608080320_do_something_with_comments.rb
Then to run the data migrations:
$ rails db:migrate:with_data
== Data =======================================================================
== 20230608080311 DoSomethingWithPosts: migrating =============================
== 20230608080311 DoSomethingWithPosts: migrated (0.0024s) ====================
== Data =======================================================================
== 20230608080320 DoSomethingWithComments: migrating ==========================
== 20230608080320 DoSomethingWithComments: migrated (0.0018s) =================
== Schema =====================================================================
== 20230608074207 CreatePosts: migrating ======================================
-- create_table(:posts)
-> 0.0010s
== 20230608074207 CreatePosts: migrated (0.0010s) =============================
== Data =======================================================================
== 20230608080311 DoSomethingWithPosts: migrating =============================
== 20230608080311 DoSomethingWithPosts: migrated (0.0004s) ====================
== Data =======================================================================
== 20230608080320 DoSomethingWithComments: migrating ==========================
== 20230608080320 DoSomethingWithComments: migrated (0.0004s) =================
What happens is that
- each data migration is run twice
- a schema migration is run, but on the wrong database
The schema migration creates a posts
table in the secondary database:
$ sqlite3 db/primary.sqlite3 .tables
ar_internal_metadata
data_migrations
schema_migrations
posts
$ sqlite3 db/secondary.sqlite3 .tables
ar_internal_metadata
data_migrations
schema_migrations
comments
posts
In db/data_schema.rb
we see this: DataMigrate::Data.define(version: 20230608080320)
In db/secondary_data_schema.rb
this: DataMigrate::Data.define(version: 20230608080320)
Am I doing something wrong? Do I have to use a different directory for the data migrations for the secondary database?
The creation of the posts
table on the secondary bug looks like a bug to me.
And I would expect that for the primary database only the DoSomethingWithPosts
data migration was run (setting the version to 20230608080311
in db/data_schema.rb
), and that for the secondary database only the DoSomethingWithComments
data migration was run.
hey folks, there was a recent by @chaunce update to support multiple databases https://github.com/ilyakatz/data-migrate/pull/249, does this address your concerns?
It looks like this was tested on version 10.0.0, so it should include the new support for multiple databases.
Regarding your two issues: each data migration is run twice
The test cases added were configured on a mock application using horizontal sharding [https://guides.rubyonrails.org/active_record_multiple_databases.html#horizontal-sharding]. In this case, the data migrations may need to run on all connections. For example, if the User
class has records across connections :shard_0, :shard_1, and :shard_2, the data migration will need to run on each of these.
On the other hand, if you have role-based connections [https://guides.rubyonrails.org/active_record_multiple_databases.html#activating-automatic-role-switching] you may not want the data migrations to run on all connections. For example, if the User
class has a :read connection and a :write connection, you may only want to run the data migration on the :write connection. But connection roles are specified at the AR level, and can even differ from model to model, so it's not inherently clear if or when some connections should be skipped for a data migration.
Since data migrations piggy-back off the scheme migration tasks/code, the implementation I added uses the same configuration, and a data migration that is not built for a specific connection will run on all connections. In the data migration itself, you should be able to control what data and how the data on the current connection is migrated.
I didn't test connection-specific data migrations myself (those created using a --database db_name
flag), we may need some additional code to support this.
a schema migration is run, but on the wrong database
This may be a bug, the quickest way to determine is probably to drop the db, remove the data-migrate
gem, then run the schema migration again. If the behavior is the same, this is what is expected. If the behavior is different with data-migrate
installed then a bug needs to be fixed.
I can look into these issues sometime, but I don't know exactly when..
Thanks @chaunce!
each data migration is run twice
Our use case is neither sharding nor role-based connections; we simply use multiple databases for separation of the storage of records. Imagine the main, important data in the primary database, and some not so important data in the secondary database, which has lower performance or availability.
Do you happen to have an example of how to check for the current connection inside a data migration file?
a schema migration is run, but on the wrong database
Without the data-migrate
gem this does not happen: the schema migrations are only run on the correct database.
I created these schema migrations using the --database
option:
rails g model Post title:string --database primary
rails g model Comment body:string --database secondary
Running rails db:migrate
will create a posts
table in the primary DB and a comments
table in the secondary DB.
After adding the gem, running rails db:migrate:with_data
will create a posts
table in the secondary DB.
Hey everyone, at Gusto, we rely heavily on this gem. As of version 10, things just stopped working completely for us. @wildmaples and I have been debugging and have realized the implementation for multiple databases is...not quite right. We think that you actually don't need each database to maintain it's own data migration versions at all. We just need the single primary db to keep track since data migrations can make changes to all databases already.
All the issues that folks are seeing (eg see below), can probably be addressed in other ways.
bin/rake data:schema:load
my-project/db/secondary_db_data_schema.rb doesn't exist yet. Run `rake data:migrate` to create it, then try again.
At this point we're not sure what the intention of creating a data_migrations
table in every single database is for since there isn't a dedicated data migrations directory for each database. This is why folks are seeing data migrations running multiple times -- it's once for every database you have.
We have two paths forward:
- Give each database its own data migration directory. But I think this concept is overkill. You don't need to segregate your data migrations because it's just a ruby file that executes arbitrary ruby code. And as such, you can make modifications to any database.
- Revert the changes in v10 that gives each database its own
data_migrations
table and go back to only the primary database keeping track of data migrations. We'll just have to fix the original problem that this issue was point out (which should be easy).
We've reached out to @ilyakatz to schedule a Zoom call to figure out what the intention was so we know how to resolve this.
Having a data_migrations
table on every database is necessary for the same reason a schema_migrations
table is on every database. If all your databases match you might not need these, but if they don't you probably will. There are some bugs that I plan to look into, but if we remove data_migrations
from all but one table then this gem won't work for anyone that does not have identical horizontal shading and only ever wants to run data migrations AR records stored in a database where the connection is configured with a :write
role.
Hi folks, I just re-read the comments and I sounds like people use multiple databases for different usecases
- sharding - all databases have the same tables with same structure but data is loaded across different databases
- data separation - databases have different tables with unrelated data
Are there other usecases?
From what I can tell, @chaunce solved for sharding usecase, correct?
I didn't test connection-specific data migrations myself (those created using a --database db_name flag), we may need some additional code to support this.
I can see why this could be a problem as people are expecting to use it the same way as it's used in Rails, but technically it's not a requirement that it this gem has to operate the same way although not ideal.
After adding the gem, running rails db:migrate:with_data will create a posts table in the secondary DB.
this sounds like a big problem
So where do we go from here?
- First step, I'll make a PR that will add a description that will consider version 10.0.x breaking and people should test before upgrading. I'll publish it as a release candidate version in rubygems
- Since I haven't been involved in Rails world for along time, I'm looking for help maintaining this gem. Thanks to @chaunce for working on multi-support, but looks like it would need some more work. I know you mentioned that you may not have much time to work on this. However, @ngan and @wildmaples (from Gusto) have expressed interest in working on this. Since you represent different usecases, it would be great to get your minds together. @chaunce, would you be able to get on a call to chat with us? If so, could you shoot me an email and I'll you to the discussion (my email is my name @ gmail). If you don't think you will have time to work on this, I think it's only fair to let Gusto folks take the lead (which may require a rollback/re-release).
- If anyone else has interests in participating on the call, please reach out.
cc @opti I know you were active more recently, see discussion above
Some feedback on the previous comment:
it sounds like people use multiple databases for different usecases .. Are there other usecases? The only other major strategies I'm aware of is the separation of Reads and Writes, and non-standard schemas.
However, there's also the combination of multiple strategies. I can use horizontal shading for client data (all the same tables across all databases), but also have Application Specific data only one connection, and separate connections for Reads and Writes. In Rails we just name the connections, but there's nothing inherently different about a Read Only connection and a Write connection (at least currently https://edgeguides.rubyonrails.org/active_record_multiple_databases.html#load-balancing-replicas) except that you configure it that way. Connections can be configured at the application level (ApplicationRecord), or for each Active Record model individually, so each model can have its own connection configuration.
For example:
Using the following database.yml
production:
primary:
database: primary
username: root
primary_replica:
database: primary
username: root_readonly
replica: true
extended:
database: extended
username: root
migrations_paths: db/extended
client_a:
database: client_a
username: client_a
migrations_paths:
- "db/migrate"
- "db/clients"
- "db/client_a"
client_a_replica:
database: client_a
username: client_a_readonly
replica: true
client_b:
database: client_a
username: client_a
migrations_paths:
- "db/migrate"
- "db/clients"
- "db/client_b"
client_b_replica:
database: client_a
username: client_a_readonly
replica: true
And [example - maybe non-functional] application code:
class ApplicationRecord < ActiveRecord::Base
self.abstract_class = true
connects_to database: { writing: :primary, reading: :primary_replica } # role specific connections
end
class Clients < ApplicationRecord
end
class ClientDetails
connects_to database: { writing: : extended, reading: : extended }
belongs_to :client
end
class Widgets < ApplicationRecord
belongs_to :client
scope for_client(client) do
ActiveRecord::Base.connected_to(role: :reading, shard: :"client_#{client.name}") do
all
end
end
end
So all clients all exist on : primary
, they are all replicated to : primary_replica
, but : client_a
, : client_a_replica
, :client_b
, and :client_b_replica
also have the clients
table and may have this data replicated to all these as well, all by way of migrations in db/migrate
[default].
The client_details
table only exists on :extended
by way of db/extended
.
The widgets
table exists on : client_a
, : client_a_replica
, :client_b
, and :client_b_replica
by way of db/clients
.
And migrations in db/client_a
and db/client_b
add custom columns to tables on :client_a
and, separately with different custom columns, :client_b
.
Last, if db/migrate
has 30 schema migrations, these can each be run on each individual connection at different times. So db/client_a
can be at v26 and db/client_b
at v30, meaning maybe both different tables and different columns.
This is all fully supported in Rails currently. This might be an uncommon configuration now, but I imagine will become more common in the future.
I can see why this could be a problem as people are expecting to use it the same way as it's used in Rails, but technically it's not a requirement that it this gem has to operate the same way although not ideal.
The way I see it there are two options:
- we follow the patterns of Rails and allow data migrations to run on specific connections, thereby preserving the ability to run data and schema migrations in sequence in anyone's rails application
- we don't, and data migrations run as a separate task
If I want to be able to run rake db:migrate:with_data
then the data migrations need to follow the same patterns as the schema migrations. This is not the only way to migrate the data, but that's how the gem currently works, and so that's how I applied the changes in my pull request.
If we do continue to follow the rails migration patterns, we'll have to account for the complex configuration shown above, and part of that will be a data_migrations
table on each database, among other minor changes.
this sounds like a big problem
I agree, this should be fixed. I can look into it someone soon, but don't have an exact timeline.
@ilyakatz I can join a call sometime, and I'm happy to provide time to updates, assuming in alignment on the end goal. You're the maintainer, so you get to decide where the project goes from here, and I don't want to get in the way of your progress if we're headed in different directions. I'd like to see the gem support migrations the way rails does the schema, but I understand that this may not fit in your vision of how the gem should work. Let me know your thoughts.
@chaunce thanks, could you email me and we'll coordinate with @wildmaples to set up a call
Hi everyone,
I tend to agree with @chaunce on following the Rails approach for migrations to run on specific connections. It might sounds like an overkill, but should be more versatile. We all might have different database setups since Rails allows that, but it doesn't mean that data_migrate should pick only one to support. As long as this gem claims that data migrations act like schema migrations, similar implementation seems reasonable.
However, regardless of the approach, I don't mind some breaking changes as long as there is a clear upgrade path documented somewhere.
P.S. @ilyakatz We have situation similar to Gusto (maybe not that heavily using data migrations though) and as of version 10 data migrations stopped working for us completely. Hence my recent activity with some fixes.
Hi folks, just wanted to provide an update, we had a call with @ngan and @wildmaples. They will help on gem development, particularly for multi-database support and/or rails 7.1 support. I've added them as contributors. Since v10 has introduced issues with many folks, we will need to roll it back. Thanks @chaunce for your contribution, we certainly appreciate your work but seems we'd need another approach to keep it usable for wider audience. This was a good learning experience for sure.
I will yank v10.0.x versions to avoid people picking up versions that are not working