Creating relational tables
One extension that I am looking at working on is some sort of support for many-to-many relationships. Typically you would have a table that contains the primary keys of each of the relevant rows, but in MongoDB you might use an array within a document. So I was thinking about something like this:
my_blog:
posts:
:columns:
- _id: VARCHAR(24)
- title: VARCHAR(150)
:meta:
:table: posts
:extra_props: true
:related:
user_post_likes:
- _id: VARCHAR(24)
- users_liked[]: VARCHAR(24)
user_post_ratings:
- _id: VARCHAR(24)
- ratings[].user_id: VARCHAR(24)
- ratings[].score: INT
So basically it creates two relational tables, and in the first it just reads an array of ObjectIDs or strings from users_liked, whereas the second reads an array of subdocs containing user_id and score attributes. The [] indicates that a new row should be inserted for each item in the array. I don't know much about how the oplog tailer works, but there could be massive problems that I'm overlooking.
Any thoughts or feedback?
(CC @andrewjshults, who has also been thinking about this)
I'm definitely open to the idea of supporting unrolling into relational tables. There are a bunch of tricky questions I don't have good answers, too, though, such as what to use for unique IDs on the unrolled tables, and how you manage updates -- we don't actually see, e.g. $pop operations as such in the oplog, we just see a $set of the entire array. Do you have to in general do a diff of some sort against the existing array to figure out what updates to perform?
I haven't had time to give this a ton of thought, and I think there may not be perfectly clean answers to all of these questions, but I'd be open to something workable if someone has the time to think about it harder.
Do updatable views solve this somehow?
I'm not particularly familiar with the capabilities of PostgreSQL's updatable views, off hand. If someone more up to speed on them wants to propose something, I'd be happy to think about it.
Any further thoughts or update on this?
I'm facing the exact scenario... need to unroll an embedded array into a different table.
Unfortunately, I'm not longer at the place that I was at when I originally started working on this, so my work is quite out of date at this point. There's some work on my fork for unrolling embedded dicts, but as @nelhage mentioned in his earlier comment unrolling arrays and properly supporting keeping them in sync between the two databases will be quite a bit of work (and may be hard to be entirely generalizable).
Yeah I don't need to keep the databases in sync. The mongodb is small enough that I can drop and reimport periodically.
What branch of your fork have you worked on this?
I've also forked the project and added a few things of my own, including relational tables. We don't support oplog tailing, since even without any of the additions, the oplog tailer can't keep up with our traffic. Though a bit out of date with upstream, we use it in production just fine. The branch is here:
https://github.com/dpatti/mosql/commits/fogcreek
It's implemented exactly as I wrote originally, though we use the newer long-form representation of columns, like so:
my_blog:
posts:
:columns:
- _id: VARCHAR(24)
- title: VARCHAR(150)
:meta:
:table: posts
:extra_props: true
:related:
user_post_likes:
- post_id:
:source: _id
:type: VARCHAR(24)
- user_id:
:source: users_liked[]
:type: VARCHAR(24)
user_post_ratings:
- post_id:
:source: _id
:type: VARCHAR(24)
- user_id:
:source: ratings[].user_id
:type: VARCHAR(24)
- score:
:source: ratings[].score
:type: INT
Thanks @dpatti.
Want to make sure I understand this clearly... In your example, users_liked is an array of user ids embedded with the posts collection and this yaml file will create a user_post_likes relational table in pg? (Same goes with post ratings?)
Correct, so specifying users_liked[] as the source does a simple unwind. If you are using subdocs, you can unwind a specific attribute of each of the subdocs in the array, such as ratings[].user_id and ratings[].score. I would not recommend unwinding two different arrays within the same relational table. I don't know if my code can handle that, though I remember thinking about it hard when I wrote it.
Cool. Yeah off the top of my head I don't think I have a case where I'd need to do that.
Thanks a ton!
Hrm, anything I'm missing here? I can't get the embedded array to unwind into a new table.
mongodb:
accounts:
:columns:
- id:
:source: _id
:type: VARCHAR(24)
- name:
:source: name
:type: VARCHAR(100)
- state:
:source: state
:type: VARCHAR(24)
:meta:
:table: accounts
:related:
subscriptions:
- subscription_id:
:source: subscriptions[]._id
:type: INT
- state:
:source: subscriptions[].state
:type: VARCHAR(24)
- product_id:
:source: subscriptions[].product_id
:type: VARCHAR(24)
- account_id:
:source: _id
:type: VARCHAR(24)
cards:
- last_4:
:source: cards[].last_4
:type: INT
- expiration_month:
:source: cards[].expiration_month
:type: INT
- expiration_year:
:source: cards[].expiration_year
:type: INT
- account_id:
:source: _id
:type: VARCHAR(24)
Are you getting an error at all? I tried it with some dummy data and it seemed to work for me. You put subscriptions[]._id as type INT -- you didn't by chance mean VARCHAR(24), did you? If it's still not working, could you paste an example document?
Edit: Here's how I run it:
MOSQL=path/to/mosql/checkout
export BUNDLE_GEMFILE=$MOSQL/Gemfile
export RUBYLIB=$MOSQL/lib
MONGO=mongodb://localhost/
POSTGRES=postgres://localhost/testdb
bundle exec \
$MOSQL/bin/mosql --collections --collections.yml \
--mongo $MONGO \
--sql $POSTGRES \
--reimport \
--skip-tail
Changed type to VARCHAR(24), didn't change the import.
Here's a sample doc: https://gist.github.com/taylorbrooks/15204ce0fd219b6e9c3d
Trying your way of running it now.
I removed the ellipses from your doc, inserted it into mongo, and it exported fine. I got all 3 tables and 1 row in each.
I think I got it working. I deleted all versions of mosql, install from your fork and used your CLI stuff.
I still get this error (from what I can tell it's still correctly imported accounts and subscriptions) :
INFO MoSQL: Creating table 'accounts'...
INFO MoSQL: Creating table 'subscriptions'...
INFO MoSQL: Importing for Mongo DB sso_development...
INFO MoSQL: Importing for sso_development.accounts...
INFO MoSQL: Imported 339 rows into 'accounts' (0.1s, 0.0s SQL)...
INFO MoSQL: Imported 209 rows into 'subscriptions' (0.1s, 0.0s SQL)...
/Users/taylorbrooks/Documents/Rails/mosql/lib/mosql/cli.rb:277:in `add_identity': undefined method `lit' for "extract(epoch from now())":String (NoMethodError)
from /Users/taylorbrooks/Documents/Rails/mosql/lib/mosql/cli.rb:267:in `check_index_creation'
from /Users/taylorbrooks/Documents/Rails/mosql/lib/mosql/cli.rb:219:in `initial_import'
from /Users/taylorbrooks/Documents/Rails/mosql/lib/mosql/cli.rb:148:in `run'
from /Users/taylorbrooks/Documents/Rails/mosql/lib/mosql/cli.rb:16:in `run'
from ../../Rails/mosql/bin/mosql:5:in `<main>'
Oh right. That has something to do with sequel. I know my Gemfile.lock uses sequel (3.44.0), and I know that when I tried updating it one time, it ended up not having the String#lit definition anymore. As far as I can tell, it's still there in master, so I don't know if you want to try forcing the same version I am using in the lock.
Is work still ongoing on this? Would love to see it merged in at some point... unfortunately I'm getting some import errors when using the fork that I don't get from current mosql master.
Unfortunately I don't think it can be merged in, or at least not trivially. I made all of my changes under the assumption that you are not using the oplog tailer. What kind of error are you seeing? It could certainly be something in my code. On Apr 1, 2014 11:32 AM, "Nick Plante" [email protected] wrote:
Is work still ongoing on this? Would love to see it merged in at some point... unfortunately I'm getting some import errors when using the fork that I don't get from current mosql master.
Reply to this email directly or view it on GitHubhttps://github.com/stripe/mosql/issues/11#issuecomment-39219554 .
@dpatti Thanks for your work, but unfortunately, there's definitely something wrong with your code. I tried a basic YAML without any related stuff in it, and it chokes, whereas the stripe mosql gem works perfectly. I'm using sequel 3.44.0 as you recommended, and I'm passing in the --skip-tail option.
Here's my YAML:
ohana-api-smc_development:
locations:
:columns:
- id:
:source: _id
:type: TEXT
- organization_id: TEXT
- accessibility: TEXT
- admin_emails:
:source: admins
:type: TEXT
- description: TEXT
- emails: TEXT
- hours: TEXT
- languages: TEXT
- name: TEXT
- short_desc: TEXT
- transportation: TEXT
- urls: TEXT
- updated_at: TIMESTAMP
- created_at: TIMESTAMP
:meta:
:table: locations
:extra_props: true
Your code results in 3 distinct errors that appear a bunch of times: 1.
WARN MoSQL: Ignoring row (id=): PG::NotNullViolation: ERROR: null value in column "id" violates not-null constraint
DETAIL: Failing row contains (null, null, null, null, null, null, null, null, null, null, null, null, null, null, null).
PG::ProgramLimitExceeded: ERROR: index row size 3568 exceeds maximum 2712 for index "locations_pkey"
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
PG::DatatypeMismatch: ERROR: column "updated_at" is of type timestamp without time zone but expression is of type record
LINE 1: ...","hours":"(24 hours daily)"}]}'), "updated_at" = ('521d32e4...
^
HINT: You will need to rewrite or cast the expression.
Here's a sample entry from the DB:
{
"_id": {
"$oid": "521d32b91974fcdb2b000002"
},
"_slugs": [
"san-mateo-county-slash-silicon-valley-convention-and-visitors-bureau"
],
"accessibility": [
"wheelchair"
],
"address": {
"_id": {
"$oid": "521d32b91974fcdb2b000004"
},
"street": "111 Anza Blvd., Suite 410",
"city": "Burlingame",
"state": "CA",
"zip": "94010"
},
"ask_for": [
"Receptionist"
],
"contacts": [
{
"_id": {
"$oid": "521d32b91974fcdb2b000003"
},
"name": "Anne LeClair, CAE, CDME",
"title": "President, CEO"
}
],
"coordinates": [
-122.3486066,
37.5909749
],
"created_at": {
"$date": 1377645241035
},
"description": "Offers information about the county's attractions and brochures and maps of cities in the county. Assists groups and individuals in booking sleeping rooms and meetings.",
"emails": [
"[email protected]"
],
"faxes": [],
"hours": "Monday-Thursday, 8:30-5; Friday, 8:30-4",
"kind": "other",
"languages": null,
"mail_address": {
"_id": {
"$oid": "521d32b91974fcdb2b000005"
},
"attention": "Convention and Visitors Bureau",
"street": "111 Anza Blvd., Suite 410",
"city": "Burlingame",
"state": "CA",
"zip": "94010"
},
"name": "San Mateo County/Silicon Valley Convention and Visitors Bureau",
"organization_id": {
"$oid": "521d32b81974fcdb2b000001"
},
"phones": [
{
"number": "800 288-4748",
"hours": "(Monday-Thursday, 8:30-5; Friday, 8:30-4)"
}
],
"short_desc": "Offers information about the county's attractions and brochures and maps of cities in the county.",
"transportation": "SAMTRANS stops within 6 blocks.",
"updated_at": {
"$date": 1383857247210
},
"urls": [
"http://www.smccvb.com"
]
}
I hope this helps you troubleshoot. Thanks!
@monfresh Thanks for the details, but would you mind posting them as an issue on my fork? I don't want to keep generating notifications for Mosql contributors for code that isn't going to be merged upstream. That said, I will try and find some time in the next few days to reproduce and fix it.
I'm trying to create relational tables, are there any updates regarding this issue? @dpatti I can't find your fork.
Any help would be really appreciated.
@AboulEinein The fork was eventually moved here: https://github.com/trello/mosql/tree/streamless-fork
Though I think it is still being used at Trello, it is very out of date and probably comes with a lot of caveats. I'm not involved with it anymore, so that's about as much as I can offer.
@dpatti thank you so much for the reply! I'll check it out.
@AboulEinein @monfresh any of you still have @dpatti 's fork?
@laurentrivard I do not. I haven't used this since 2014.
@laurentrivard I do not, unfortunately.
Check AWS Database Migration Service (https://aws.amazon.com/dms/)
@laurentrivard I moved the fork into @trello a few years ago while I worked there, but seeing as how it's not there anymore, they either deleted it or made it private.