mosql icon indicating copy to clipboard operation
mosql copied to clipboard

Creating relational tables

Open dpatti opened this issue 12 years ago • 27 comments

One extension that I am looking at working on is some sort of support for many-to-many relationships. Typically you would have a table that contains the primary keys of each of the relevant rows, but in MongoDB you might use an array within a document. So I was thinking about something like this:

my_blog:
  posts:
    :columns:
      - _id: VARCHAR(24)
      - title: VARCHAR(150)
    :meta:
      :table: posts
      :extra_props: true
    :related:
      user_post_likes:
        - _id: VARCHAR(24)
        - users_liked[]: VARCHAR(24)
      user_post_ratings:
        - _id: VARCHAR(24)
        - ratings[].user_id: VARCHAR(24)
        - ratings[].score: INT

So basically it creates two relational tables, and in the first it just reads an array of ObjectIDs or strings from users_liked, whereas the second reads an array of subdocs containing user_id and score attributes. The [] indicates that a new row should be inserted for each item in the array. I don't know much about how the oplog tailer works, but there could be massive problems that I'm overlooking.

Any thoughts or feedback?

dpatti avatar Feb 26 '13 22:02 dpatti

(CC @andrewjshults, who has also been thinking about this)

I'm definitely open to the idea of supporting unrolling into relational tables. There are a bunch of tricky questions I don't have good answers, too, though, such as what to use for unique IDs on the unrolled tables, and how you manage updates -- we don't actually see, e.g. $pop operations as such in the oplog, we just see a $set of the entire array. Do you have to in general do a diff of some sort against the existing array to figure out what updates to perform?

I haven't had time to give this a ton of thought, and I think there may not be perfectly clean answers to all of these questions, but I'd be open to something workable if someone has the time to think about it harder.

nelhage avatar Mar 06 '13 19:03 nelhage

Do updatable views solve this somehow?

cbandy avatar Aug 01 '13 16:08 cbandy

I'm not particularly familiar with the capabilities of PostgreSQL's updatable views, off hand. If someone more up to speed on them wants to propose something, I'd be happy to think about it.

nelhage avatar Aug 01 '13 17:08 nelhage

Any further thoughts or update on this?

I'm facing the exact scenario... need to unroll an embedded array into a different table.

taylorbrooks avatar Jan 23 '14 02:01 taylorbrooks

Unfortunately, I'm not longer at the place that I was at when I originally started working on this, so my work is quite out of date at this point. There's some work on my fork for unrolling embedded dicts, but as @nelhage mentioned in his earlier comment unrolling arrays and properly supporting keeping them in sync between the two databases will be quite a bit of work (and may be hard to be entirely generalizable).

andrewjshults avatar Jan 23 '14 03:01 andrewjshults

Yeah I don't need to keep the databases in sync. The mongodb is small enough that I can drop and reimport periodically.

What branch of your fork have you worked on this?

taylorbrooks avatar Jan 23 '14 16:01 taylorbrooks

I've also forked the project and added a few things of my own, including relational tables. We don't support oplog tailing, since even without any of the additions, the oplog tailer can't keep up with our traffic. Though a bit out of date with upstream, we use it in production just fine. The branch is here:

https://github.com/dpatti/mosql/commits/fogcreek

It's implemented exactly as I wrote originally, though we use the newer long-form representation of columns, like so:

my_blog:
  posts:
    :columns:
      - _id: VARCHAR(24)
      - title: VARCHAR(150)
    :meta:
      :table: posts
      :extra_props: true
    :related:
      user_post_likes:
        - post_id:
          :source: _id
          :type: VARCHAR(24)
        - user_id:
          :source: users_liked[]
          :type: VARCHAR(24)
      user_post_ratings:
        - post_id:
          :source: _id
          :type: VARCHAR(24)
        - user_id:
          :source: ratings[].user_id
          :type: VARCHAR(24)
        - score:
          :source: ratings[].score
          :type: INT

dpatti avatar Jan 23 '14 17:01 dpatti

Thanks @dpatti.

Want to make sure I understand this clearly... In your example, users_liked is an array of user ids embedded with the posts collection and this yaml file will create a user_post_likes relational table in pg? (Same goes with post ratings?)

taylorbrooks avatar Jan 23 '14 17:01 taylorbrooks

Correct, so specifying users_liked[] as the source does a simple unwind. If you are using subdocs, you can unwind a specific attribute of each of the subdocs in the array, such as ratings[].user_id and ratings[].score. I would not recommend unwinding two different arrays within the same relational table. I don't know if my code can handle that, though I remember thinking about it hard when I wrote it.

dpatti avatar Jan 23 '14 17:01 dpatti

Cool. Yeah off the top of my head I don't think I have a case where I'd need to do that.

Thanks a ton!

taylorbrooks avatar Jan 23 '14 17:01 taylorbrooks

Hrm, anything I'm missing here? I can't get the embedded array to unwind into a new table.

mongodb:
  accounts:
    :columns:
    - id:
      :source: _id
      :type: VARCHAR(24)
    - name:
      :source: name
      :type: VARCHAR(100)
    - state:
      :source: state
      :type: VARCHAR(24)
    :meta:
      :table: accounts
    :related:
      subscriptions:
        - subscription_id:
          :source: subscriptions[]._id
          :type: INT
        - state:
          :source: subscriptions[].state
          :type: VARCHAR(24)
        - product_id:
          :source: subscriptions[].product_id
          :type: VARCHAR(24)
        - account_id:
          :source: _id
          :type: VARCHAR(24)
      cards:
        - last_4: 
          :source: cards[].last_4
          :type: INT
        - expiration_month:
          :source: cards[].expiration_month
          :type: INT
        - expiration_year:
          :source: cards[].expiration_year
          :type: INT
        - account_id:
          :source: _id
          :type: VARCHAR(24)

taylorbrooks avatar Jan 23 '14 19:01 taylorbrooks

Are you getting an error at all? I tried it with some dummy data and it seemed to work for me. You put subscriptions[]._id as type INT -- you didn't by chance mean VARCHAR(24), did you? If it's still not working, could you paste an example document?

Edit: Here's how I run it:

MOSQL=path/to/mosql/checkout
export BUNDLE_GEMFILE=$MOSQL/Gemfile
export RUBYLIB=$MOSQL/lib

MONGO=mongodb://localhost/
POSTGRES=postgres://localhost/testdb
bundle exec \
  $MOSQL/bin/mosql --collections --collections.yml \
                   --mongo $MONGO \
                   --sql $POSTGRES \
                   --reimport \
                   --skip-tail

dpatti avatar Jan 23 '14 20:01 dpatti

Changed type to VARCHAR(24), didn't change the import.

Here's a sample doc: https://gist.github.com/taylorbrooks/15204ce0fd219b6e9c3d

Trying your way of running it now.

taylorbrooks avatar Jan 23 '14 20:01 taylorbrooks

I removed the ellipses from your doc, inserted it into mongo, and it exported fine. I got all 3 tables and 1 row in each.

dpatti avatar Jan 23 '14 20:01 dpatti

I think I got it working. I deleted all versions of mosql, install from your fork and used your CLI stuff.

I still get this error (from what I can tell it's still correctly imported accounts and subscriptions) :

INFO MoSQL: Creating table 'accounts'...
INFO MoSQL: Creating table 'subscriptions'...
INFO MoSQL: Importing for Mongo DB sso_development...
INFO MoSQL: Importing for sso_development.accounts...
INFO MoSQL: Imported 339 rows into 'accounts' (0.1s, 0.0s SQL)...
INFO MoSQL: Imported 209 rows into 'subscriptions' (0.1s, 0.0s SQL)...
/Users/taylorbrooks/Documents/Rails/mosql/lib/mosql/cli.rb:277:in `add_identity': undefined method `lit' for "extract(epoch from now())":String (NoMethodError)
    from /Users/taylorbrooks/Documents/Rails/mosql/lib/mosql/cli.rb:267:in `check_index_creation'
    from /Users/taylorbrooks/Documents/Rails/mosql/lib/mosql/cli.rb:219:in `initial_import'
    from /Users/taylorbrooks/Documents/Rails/mosql/lib/mosql/cli.rb:148:in `run'
    from /Users/taylorbrooks/Documents/Rails/mosql/lib/mosql/cli.rb:16:in `run'
    from ../../Rails/mosql/bin/mosql:5:in `<main>'

taylorbrooks avatar Jan 23 '14 20:01 taylorbrooks

Oh right. That has something to do with sequel. I know my Gemfile.lock uses sequel (3.44.0), and I know that when I tried updating it one time, it ended up not having the String#lit definition anymore. As far as I can tell, it's still there in master, so I don't know if you want to try forcing the same version I am using in the lock.

dpatti avatar Jan 23 '14 21:01 dpatti

Is work still ongoing on this? Would love to see it merged in at some point... unfortunately I'm getting some import errors when using the fork that I don't get from current mosql master.

zapnap avatar Apr 01 '14 15:04 zapnap

Unfortunately I don't think it can be merged in, or at least not trivially. I made all of my changes under the assumption that you are not using the oplog tailer. What kind of error are you seeing? It could certainly be something in my code. On Apr 1, 2014 11:32 AM, "Nick Plante" [email protected] wrote:

Is work still ongoing on this? Would love to see it merged in at some point... unfortunately I'm getting some import errors when using the fork that I don't get from current mosql master.

Reply to this email directly or view it on GitHubhttps://github.com/stripe/mosql/issues/11#issuecomment-39219554 .

dpatti avatar Apr 01 '14 15:04 dpatti

@dpatti Thanks for your work, but unfortunately, there's definitely something wrong with your code. I tried a basic YAML without any related stuff in it, and it chokes, whereas the stripe mosql gem works perfectly. I'm using sequel 3.44.0 as you recommended, and I'm passing in the --skip-tail option.

Here's my YAML:

ohana-api-smc_development:
  locations:
    :columns:
    - id:
      :source: _id
      :type: TEXT
    - organization_id: TEXT
    - accessibility: TEXT
    - admin_emails:
      :source: admins
      :type: TEXT
    - description: TEXT
    - emails: TEXT
    - hours: TEXT
    - languages: TEXT
    - name: TEXT
    - short_desc: TEXT
    - transportation: TEXT
    - urls: TEXT
    - updated_at: TIMESTAMP
    - created_at: TIMESTAMP
    :meta:
      :table: locations
      :extra_props: true

Your code results in 3 distinct errors that appear a bunch of times: 1.

WARN MoSQL: Ignoring row (id=): PG::NotNullViolation: ERROR:  null value in column "id" violates not-null constraint
DETAIL:  Failing row contains (null, null, null, null, null, null, null, null, null, null, null, null, null, null, null).
PG::ProgramLimitExceeded: ERROR:  index row size 3568 exceeds maximum 2712 for index "locations_pkey"
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
PG::DatatypeMismatch: ERROR:  column "updated_at" is of type timestamp without time zone but expression is of type record
LINE 1: ...","hours":"(24 hours daily)"}]}'), "updated_at" = ('521d32e4...
                                                             ^
HINT:  You will need to rewrite or cast the expression.

Here's a sample entry from the DB:

{
  "_id": {
    "$oid": "521d32b91974fcdb2b000002"
  },
  "_slugs": [
    "san-mateo-county-slash-silicon-valley-convention-and-visitors-bureau"
  ],
  "accessibility": [
    "wheelchair"
  ],
  "address": {
    "_id": {
      "$oid": "521d32b91974fcdb2b000004"
    },
    "street": "111 Anza Blvd., Suite 410",
    "city": "Burlingame",
    "state": "CA",
    "zip": "94010"
  },
  "ask_for": [
    "Receptionist"
  ],
  "contacts": [
    {
      "_id": {
        "$oid": "521d32b91974fcdb2b000003"
      },
      "name": "Anne LeClair, CAE, CDME",
      "title": "President, CEO"
    }
  ],
  "coordinates": [
    -122.3486066,
    37.5909749
  ],
  "created_at": {
    "$date": 1377645241035
  },
  "description": "Offers information about the county's attractions and brochures and maps of cities in the county. Assists groups and individuals in booking sleeping rooms and meetings.",
  "emails": [
    "[email protected]"
  ],
  "faxes": [],
  "hours": "Monday-Thursday, 8:30-5; Friday, 8:30-4",
  "kind": "other",
  "languages": null,
  "mail_address": {
    "_id": {
      "$oid": "521d32b91974fcdb2b000005"
    },
    "attention": "Convention and Visitors Bureau",
    "street": "111 Anza Blvd., Suite 410",
    "city": "Burlingame",
    "state": "CA",
    "zip": "94010"
  },
  "name": "San Mateo County/Silicon Valley Convention and Visitors Bureau",
  "organization_id": {
    "$oid": "521d32b81974fcdb2b000001"
  },
  "phones": [
    {
      "number": "800 288-4748",
      "hours": "(Monday-Thursday, 8:30-5; Friday, 8:30-4)"
    }
  ],
  "short_desc": "Offers information about the county's attractions and brochures and maps of cities in the county.",
  "transportation": "SAMTRANS stops within 6 blocks.",
  "updated_at": {
    "$date": 1383857247210
  },
  "urls": [
    "http://www.smccvb.com"
  ]
}

I hope this helps you troubleshoot. Thanks!

monfresh avatar Apr 28 '14 22:04 monfresh

@monfresh Thanks for the details, but would you mind posting them as an issue on my fork? I don't want to keep generating notifications for Mosql contributors for code that isn't going to be merged upstream. That said, I will try and find some time in the next few days to reproduce and fix it.

dpatti avatar Apr 28 '14 22:04 dpatti

I'm trying to create relational tables, are there any updates regarding this issue? @dpatti I can't find your fork.

Any help would be really appreciated.

AboulEinein avatar Apr 17 '16 16:04 AboulEinein

@AboulEinein The fork was eventually moved here: https://github.com/trello/mosql/tree/streamless-fork

Though I think it is still being used at Trello, it is very out of date and probably comes with a lot of caveats. I'm not involved with it anymore, so that's about as much as I can offer.

dpatti avatar Apr 18 '16 03:04 dpatti

@dpatti thank you so much for the reply! I'll check it out.

AboulEinein avatar Apr 18 '16 09:04 AboulEinein

@AboulEinein @monfresh any of you still have @dpatti 's fork?

laurentrivard avatar Jun 14 '17 02:06 laurentrivard

@laurentrivard I do not. I haven't used this since 2014.

monfresh avatar Jun 14 '17 02:06 monfresh

@laurentrivard I do not, unfortunately.

Check AWS Database Migration Service (https://aws.amazon.com/dms/)

AboulEinein avatar Jun 14 '17 09:06 AboulEinein

@laurentrivard I moved the fork into @trello a few years ago while I worked there, but seeing as how it's not there anymore, they either deleted it or made it private.

dpatti avatar Jun 14 '17 14:06 dpatti