thinky Support embedded arrays for hasAndBelongsToMany relationships

Today I was watching this talk by Jorge Silva on data modeling in RethinkDB, in which he goes through a number of different approaches for storing related data. The first technique he demonstrated was using a third table for joining, which is exactly how thinky sets up n-to-n relationships with hasAndBelongsToMany. He then went on to talk about another technique: embedding an array of ids into the document. Here's the video (skipped to the relevant part) and also an example of the structure in his demo:

Jorge brings up a few positive points about embedded arrays:

it may be more efficient
avoids an intersecting table
queries are simpler
this is the approach that he normally recommends

Since it is a fairly common (and highly recommended) pattern, shouldn't thinky support setting up relationships in this manner? It may be true that n-to-n relationships created this way would be unidirectional, but that constraint is perfectly reasonable (and probably desired) when using this pattern.
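To make the two layouts concrete, here is a minimal in-memory sketch in plain Python (no RethinkDB or thinky involved). The posts/tags data and every name in it (`post_tag`, `tag_ids`, the helper functions) are assumptions made up for illustration; this is not the structure from the talk, nor thinky's actual schema.

```python
# Hypothetical n-to-n data: posts and tags. All names are illustrative only.
tags = {
    "t1": {"id": "t1", "name": "db"},
    "t2": {"id": "t2", "name": "orm"},
}

# Layout A: a third (join) table holds one row per link.
posts = {
    "p1": {"id": "p1", "title": "Intro"},
    "p2": {"id": "p2", "title": "Joins"},
}
post_tag = [
    {"post_id": "p1", "tag_id": "t1"},
    {"post_id": "p2", "tag_id": "t1"},
    {"post_id": "p2", "tag_id": "t2"},
]


def tags_of_post_join(post_id):
    """Resolve a post's tag names by going through the join table."""
    return [tags[row["tag_id"]]["name"] for row in post_tag if row["post_id"] == post_id]


def posts_of_tag_join(tag_id):
    """The reverse direction uses the same join table, just filtered on the other key."""
    return [posts[row["post_id"]]["title"] for row in post_tag if row["tag_id"] == tag_id]


# Layout B: each post embeds an array of tag ids; there is no third table.
posts_embedded = {
    "p1": {"id": "p1", "title": "Intro", "tag_ids": ["t1"]},
    "p2": {"id": "p2", "title": "Joins", "tag_ids": ["t1", "t2"]},
}


def tags_of_post_embedded(post_id):
    """Forward direction: read the array, then look up each tag by id."""
    return [tags[tag_id]["name"] for tag_id in posts_embedded[post_id]["tag_ids"]]


def posts_of_tag_embedded(tag_id):
    """Reverse direction: scan every post (or maintain an index over the array)."""
    return [p["title"] for p in posts_embedded.values() if tag_id in p["tag_ids"]]
```

In this sketch the embedded layout answers "tags of a post" directly from the post document, while "posts of a tag" has to look inside every post, which is the unidirectional flavor mentioned above.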

Sep 09 '15 09:09 sjmueller

it may be more efficient

Note that it's a may:

It's not more efficient for a write workload.
It's also less efficient if you just need a few fields from your documents.
It's also not obvious that it's even faster when fetching a city with its state.

avoids an intersecting table

Fair enough, though that doesn't mean it's always faster.

queries are simpler

Not true. Try to write the query to fetch a city with its state. Then try to write a query that updates the city's state. Thinky used to use this pattern in its 0.x version and it was a pain to maintain relations.
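A rough illustration of those two operations, as an in-memory Python sketch rather than real ReQL. It assumes the state document embeds the ids of its cities; the data and all names (`city_ids`, `state_id`, the helper functions) are made up for the example and are not taken from thinky 0.x.

```python
# Hypothetical embedded layout: each state holds an array of city ids.
states = {
    "s1": {"id": "s1", "name": "California", "city_ids": ["c1", "c2"]},
    "s2": {"id": "s2", "name": "Nevada", "city_ids": []},
}
cities = {
    "c1": {"id": "c1", "name": "Truckee"},
    "c2": {"id": "c2", "name": "Fresno"},
}


def city_with_state(city_id):
    """The city document carries no reference, so the states must be searched."""
    city = dict(cities[city_id])
    city["state"] = next(
        (s["name"] for s in states.values() if city_id in s["city_ids"]), None
    )
    return city


def move_city(city_id, new_state_id):
    """Changing a city's state touches two documents: the old and the new state."""
    for state in states.values():
        if city_id in state["city_ids"]:
            state["city_ids"].remove(city_id)
    states[new_state_id]["city_ids"].append(city_id)


# For comparison: with a foreign key stored on the city, the same change
# is a single-field update on one document.
cities_fk = {"c1": {"id": "c1", "name": "Truckee", "state_id": "s1"}}


def move_city_fk(city_id, new_state_id):
    cities_fk[city_id]["state_id"] = new_state_id
```

In a real database the two array updates in `move_city` would also have to stay consistent with each other if one of them fails, which is part of why keeping such relations in sync can be painful.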

this is the approach that he normally recommends

Fair enough, but I disagree with him. It's a poor approach in my opinion. I also think it's not relevant for thinky because:

thinky does the joins under the hood. Whether it uses a third table or an embedded array of values is not visible to the developer.
It's a MongoDB-ish approach, and since RethinkDB has server-side joins, there's little reason to use this pattern.

TL;DR: Thinky used to implement joins that way (hasMany and hasAndBelongsToMany) but reverted to a SQL-ish way of doing it in its 1.x version (I think).

Sep 09 '15 15:09 neumino

You bring up valid counter-arguments. However, I do think it's worthwhile to arrive at more unified guidance for data modeling, because this video is less than a month old and the message was fairly prescriptive. Thinky adds huge value to the rethinkdb ecosystem, but the product is still young and newcomers will have more confidence when best practices are very clear and well understood.

+@thejsj and @coffeemug to chime in with more clarity.

Sep 09 '15 20:09 sjmueller

+@dalanmiller as well.

Sep 09 '15 20:09 sjmueller

Like @neumino said in so many words, it largely depends on your data and your most common use case. @thejsj just gave different possibilities, but there is definitely no one-size-fits-all solution.

In your case, you mentioned in Gitter having to do a lot of frequent writes to your attachments data model. I think it'd be a safe bet to keep it separate (and thus small) and join it via ReQL, when necessary, with the message/email table you mentioned.

Sep 09 '15 21:09 dalanmiller

@dalanmiller I agree that in the email example, a third join table is the best solution. However, my point is that with Thinky there is currently no choice between the two models. Maybe this turns out to be ok, but it certainly doesn't align with the no-one-size-fits-all idea or the sentiment from @thejsj's talk.

Sep 09 '15 23:09 sjmueller

@sjmueller -- In most cases, using a third table is better. The case where you actually get better performance from using embedded arrays is really narrow as far as I know. Having a simpler syntax is not relevant since thinky is doing the join under the hood.

If you think you need joins to be done via embedded arrays, you are welcome to send a pull request, but this is tricky and a lot of work.

Sep 10 '15 01:09 neumino

Just wanted to add my own opinion here.

First, I'm glad someone actually saw my talk! Thanks @sjmueller. I spent a lot of time going over different approaches and talking to all the engineers at Rethink in order to arrive at what I said. Ultimately, I want to stress again that it depends!

While I do think that it might be nice if something like Thinky (and this is coming from someone who's never used it!) had something like what @sjmueller suggests, it seems that it would be very hard to implement. It makes all the sense in the world to me that @neumino implemented that part of Thinky with intersection tables, because it is probably the most flexible way to do this.

My talk was based around people modeling their own data and not really for people using ORMs. Implementing an ORM is not something I really considered and it would make sense to me to use an intersection table for that.

@sjmueller If you're having performance problems, I'd personally love to see what's going on and if there's a way it could be improved.

Sep 12 '15 23:09 thejsj

@thejsj I'm not at the point where I'm profiling performance just yet, although I'll be sure to give an update when I do.

This issue was more to open a conceptual discussion about data modeling; specifically about the recommended approach in your video that is currently not available when using thinky. I'm not an expert with rethinkdb yet, so I wanted to know whether or not it mattered that embedded id arrays are not available for relationships. I get the sense that it does not matter too much, although I'll only know definitively as I gain more experience with thinky and rethinkdb.

I also want to take a moment to say thanks to @neumino for your work on thinky. So far it's been an immense productivity boost! I'm even surprised at how well it handled a few edge cases that I originally thought would need some hacks to get working.

Sep 14 '15 05:09 sjmueller