prisma1 Improving performance with Indexes

Improving performance with Indexes

Open sorenbs opened this issue 6 years ago • 48 comments

Please see latest spec in: https://github.com/prisma/prisma/issues/3405

In order to be able to optimise performance for the actual workload, it should be possible for the development team to specify the exact indexes they need.

Indexes currently being created

primary key

All models are backed by a single database table with a id column containing the node id. This column is indexed as the primary key.

unique fields

Uniqueness is guaranteed by creating a unique index on the column.

Example

type User @model {
  id: ID! @isUnique
  name: String @isUnique
  city: String
}

This model generates a table with two indexes:

primaryKey(id)
unique(name)

relations

All relations are implemented with a intermediate relation table:

type User @model {
 id: ID! @isUnique
 name: String
 posts: [Post!]! @relation(name: "Posts")
}

type Post @model {
 id: ID! @isUnique
 title: String
 creator: user! @relation(name: "Posts")
}

generates three database tables:

User

id	name
string	string

primaryKey(id)

Post

id	title
string	string

primaryKey(id)

Posts

id	A	B
string	string	string

primaryKey(id)
unique(id)
index(A)
index(B)
unique(A,B)

Simplifying indexes on relation tables

Remove the id coulmn

There is no need for a dedicated id column on relation tables

Instead, use the compound index A,B as the primary key to guarantee that a node can only be inserted once into the relation

Two indexes is enough to provide fast traversal

To provide fast traversal in both directions we need two indexes: one on A and one on B.

As we already have a compound index with A as prefix, we just need an index on B

We end up with this structure:

A	B
string	string

primaryKey(A,B)
index(B)

Complexity Analysis

Single Model

non unique field

{allUsers(filter:{age: 25})}

Currently: Tablescan
Best Case: index lookup | index(age)

{allUsers(filter:{age: 25, orderBy: MONEY_ASC})}

Currently: Tablescan
Best Case: index lookup | index(age, money)
More Realistic: index lookup + filesort | index(age)

two fields

{allUsers(filter:{age: 25, name: "Karl"})}

Currently: Tablescan
Best Case: compound index | index(age, name)
More Realistic: Index intersection or single key index | index(age), index(name)

{allUsers(filter:{age: 25, name: "Karl"}, orderBy: MONEY_ASC)}

Currently: Tablescan
Best Case: compound index | index(age, name, money)
More Realistic: Index intersection or single key index + filesort | index(age), index(name)

range query

{allUsers(filter:{age_gt: 25})}

Currently: Tablescan
Best Case: index seek | index(age)

{allUsers(filter:{age_gt: 25}, orderBy: AGE_ASC)}

Currently: Tablescan
Best Case: index seek | index(age)

{allUsers(filter:{age_lt: 25}, orderBy: AGE_ASC)}

Currently: Tablescan
Best Case: index seek + filesort | index(age) (range filter and orderby not compatible)

{allUsers(filter:{age_gt: 25}, orderBy: MONEY_ASC)}

Currently: Tablescan
Best Case: index seek + filesort | index(age)

range query and equality

{allUsers(filter:{age_gt: 25, name: "Karl"})}

Currently: Tablescan
Best Case: compound index seek | index(name, age)
More Realistic: index seek + table lookup | index(name)

[Sorting same as range query]

range query on two fields

{allUsers(filter:{age_gt: 25, money_gt: 10000})}

Currently: Tablescan
Best Case: compound index seek | index(name, money)
More Realistic: index seek + table lookup | index(money)

{allUsers(filter:{age_gt: 25, money_gt: 10000}, orderBy: MONEY_ASC)}

Currently: Tablescan
Best Case: compound index seek | index(name, money)
More Realistic: index seek + table lookup + filesort | index(money)

{allUsers(filter:{age_gt: 25, money_gt: 10000}, orderBy: FAMILYMEMBERS_ASC)}

Currently: Tablescan
Best Case: compound index seek | index(name, money, familyMembers)
More Realistic: index seek + table lookup + filesort | index(money)

Multiple Models

simple relation

{User(id:""){posts{title}}}

Currently: 2 x index lookup | index(A), index(id)
Best Case: 2 x index lookup | index(A), index(id)

{User(id:""){posts(orderBy: LIKES_ASC){title}}}

Currently: 2 x index lookup + filesort | index(A), index(id)
Best Case: 2 x index lookup | mirror(posts.likes), index(A, likes), index(id)

relation with single equality

{User(id:""){posts(filter:{title: "nice title"}){title}}}

Currently: index lookup + overfetching index lookup | index(A), index(id)
Best Case: 2 x index lookup | mirror(posts.title) index(A, title), index(id)

{User(id:""){posts(filter:{title: "nice title"}, orderBy: LIKES_ASC){title}}}

Currently: index lookup + overfetching index lookup + filesort | index(A), index(id)
Best Case: 2 x index lookup | mirror(posts.likes), mirror(posts.title), index(A, title, likes), index(id)
More Realistic: index lookup + overfetching index lookup | mirror(posts.likes), index(A, likes), index(id)

relation with range query

{User(id:""){posts(filter:{likes_gt: 3}){title}}}

Currently: index lookup + overfetching index lookup | index(A), index(id)
Best Case: index seek + index lookup | mirror(posts.likes) index(A, likes), index(id)

{User(id:""){posts(filter:{likes_gt: 3}, orderBy; AGE_ASC){title}}}

Currently: index lookup + overfetching index lookup + filesort | index(A), index(id)
Best Case: index lookup + overfetching index lookup | mirror(posts.age) index(A, age), index(id)

{User(id:""){posts(filter:{likes_gt: 3}, orderBy; LIKES_ASC){title}}}

Currently: index lookup + overfetching index lookup + filesort | index(A), index(id)
Best Case: index lookup + index seek | mirror(posts.likes) index(A, likes), index(id)

Mirror fields to relation table

Many of the Best Case and Realistic Case scenarios above depend on one or more columns from a model table to be mirrored in the relation table. Mirroring a column into the relation table enables us to create a compound index across relations. In the relation example above, we might introduce a mirrored field like this:

type User @model {
 id: ID! @isUnique
 name: String
 posts: [Post!]! @relation(name: "Posts")
}

type Post @model {
 id: ID! @isUnique
 title: String # mirror this field into Posts relation
 creator: User! @relation(name: "Posts")
}

generates three database tables:

User

id	name
string	string

primaryKey(id)

Post

id	title
string	string

primaryKey(id)

Posts

id	A	B	B_title
string	string	string	string

primaryKey(A,B)
index(B)
index(A, B_title)

The new compound index allows us to speed up queries like this:

{allUsers(filter:{posts_any: {title: "A Title"}}) {...} }

Mirrored fields must be kept in sync by the graphcool backend

Specifying indexes in `types.graphql`

Design Goals

Should be as intuitive as possible
Must support multiple indexes covering the same field (a directive can only be applied once to a field)
Must support indexes across relations (including fields on just one side and both sides)

Single Model

one field

type User @model {
  id: ID! @isUnique
  name: String @index
}

# or specify sort order

type User @model {
  id: ID! @isUnique
  name: String @index(sort: ASC)
}

multiple fields

type User @model {
  id: ID! @isUnique
  name: String @index(groups: [{name: "combined_name_and_age", position: 0}])
  age: String @index(group: [{name: "combined_name_and_age", position: 1}])
}

Across relations

one field

type User @model {
  id: ID! @isUnique
  name: String @index(groups: [{relation: "Posts"}])
  posts: [Post!]! @relation(name: "Posts")
}

type Post @model {
  id: ID! @isUnique
  title: String
  creator: User! @relation(name: "Posts")
}

Improves performance of queries like:

{allPosts(filter:{creator:{name: "John"}})}

multiple fields

type User @model {
 id: ID! @isUnique
 name: String @index(groups: [{name: "combined_name_and_title", position: 0, relation: "Posts"}])
 posts: [Post!]! @relation(name: "Posts")
}

type Post @model {
 id: ID! @isUnique
 title: String @index(groups: [{name: "combined_name_and_title", position: 1, relation: "Posts"}])
 creator: User! @relation(name: "Posts")
}

Improves performance of queries like:

{allPosts(filter:{title: "Great Title", creator:{name: "John"}})}

Unique Index

To specify a unique constraint covering multiple fields, simply use the @unique directive instead of @index:

type User @model {
 id: ID! @isUnique
 name: String @unique(groups: [{name: "combined_name_and_age", position: 0}])
 age: String @unique(group: [{name: "combined_name_and_age", position: 1}])
}

Nov 19 '17 21:11 sorenbs

@sorenbs An excellent and thorough analysis!

Before I go any further, you might have a typo in the definition of the 'Post table':

type Post @model {
 id: ID! @isUnique
 title: String
 creator: Post! @relation(name: "Posts") // <-- should read 'creator: User! @relation(name: "Posts")?
}

I think the syntax also needs to cover simple non-join tables as per follows:

type Country @model {
  id: ID! 
  countryName: String! @isUnique
  regions: [Region!] @relation(name: "CountryRegions") 
  // Regions are States in the US, but 'Provinces' in other  countries
}

type Region @model {
  id: ID!
  country: Country!  @IsUnique(groups: [{name: "UK_CountryRegions", position: 0}])
  stateName: String! @isUnique(groups: [{name: "UK_CountryRegions", position: 1}]
}  // UK_ here means 'Unique Key'

Also, I do not think that generated M2M Join tables are a good idea. One thing I learned the hard way over the years, even after having been given this piece of advice, was that if a M2M join table is created solely for the purpose of joining, without any additional info in it, then the database analysis is probably missing some important feature, and needs to be looked at again. Some M2M join tables also sometimes join 3 or 4 tables.

In principle, a M2M join table needs to include some other qualifying columns, above and beyond the standard 'CreatedAt, CreatedByUser' etc. For eg:

Real-Estate Agents (who often work in pairs, and can work with agents from another Agency, under a deal arrangement), is mapped to a specific property, sometimes as the 'Lead' agent, sometimes as the 'Assistant'. So the M2M join table might look like:

type MapAgent2Property @model {
   id: ID! @isUnique
   agentID: ID!
   propertyID: ID!
   agentRole: RoleEnum! // 'Primary', 'Assistant', 'CoAgent' etc. 
   dealAgreementID: ID // The details of who gets what % of the sale
}

So what I am saying is that generated M2M tables are not really feasible. The DB analyst should be able to define the table properly, and define the necessary (multi-column) Unique keys to ensure consistency.

So, your proposed syntax is good, but does not need to be as complex as you are making it. I really like the 'sort_order' option as well.

Good job on the thorough analysis!

Nov 20 '17 16:11 Trellian

BTW, what MarkDown tag did you use to get the cool Markup on your types?

Nov 20 '17 16:11 Trellian

` ` ` graphql

Nov 20 '17 16:11 kbrandwijk

awesome, thanks! @kbrandwijk

Nov 20 '17 17:11 Trellian

Thanks @Trellian - some good thoughts!

First - I have corrected the typo.

About the M2M join table: A core design goal of Graphcool is to enable pain-free schema migrations preserving existing data.

One of the design decisions this has led to is that we always use a M2M-style join table when creating relationships. This way, transforming a 1-m relationship into a M2M is literally a matter of changing the type of a single field to a list. As we don't have to move data around, we can perform this operation quickly without downtime and without loosing data. This also allows us to introduce Kims proposal https://github.com/graphcool/framework/issues/746 in a way that does not require changing the underlying datastructure.

We should carefully consider if this level of flexibility is really required and if the tradeoff in performance is worth it, so thank you for bringing it up.

An alternative could be that relations by default never use a relation table, and M2M relations are not supported. If a relation table is required (either to add extra fields on the relation or to create a M2M relation), then the @edge type proposed by Kim is required.

Do you think that would be a better approach? My main concern is that this introduces extra complexity that makes it more difficult to get started with Graphcool.

Nov 20 '17 18:11 sorenbs

@sorenbs I like the @edge syntax. It's effectively defining a join table.

But I don't want to get off topic here, though. We're talking about having multi-column unique indexes/keys. Once you have those in place, a whole new world opens up. Without them, one can't produce a real-world database, except using the BeforeHook approach, which is effectively throwing away the support for such in the backend RDBMS, which every good RDBMS is exceptionally good at. Field Constraints, Table Constraints, Relation Constraints all belong in the RDBMS. That's what the RDBMS is there for, after all.

Can I suggest that we open a new topic for discussing the impact of Join tables? Please also note, that a Join table is just like any other, no need to abstract it out, really. If there is extra information on the 'edge' with a join table, then that info is useful. Even for the simplest case of just order-by.

Nov 21 '17 17:11 Trellian

@sorenbs I don't see https://github.com/graphcool/framework/issues/746 marked as 1.0 😄

Nov 21 '17 18:11 kbrandwijk

I agree with @Trellian that multi-column uniqueness constraints are far more important than having a join table for being able to easily convert a one to many 1-M relationship into a many to many M2M relationship. I need many to many relationships occasionally, so don't want to get rid of that ability and realize that would still need a join table, but IMO it's going to be pretty rare to want to transition between those join situations, and I'd be fine with that being manual and cumbersome (need to create new tables, migrate data, rename) if it meant I got multi-column constraints (and indexes) which I need all the time.

Nov 22 '17 07:11 mmrobins

The proposal mentions

To specify a unique constraint covering multiple fields, simply use the @unique directive instead of @index:

type User @model {
 id: ID! @isUnique
 name: String @unique(groups: [{name: "combined_name_and_age", position: 0}])
 age: String @unique(group: [{name: "combined_name_and_age", position: 1}])
}

Would this also allow for unique constraints spanning relations (also see #65), presumably with this syntax:

type User @model {
 id: ID! @unique(groups: [{name: "post_title_per_user", position: 0}])
 name: String 
 age: String 
}

type Post @model {
  id: ID!  @isUnique
  title: String! @unique(groups: [{name: "post_title_per_user", position: 1}])
}

Is there a difference between @isUnique and @unique?

Nov 23 '17 14:11 marktani

@marktani in answer to #1,

The necessary syntax is simpler than in #1 above, I think it should look like this:

type User @model {
 id: ID! @Unique
 name: String 
 age: String 
 Posts: [Post!] @relation(name: "user_posts")
}

type Post @model {
  id: ID!  @Unique
  author: User! @relation(name: "user_posts) @unique(groups: [{name: "post_title_per_user", position: 0}])
  title: String! @unique(groups: [{name: "post_title_per_user", position: 1}])
}

or maybe this would be simpler and easier to implement?:

type Post @model {
  id: ID!  @Unique
  author: User! @relation(name: "user_posts)
  title: String! 
  @unique({name: "UK_user_posts", columns: {"author", "title"})
// or even simpler, no need for a unique key name, most RBMS don't need them, or they `autogenerate`:
// @unique(columns: {"author", "title"})
}

Unique keys are Table constraints, and are never allowed to span tables directly in RDBMS theory.

For #2, there is no difference between @isUnique and @Unique

Nov 23 '17 15:11 Trellian

How will multi-field unique constraints be reflected in where input objects?

Example schema:

type Post {
 id: ID! @unique
 title: String @unique(groups: [{name: "combined_title_and_slug", position: 0}])
 slug: String @unique(group: [{name: "combined_title_and_slug", position: 1}])
}

Query by title:

query {
  post(where: {
    title: "My biggest adventure"
  }) {
    id
    title
    published
  }
}

Query by slug:

query {
  post(where: {
    id: "my-biggest-adventure"
  }) {
    id
    title
    published
  }
}

Is there a way to fetch a post by a unique title + slug combination? I thought of this:

query {
  post(where: {
    combined_title_and_slug: {
      slug: "my-biggest-adventure"
      title: "My biggest adventure"
    }
  }) {
    id
    title
    published
  }
}

Jan 02 '18 16:01 marktani

I'd just like to mention that I'm really hurting for a compound primary key at the moment to provide an upsert on these fields

Mar 15 '18 01:03 luxaritas

Yes, echoing the need for a multi-column index to enforce uniqueness.

Right now we are manually enforcing the combined uniqueness by performing a "where" at all places a creation can take place.

Jun 11 '18 08:06 roycclu

what about @fulltext()index?

Jun 19 '18 13:06 terion-name

+1 for multi-column index and unique indexes!

Jun 21 '18 12:06 eliezedeck

Suggestion looks good!

Is anything happening here? How can we help?

About multi column unique indexes, which i'm in dire need of at the moment: Why have 2 decorators @isUnique and @unique - could that possibly be done with a single one?

Jun 24 '18 07:06 codepunkt

Generally agree Foreign Keys, multi column index, and constraints are super important in a robust for production RBDMS. I dont understand the design decision for defaulting to join tables, seems optimizing for the wrong thing, 80-20 rule, relational dbs are ALL about relationships, native on those entities, FKs are important, if u need an entire new table every time u need a simple FK that is a huge overhead, huge performance impact adding joins to SQL, and just generally hard to work with engineering wise, especially coming into a project with a big graph/schema.

Jul 08 '18 23:07 bjm88

Adding to @bjm88 I really don't understand, how this feature could be not in the list of top-priorities. Running a production DB without keys and constraints and any kind of load more than nothing is absurd. And without compound indexes in lot cases you just can't guarantee data consistency.

Jul 08 '18 23:07 terion-name

@terion-name That's the argument I have been trying to make for months, and have been shot down repeatedly. It is such a fundamental feature that I seriously wonder if the guys behind graphcool have any real-world experience at all. I'm not trying to be nasty here. It's just an observation. I wouldn't even consider releasing a product like this without those basic features. It will be dead in the water on the first even slightly non-trivial database. You wouldn't even be able to get the basic old-favourite 'video-store' DB to work properly. @bjm88 I agree totally with you too. The solution as it stands is totally unworkable in the real world.

Jul 10 '18 19:07 Trellian

I started a medium-sized project with Prisma ... just after 3 days of inputting data into it, I started to notice that it chocked badly, because? ... no indexes. And there was a weird Apollo client bug which rendered everything just inconsistent. I thus gave-up and simply moved to another stack instead.

Not that I'm too impatient, but I have noticed in the past that features you ask don't usually get implement as eagerly as you'll want them to be (in the days of Graphcool and now), even if it seems important. Sure, this is open-source, but I don't have time to learn Scala and I don't see the need to. And there's the internal latency with Prisma, which seems to be an additional 100ms.

Well, all those things adding-up together: I had to explore my options.

Jul 11 '18 06:07 eliezedeck

@marktani please, this feature is absolutely crucial for production, a first-tear. no sense in adding connectors or any features if it just can't utilize db performance.

Jul 12 '18 07:07 terion-name

Switching to other solutions because of the lack of this feature.

Jul 12 '18 08:07 semenovDL

Hi Prisma team, I think you are seeing that building an ORM it is crucial to get relationships and performance right working with databases. This project has such potential, I really hope you consider doing organic foreign keys on tables themselves and taking index setup and management seriously. Have you considered using GitCoin to have people be able to pay for features to help support them. I suspect this is actually not hard to implement, many other ORMs do it, but I simply cannot use this library without these basic design principles for working with a db, would be happy to support with $ if you enable it.

Jul 12 '18 13:07 bjm88

Adding in my vote for this ~~feature~~ requirement. Composite foreign keys, indexes and the implementation of proper RDBMS architecture are a must for this project.

Aug 02 '18 08:08 develomark

For those that run into performance problems using Postgres, check out Hasura as an option. All things considered, i like prisma better - but when performance is important, this unfortunately is on the slower side atm.

Aug 02 '18 11:08 codepunkt

Hey all,

thanks for sharing your concerns with us! :pray: First of all i would like to say that performance is an important topic for us and internally we are discussing this all the time. We are committed to improving performance as a continuous effort.

However indexes are a just one facet of performance. During the last weeks we created an extensive benchmarking suite which we used to drive our decisions on which areas must be improved first performance wise. In our tests just adding indexes did not help performance as much as we would like it to. Instead we identified our SQL queries and application code as the biggest problems right now.

Therefore we spent a lot of time during the last weeks to refactor our SQL queries to be more performant. In addition we profiled the Prisma application code a lot and were able to implement significant improvements for CPU and memory consumption. The first big batch of those improvements is landing in our release 1.14 which went into the beta stage this week. 🚀

If you are eager to give it a spin just use the tag/version 1.14-beta. We have some more improvements in the pipeline for the subsequent releases as well. When we are through with all of those changes we have planned, we will turn to indexes again and i am confident that they will then bring additional improvements.

And there is one last thing around indexes: If you think that indexes are absolutely crucial you always have the possibility to just connect to your database and create them manually. I agree that this is not the best experience yet, but we really want to get indexes right before we ship anything half baked to our users.

PS: In case you hit any performance problems with 1.14 or later, we would really like to hear about it. Just contact me on our public Slack (@marcus)so we can learn about your concrete problems and can come up with solutions in future releases. And i can probably give you a hint or two how your GraphQL query could be optimised 😃

Aug 02 '18 12:08 mavilein

@mavilein thank you for the detailed update. Can you speak to the compound primary key aspect of the discussion, please?

I.e. the ability to apply @unique to a combination of fields for a relation.

This aspect is the more limiting factor in terms of usability.

Re:https://github.com/prismagraphql/prisma/issues/171

Aug 03 '18 09:08 develomark

@mavilein performant queries is a good thing, but saying that indexes are not efficient — is not correct. Now we simply can't make compound indexes (like unique(slug, lang)) or fulltext indexes. We just can't optimize any selects beyond selections by ID (price range selects, date period selects and many other that are used in real-life applications). What is the problem? Why indexes is a problem for almost a year? It is far more easy thing than optimizing query builder and it is absolutely crucial and should be one of first features. And manual indexes just don't fit in current migrating model (that also needs to be changed, but it is a separate topic). In fact in any migrating model. It's just not an option for development cycle

Aug 05 '18 13:08 terion-name

@mavilein, if I create the unique constraints manually as you suggest, what are the chances that I will have problems later with e.g. data migrations? Thanks

Sep 02 '18 19:09 marcovc

@sorenbs

We should carefully consider if this level of flexibility is really required and if the tradeoff in performance is worth it, so thank you for bringing it up.

An alternative could be that relations by default never use a relation table, and M2M relations are not supported. If a relation table is required (either to add extra fields on the relation or to create a M2M relation), then the @edge type proposed by Kim is required.

Do you think that would be a better approach? My main concern is that this introduces extra complexity that makes it more difficult to get started with Graphcool.

Did you guys ever have a meeting(s) about this and pick a direction? These are very good questions that will dictate a ton of code you guys write, as well as how successful Prisma is.

fwiw, the "M2M everything!" approach you're currently drifting along with was the reason we jumped ship.

My main concern is that this introduces extra complexity that makes it more difficult to get started with Graphcool.

The complexity would be worth it. (I'd trade the complexity of developing/supporting 50 connectors to just supporting Postgres.)

Sep 19 '18 18:09 corysimmons

prisma1 prisma1 copied to clipboard

Improving performance with Indexes

Indexes currently being created

primary key

unique fields

Example

relations

User

Post

Posts

Simplifying indexes on relation tables

Remove the id coulmn

Two indexes is enough to provide fast traversal

Complexity Analysis

Single Model

non unique field

two fields

range query

range query and equality

range query on two fields

Multiple Models

simple relation

relation with single equality

relation with range query

Mirror fields to relation table

User

Post

Posts

Specifying indexes in types.graphql

Design Goals

Single Model

one field

multiple fields

Across relations

one field

multiple fields

Unique Index

prisma1
prisma1 copied to clipboard

Specifying indexes in `types.graphql`