graphql-schema-registry
Add schema definition breakdown feature
Hello @tot-ra :wave:
We are planning to add some features mentioned in the roadmap, starting by adding a schema usage breakdown (Query, Mutations, Scalars, Objects...). The aim is to store all the schema definitions to be able to display them and also allow next steps such as usage tracking.
Backend changes
We would like to create some new MySQL tables with the distribution shown below:
Having this, we could parse the `type_defs` received on the `schema/push` endpoint into the new tables. Furthermore, we will add new GraphQL queries for the frontend to consume this data.
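To illustrate, a minimal sketch of that parsing step (assuming graphql-js for parsing; the row shapes and names here are just an example, not the final table design):

```typescript
// Sketch only: walk the pushed type_defs and derive rows for the proposed tables.
// Row shapes and naming are illustrative, not the final design.
import { Kind, parse, print } from 'graphql';

export function breakDownTypeDefs(typeDefs: string) {
  const doc = parse(typeDefs);
  const typeRows: { name: string; kind: string }[] = [];
  const fieldRows: { parentType: string; name: string; type: string }[] = [];

  for (const def of doc.definitions) {
    if (def.kind === Kind.OBJECT_TYPE_DEFINITION) {
      typeRows.push({ name: def.name.value, kind: 'Object' });
      for (const field of def.fields ?? []) {
        fieldRows.push({
          parentType: def.name.value,
          name: field.name.value,
          type: print(field.type), // e.g. "[String!]!"
        });
      }
    } else if (def.kind === Kind.SCALAR_TYPE_DEFINITION) {
      typeRows.push({ name: def.name.value, kind: 'Scalar' });
    }
    // ...interfaces, enums, unions and inputs would be handled the same way
  }

  return { typeRows, fieldRows };
}
```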
Frontend changes
Consume the new backend GraphQL queries and add new pages to present all the schema definitions.
Thank you for your time, and nice work! :smiley:
Hey. Thanks for the interesting topic; we do have schema usage in PD ourselves too (though it's not as nested). Some questions:
- How is the UI/API going to look?
- Let's say I make a query `{ user { name } }`. Do you plan to insert a new record for every query into the operation, type & field tables? Otherwise I don't see exactly how you're getting a schema usage breakdown (by property). If you do that, then this is not going to scale very well, because we can get a lot of queries & a lot of properties.
- Why do you need fields like `is_nullable` and `is_array` [for usage]?
Hello!
- How is the UI/API going to look?

It is going to be similar to other solutions on the market; I can share some prototyping tomorrow.
- Do you plan to insert a new record for every query into the operation, type & field tables?

That is the plan. The idea is to have control over which fields are inside each operation, which should not be an issue. On the other hand, we plan to store the usage of those fields on the operations, but not forever: the idea is to keep only the records from the last 30 days, otherwise there could be a decrease in performance.
Hello 👋
- Why do you need fields like `is_nullable` and `is_array` [for usage]?

We decided to add these columns to the fields table because we are planning to be able to represent data like the following: `[String!]!`. For this example it will be `is_array=true`, `is_nullable=false` and `is_array_nullable=false`.
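For illustration, a minimal sketch of how those flags could be derived from a field's type node with graphql-js (the function name is just an example):

```typescript
import { Kind, TypeNode } from 'graphql';

// Sketch: unwrap a field's TypeNode into the proposed columns.
// For [String!]! this yields is_array=true, is_nullable=false, is_array_nullable=false.
// Nested lists like [[String]] are out of scope for this sketch.
function describeFieldType(type: TypeNode) {
  let node = type;
  let isArray = false;
  let isNullable = true; // nullability of the named (inner) type
  let isArrayNullable = true; // nullability of the list itself

  if (node.kind === Kind.NON_NULL_TYPE) {
    // The outermost "!" applies to the list if there is one, otherwise to the named type
    node = node.type;
    if (node.kind === Kind.LIST_TYPE) {
      isArrayNullable = false;
    } else {
      isNullable = false;
    }
  }
  if (node.kind === Kind.LIST_TYPE) {
    isArray = true;
    node = node.type;
    if (node.kind === Kind.NON_NULL_TYPE) {
      isNullable = false;
      node = node.type;
    }
  }

  // node is now the NamedType, e.g. String
  return {
    type: node.kind === Kind.NAMED_TYPE ? node.name.value : '',
    isArray,
    isNullable,
    isArrayNullable,
  };
}
```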
- `{ user { name } }`

For this example, we will need to store the query in the `operation` table, `name` in the `field` table and, assuming `name` is of type `String`, we also need to add `String` in the `type` column as a Scalar. With all of that data stored, we can know the usage for the query and also for the attribute `name`, because we will be able to register the usages in the `requested_fields` and `requested_operations` tables.
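For illustration, the rows for this example could look roughly like this (shapes are purely illustrative, reusing the table names from this thread):

```typescript
// Illustrative only: roughly what { user { name } } would add or track,
// using the table names discussed in this thread.
const typeRow = { name: 'String', kind: 'Scalar' };
const fieldRow = { parentType: 'User', name: 'name', type: 'String' };

// Usage side, kept only for the last 30 days
const requestedOperation = { operation: 'user', kind: 'Query', hits: 1 };
const requestedField = { field: 'User.name', hits: 1 };
```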
what are the fields inside each operation

we also need to add the String in the...
That's not going to scale. Here at Pipedrive, we serve >8k requests per minute. That's 8k INSERTs even if you assume only one field is requested per query. At that rate (8,000 × 60 × 24 × 30 ≈ 345M), your MySQL table would have 345M rows by the end of the month.
I would suggest considering this kind of architecture:
- The gateway needs to send the requested query to some queue (pub/sub Redis, or better, Kafka).
- Then some piece of code, preferably written in golang so it can efficiently utilize all CPUs, would fetch the query, parse it into an AST, use graphql's visitor to go through all graph nodes, increase the property count (usage) & store it in memory (see the sketch after this list).
- In the graphql visitor you need to map the queried field onto the current live schema, because `{ user { name } }` has no knowledge of the `User` type.
- Then once in ~1 minute, it would take the data from memory and flush it to MySQL (with a bulk insert).
- The basic & most valuable information is hits per day per property (`User.name: 1`).
- Periodically, you need to clean up old usage info. I'd suggest 5-day usage retention, but for smaller projects I guess it makes sense to have 30 days.
- The more granular & more connected the data you need, the more disk space you need, so these values should be configurable. Ideally you shouldn't have more than 1M rows in a table.
- As for golang: it doesn't matter that much, as long as this processing can be moved to a separate dockerized process. The DB can remain the same.
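A rough sketch of the middle steps (shown here with graphql-js in TypeScript rather than Go; the queue consumer and the persistence call are stand-ins, not code from this repo):

```typescript
// Sketch of the suggested worker: count field usage in memory, flush periodically.
// countQuery() would be fed by whatever queue consumer delivers raw query strings.
import { GraphQLSchema, parse, visit, visitWithTypeInfo, TypeInfo } from 'graphql';

const hits = new Map<string, number>(); // "User.name" -> count

export function countQuery(schema: GraphQLSchema, query: string) {
  const typeInfo = new TypeInfo(schema);
  visit(
    parse(query),
    visitWithTypeInfo(typeInfo, {
      Field(node) {
        // Map the queried field onto the live schema to recover its parent type
        const parentType = typeInfo.getParentType();
        if (!parentType) return;
        const key = `${parentType.name}.${node.name.value}`;
        hits.set(key, (hits.get(key) ?? 0) + 1);
      },
    }),
  );
}

// Once in ~1 minute, drain the counters and bulk insert into MySQL
setInterval(async () => {
  const rows = [...hits.entries()].map(([property, count]) => ({ property, count }));
  hits.clear();
  if (rows.length > 0) {
    await persistUsage(rows); // e.g. a single multi-row INSERT / knex.batchInsert
  }
}, 60_000);

async function persistUsage(rows: { property: string; count: number }[]) {
  // stand-in for the actual bulk insert
}
```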
Hello :)
We discussed your suggestion internally, and for now we are going to focus on the breakdown queries when a schema is received on the `schema/push` endpoint. Meanwhile, we are going to explore a new solution for the schema usage and share it on this thread again.
Thanks for your patience
breakdown queries
Do you mean that when someone pushes the schema, you want to parse `type_defs` and save it in relational form? I guess that may help to build a UI where you can focus on a specific entity or property (like Apollo Studio does). The possible problem there is that it may become inconsistent with the actual `type_defs` that are stored as text, so I assume the text form will remain the source of truth.
Exactly as you said. We will store everything in the database tables to be able to display the model similarly to how Apollo Studio does. And yes, the text form will be the source of truth.
Hello @tot-ra, as mentioned before, we are going to start working on the breakdown feature before planning the schema usage feature. We would like to know your opinion on schema updates (new type, modifying a field, removing a query...): if we encounter a breaking change, since we don't know whether the change is being used by anyone, we are planning to add a header on the `/schema/push` HTTP POST as a "force" mechanism to allow the schema update. By default it will be false, so if we encounter a breaking change, it won't be possible to update the schema.
As soon as the usage feature is working, we will change this behaviour: updating a schema that contains a breaking change will only be allowed if the broken part is not being used.
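To make the idea concrete, a hypothetical sketch of that check (the header name, the Express-style handler and the helper are assumptions, not the actual registry code):

```typescript
// Hypothetical sketch of the proposed "force" escape hatch on /schema/push.
// Header name, payload fields and helper are assumptions, not the registry's real API.
import express from 'express';
import { buildSchema, findBreakingChanges } from 'graphql';

const app = express();
app.use(express.json());

app.post('/schema/push', async (req, res) => {
  const force = req.headers['x-force-push'] === 'true'; // defaults to false
  const currentTypeDefs = await loadCurrentTypeDefs(req.body.name); // stand-in
  const breaking = currentTypeDefs
    ? findBreakingChanges(buildSchema(currentTypeDefs), buildSchema(req.body.type_defs))
    : [];

  if (breaking.length > 0 && !force) {
    return res.status(400).json({ success: false, message: 'Breaking change detected', breaking });
  }

  // ...persist the schema and its breakdown rows as usual
  return res.json({ success: true });
});

async function loadCurrentTypeDefs(serviceName: string): Promise<string | null> {
  return null; // stand-in for fetching the latest stored type_defs
}
```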
Closing this, let's continue in https://github.com/pipedrive/graphql-schema-registry/issues/146