V5: optimize setof function performance?
Feature description
As mentioned here: custom-queries
LIMIT/OFFSET pagination has a performance problem once there are more than about 100,000 records.
In V4 we implement our custom setof functions in a rather complicated way.
I hope V5 can provide some kind of framework for creating optimized setof functions.
Motivating example
Graphile is very powerful and its performance is very fast.
But setof functions are something of a hidden bomb.
In our recent project, the development stage went well.
But after deploying to production and importing 1,000,000 real data records,
the setof functions became very slow.
We have tested it: OFFSET 900,000 records with LIMIT 10 takes over 15s, which is unacceptably slow. To solve it, we currently implement our custom-queries in a very complicated way.
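For reference, the slow pattern is essentially a large-offset query like the one below (the table and ordering column are just illustrative); PostgreSQL has to produce and then discard every skipped row, so the cost grows linearly with the offset:

-- Hypothetical illustration of why a large OFFSET is slow: to return 10 rows,
-- PostgreSQL must still generate and throw away the first 900,000 sorted rows.
SELECT *
FROM test.test_aaa
ORDER BY ct DESC
OFFSET 900000
LIMIT 10;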
Here is our V4 solution:
Our V4 custom-queries
As the doc says: One way to solve this is to have your function apply its own internal limits and filters which can be exposed as GraphQL field arguments - if you reduce the amount of data that the function can produce (e.g. to 100 rows) then it reduces the potential cost of having this function in your schema.
The primary goal is to keep the setof function's input/output unchanged (no breaking change for clients). We implement our custom-queries in the following (complicated) way:
1. Pass args to the setof function (the same args as a Graphile setof function: first, offset, last, after, before)
CREATE OR REPLACE FUNCTION test.fn_test_setof_optimize_aaa(_filters jsonb[] DEFAULT NULL,
_orders jsonb[] DEFAULT NULL,
first int DEFAULT NULL,
"offset" int DEFAULT NULL,
last int DEFAULT NULL,
after text DEFAULT NULL,
before text DEFAULT NULL)
RETURNS test.test_aaa_page_connection -- return type is custom, not setof
2. Add a schema hook plugin to give the after and before args the GraphQL Cursor type
Because a text argument is not recognised as the GQL Cursor type, a plugin is needed to do this:
const { makeProcessSchemaPlugin } = require('graphile-utils')

// Swap the String type for the Cursor type on the after/before args of our
// fn* query fields so they appear as GraphQL Cursors.
module.exports = makeProcessSchemaPlugin(schema => {
  const stringType = schema._typeMap.String
  const fields = schema._typeMap.Query._fields
  const fnNames = Object.keys(fields)
  for (const fnName of fnNames) {
    if (fnName.startsWith('fn')) {
      const args = fields[fnName].args
      if (Array.isArray(args)) {
        for (const arg of args) {
          const type = arg.type
          if (['after', 'before'].includes(arg.name)
            && typeof type === 'object' && type !== null
            && typeof stringType === 'object' && stringType !== null
            && type.constructor === stringType.constructor) {
            arg.type = schema._typeMap.Cursor
          }
        }
      }
    }
  }
  return schema
})
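For completeness, the plugin is registered via the standard V4 appendPlugins option. A rough sketch (the file path and app wiring are just for illustration):

// Hypothetical registration; assumes an Express-style app and that the
// plugin above lives in ./cursor-arg-plugin.js
const { postgraphile } = require('postgraphile')
const cursorArgPlugin = require('./cursor-arg-plugin')

app.use(
  postgraphile(process.env.DATABASE_URL, 'test', {
    appendPlugins: [cursorArgPlugin],
  })
)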
3. Add the setof context via pgSettings
const gql = require('graphql-tag')
const _ = require('lodash')

pgSettings: async (req) => {
  // gqlString is the raw GraphQL query text taken from the incoming request
  const gqlParsed = gql`${gqlString}`
  // recursive helper that walks the parsed query's selection sets
  setContextOpt(gqlParsed, 'definitions[0].selectionSet.selections', [])
  // ...
  const fieldsToCheck = ['totalCount', 'nodes', 'edges', 'pageInfo']
  // ...
  subSelections.forEach(subSelection => {
    const subSelectionName = _.get(subSelection, 'name.value', '')
    if (fieldsToCheck.includes(subSelectionName)) {
      ret[`opt.setof.${underscoreName}.has${subSelectionName.charAt(0).toUpperCase() + subSelectionName.slice(1)}`] = true
    }
  })
  // ...
  // pgSettings output looks like:
  // opt.setof.fn_test_setof_optimize_aaa.hasTotalCount: true
  // opt.setof.fn_test_setof_optimize_aaa.hasNodes: true
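For reference, here is a self-contained sketch of the same idea (the helper name, the camelCase-to-snake_case mapping, and inspecting only top-level query fields are simplifications of mine; the real code walks the selection set recursively):

// Hypothetical, simplified version of the context-derivation logic above.
const gql = require('graphql-tag')
const _ = require('lodash')

const fieldsToCheck = ['totalCount', 'nodes', 'edges', 'pageInfo']

function setofContextFromQuery(gqlString) {
  const ret = {}
  const parsed = gql`${gqlString}`
  const selections = _.get(parsed, 'definitions[0].selectionSet.selections', [])
  for (const selection of selections) {
    const fieldName = _.get(selection, 'name.value', '')
    if (!fieldName.startsWith('fn')) continue
    // e.g. fnTestSetofOptimizeAaa -> fn_test_setof_optimize_aaa
    const underscoreName = fieldName.replace(/[A-Z]/g, (c) => '_' + c.toLowerCase())
    const subSelections = _.get(selection, 'selectionSet.selections', [])
    for (const subSelection of subSelections) {
      const subSelectionName = _.get(subSelection, 'name.value', '')
      if (fieldsToCheck.includes(subSelectionName)) {
        const suffix = subSelectionName.charAt(0).toUpperCase() + subSelectionName.slice(1)
        ret[`opt.setof.${underscoreName}.has${suffix}`] = 'true'
      }
    }
  }
  return ret
}

// Usage inside the PostGraphile options (assumes the GraphQL query text is
// available on the parsed request body):
// pgSettings: async (req) => setofContextFromQuery(req.body.query)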
4. Create the custom setof return types (actually, this should be step 1)
CREATE TYPE test.type_test_aaa_edge AS (
cursor varchar,
node test.test_aaa
);
CREATE TYPE test.test_aaa_page_connection AS (
nodes test.test_aaa[],
edges test.type_test_aaa_edge[],
total_count INT,
page_info basic.type_basic_page_info
);
5. Finally, optimize the custom function body:
IF COALESCE(CURRENT_SETTING('opt.setof.' || 'fn_test_setof_optimize_aaa' || '.hasTotalCount', TRUE), '') = 'true' THEN
  -- ...
  -- set total_count in test.test_aaa_page_connection
END IF;
IF COALESCE(CURRENT_SETTING('opt.setof.' || 'fn_test_setof_optimize_aaa' || '.hasNodes', TRUE), '') = 'true'
   OR COALESCE(CURRENT_SETTING('opt.setof.' || 'fn_test_setof_optimize_aaa' || '.hasEdges', TRUE), '') = 'true'
   OR COALESCE(CURRENT_SETTING('opt.setof.' || 'fn_test_setof_optimize_aaa' || '.hasPageInfo', TRUE), '') = 'true'
THEN
  -- ...
  -- with offset and first: an index-assisted, two-step query that returns only the needed records (e.g. 10 rows)
  -- with after and first: fast cursor-location query, the best performance
  -- build our own cursor text for each edge
  -- fetch one extra row to determine hasNextPage
END IF;
We also need to optimize the handling of _orders and _filters, use a _uniq_key to build the cursor, and add multicolumn indexes for performance.
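To illustrate the "fast cursor location" idea (the table, columns, and values below are hypothetical, not our actual framework code): with a multicolumn index matching the ORDER BY, the decoded after cursor becomes a seek predicate instead of a large OFFSET:

-- Hypothetical sketch of keyset ("after" cursor) pagination.
CREATE INDEX IF NOT EXISTS idx_test_aaa_ct_id ON test.test_aaa (ct DESC, id DESC);

-- Suppose the decoded cursor says the last returned row had ct = '2024-01-01' and id = 900000.
-- Seek directly past it, and fetch first + 1 rows so hasNextPage can be derived from the extra row.
SELECT *
FROM test.test_aaa
WHERE (ct, id) < ('2024-01-01', 900000)  -- row comparison is valid because both columns are DESC
ORDER BY ct DESC, id DESC
LIMIT 11;  -- first (10) + 1 extra row for hasNextPage
-- If the ordering mixes ASC and DESC, the predicate must be expanded instead,
-- e.g. ct < $1 OR (ct = $1 AND id > $2).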
After those optimizations we successfully reduced the function time from 15s to 50~500ms, and the client did not need to change any code.
Suggested V5 solution
There would be no need for steps 1~4; just use smart tags and only implement the step-5 PG function.
Smart tags something like this:
CREATE OR REPLACE FUNCTION test.fn_test_aa_bb_cc(...)
RETURNS jsonb -- include: total_count, nodes, edges, page_info
COMMENT ON FUNCTION test.fn_test_aa_bb_cc IS
E'@setofFnOpt\n@setofFnOptOutType test.test_table_kkk';
One more question
Is it possible to implement this in V4 via a smart tag, with a plugin doing all the work? Currently this solution uses a hook plugin, the pgSettings function, and PG custom types, which doesn't seem very elegant.
The most correct way to address this is to replace the function with a schema plugin that does the right thing and isn't a "black box" from a PostgreSQL optimization point of view. I'd strongly encourage this approach, especially for "computed column" functions where they may be called multiple times in the same GraphQL request (once for each row in a parent collection). You get to retain all of the flexibility of filters/conditions/cursor pagination whilst also fixing your performance issues.
Another option would be to expose the function without all the regular pagination stuff and instead only expose the arguments the function itself has. You can then add whatever pagination features make sense to your function via its arguments (limit, offset, orderBy, etc) without worrying about what PostGraphile supports natively. Doing this with a connection is more challenging though.
The most correct way to address this is to replace the function with a schema plugin that does the right thing and isn't a "black box" from a PostgreSQL optimization point of view.
I'm not quite sure what kind of schema plugin isn't a "black box". Do you mean writing the pagination code in Node.js rather than in PG?
If you look at my solution, it's a kind of patch on top of the setof function framework. Most importantly, this patch does not cause a breaking change to the input and output automatically generated for the original setof function, so clients do not need to modify their code. It's a common method for all of my setof functions: make the "black box" aware of the context so that it only fetches the necessary data. In fact, steps 1-3 only need to be written once; each new setof function only needs to implement steps 4-5. Currently, for data sets with fewer than 100,000 rows we retain the original setof functions; for data sets with more than 100,000 rows we enable the setof function optimization patch.
This solution works, taking performance from 15s down to 50~500ms. I raised this feature request because I feel that my current solution is not very elegant: every new setof function needs to define a new connection type in PG. It would be better if PostGraphile itself provided such a setof function optimization framework.
The reason I think this setof optimization framework is important is that, in real projects, 60% of the functions are about obtaining various kinds of list (paginated) data, i.e. setof functions. If the PostGraphile library itself does not provide an optimization framework for setof, then I feel that this library is more of a technical library than a product-oriented library.
What I mean is not that the framework should directly optimize the performance of setof functions, but that it should provide a design pattern that allows developers to quickly implement performance optimization for setof functions by following certain rules.
Thank you for raising this issue, it has triggered me to improve coverage in the documentation - I've written more detail of this, including an example, up at: https://github.com/graphile/crystal/pull/2729; you can view the (somewhat) rendered markdown here: https://github.com/graphile/crystal/blob/func-perf/postgraphile/website/postgraphile/functions.md#function-inlining-or-lack-thereof
I'm not quite sure what kind of schema plugin isn't a "black box". Do you mean writing the pagination code in Node.js rather than in PG?
I think you've misunderstood me entirely, maybe you think I'm using "black box" as an opinionated phrase like "code smell". I am not; I'm talking about how PostgreSQL optimizes queries.
From the PostgreSQL planner point of view, most functions are a "black box" in terms of optimization, so Postgres is forced to call them each time, and cannot optimize them into the query plan - it can't turn them into a subquery or join, it can't push "order by" or "where" clauses down into the inner query, it cannot really optimize them at all - they're literally a "black box" as far as the PostgreSQL optimizer is concerned. This can often result in really poor performance. Not least because if you apply an "order by" to a function, it must first compute the entire results of the function, store them into temporary storage (which may or may not result in writing them to disk - but in your case very likely does due to the quantity of data!) and then apply ordering to this result set.
This is not true of all functions, some can be inlined - to see a rough set of rules, please have a read of this wiki page:
https://wiki.postgresql.org/wiki/Inlining_of_SQL_functions
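For illustration (hypothetical table and column names), compare a simple language sql function, which meets those inlining rules and can be expanded into the calling query, with the equivalent plpgsql function, which cannot:

-- Hypothetical example: a single-SELECT, stable, non-SECURITY DEFINER
-- "language sql" function can be inlined, so WHERE/ORDER BY/LIMIT from the
-- calling query are planned against the underlying table (and its indexes).
CREATE FUNCTION active_things()
RETURNS SETOF things
LANGUAGE sql STABLE
AS $$
  SELECT * FROM things WHERE archived_at IS NULL;
$$;

-- The same logic in plpgsql is a black box: the planner must execute the whole
-- function, materialize its result, and only then sort/limit it.
CREATE FUNCTION active_things_plpgsql()
RETURNS SETOF things
LANGUAGE plpgsql STABLE
AS $$
BEGIN
  RETURN QUERY SELECT * FROM things WHERE archived_at IS NULL;
END;
$$;

-- Only the first version lets this ORDER BY / LIMIT use an index on "things":
SELECT * FROM active_things() ORDER BY created_at DESC LIMIT 10;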
Plugins, on the other hand, inject SQL directly into the query - and since the SQL is explicitly in the query issued to the database, PostgreSQL can do its normal query rewriting and plan optimization against it. As such, from the PostgreSQL planner's perspective, SQL added by plugins is an open book, ripe for optimization.
Given this information: when you're hitting performance issues with SQL functions, and in particular with "computed column" SQL functions, your best bet is typically to move them from database functions into schema plugins.
It would be better if PostGraphile itself provided such a setof function optimization framework.
I agree; however there's a lot of complexity in building out functions in the database that are exposed as GraphQL connections. What you've achieved is quite impressive (and very well done for achieving it!) but it exposes a lot of the complexities of GraphQL cursor pagination to the database, and there are a huge number of subtleties in getting this right - especially if you want things like aggregates, custom ordering, advanced filtering plugins, etc.
This is why we are instead focusing on making it so you can move the presentational logic into the JS layer, such that PostgreSQL can effectively optimize your queries for you. I believe that the plugin approach has far fewer trade-offs than the in-db version.
Imagine a poorly performing function such as:
create function users_filtered_things(u users, include_archived boolean)
returns setof things as $$
declare
  sql text;
begin
  -- Dynamically construct our SQL
  sql := format(
    'select t.* from things t where t.user_id = $1%s',
    case when include_archived then '' else ' and archived_at is null' end
  );
  -- Execute the SQL
  return query execute sql using u.id;
end;
$$ language plpgsql stable;
We could transform this function as you describe, or we can port just its existing logic to a plugin:
import { extendSchema } from "postgraphile/utils";

export const MyPlugin = extendSchema((build) => {
  const {
    sql,
    grafast: { connection, lambda },
    pgResources: { things },
  } = build;
  return {
    typeDefs: /* GraphQL */ `
      extend type User {
        filteredThings(includeArchived: Boolean): ThingsConnection!
      }
    `,
    objects: {
      User: {
        plans: {
          filteredThings($user, { $includeArchived }) {
            // Create a pgSelect step representing all things owned by this user:
            const $things = things.find({ user_id: $user.get("id") });
            // Apply dynamic changes to the query based on runtime values:
            $things.apply(
              lambda(
                $includeArchived /* ← plan-time step */,
                (includeArchived /* ← runtime value */) =>
                  // At runtime, this callback will be called to manipulate the
                  // SQL query before it is executed
                  (queryBuilder) => {
                    if (!includeArchived) {
                      queryBuilder.where(sql`archived_at is null`);
                    }
                  },
              ),
            );
            // Wrap this collection in a connection
            return connection($things);
          },
        },
      },
    },
  };
});
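As a usage note, a plugin like this is simply added to your graphile-config preset. A rough sketch following the V5 docs (import paths and file names are assumptions for illustration):

// graphile.config.ts — hypothetical wiring for the plugin above.
import { PostGraphileAmberPreset } from "postgraphile/presets/amber";
import { makePgService } from "postgraphile/adaptors/pg";
import { MyPlugin } from "./my-plugin";

export default {
  extends: [PostGraphileAmberPreset],
  plugins: [MyPlugin],
  pgServices: [makePgService({ connectionString: process.env.DATABASE_URL })],
};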
If the PostGraphile library itself does not provide an optimization framework for setof, then I feel that this library is more of a technical library than a product-oriented library.
The plugin approach means the developer doesn't need to worry about totalCount, cursors, orderBy, condition, filter, and the rest - it's automatically handled for them. This is precisely because we are a product-oriented library. One of the design goals of PostGraphile is to put the shape of the GraphQL schema first and give you as much control over it as possible. Its aim is not to enable you to do everything you need to do with just SQL, otherwise we wouldn't have built such a powerful plugin system.
For setof functions, doing the work above is, I would argue, significantly easier with plugins than it is to map all the complexity into PostgreSQL logic and types.
Note also that the code above is using generic extension of the API via extendSchema() and thus has a lot of boilerplate - over time we will hopefully make new plugin factories that can help you accomplish the same task in far fewer lines of code. You could imagine in future we might add something like:
const myPlugin = extendSchema(/* GraphQL */ `
  extend type User {
    filteredThings(includeArchived: Boolean): ThingsConnection!
      @pg(alias: "t")
      @pg(where: "t.user_id = $User.id")
      @pg(where: "t.archived_at is null", unless: "$$includeArchived")
      @pg(orderBy: "t.weight desc")
  }
`);
Thank you for your patient explanation. I think I have now understood the principle behind the schema plugin solution: optimize performance at the point where the JS layer generates the SQL, avoid PG setof functions (since PG's offset/limit has always been a performance bottleneck), and migrate the logic from PG functions into the JS code that generates the SQL, essentially deconstructing the "black box" functions. The PostgreSQL planner is better suited to optimizing SQL statements than "black box" functions. From a technical perspective, this is indeed the better solution.
Initially, I also considered optimizing performance at the SQL statement generation point (the plugin), but this brings up another bigger question: why do we use the PostGraphile library? This is a bit of a long story.
It's mainly based on three reasons: stability, efficiency, and convenience. In the past, all our server code was written at the JS layer and PG was only used for data storage. After accumulating hundreds of JS business-logic functions (especially when the skills of the developers writing them varied), the JS server became very unstable or extremely slow for various reasons. After trying many solutions, we finally chose Koa + PostGraphile to build our new server. The core design is: Node.js is used for the common framework, proxying API and GQL access to PG; business logic is written in PG functions as much as possible (letting the PostGraphile library automatically translate them into GQL interfaces).
The biggest advantage of this new architecture is that the whole system has become more efficient and stable. For example:
- Setting ignoreIndexes: false ensures the performance of automatically generated GQL interfaces.
- Even in the worst case, a custom fn function will only fail with an exception in its own independent PG request, without endangering the entire service. (In the past, a bug in just one JS function could crash the entire JS service process at any time.)
- Convenience of code updates: we only need to run CREATE OR REPLACE FUNCTION in PG, and the PostGraphile library automatically detects and reloads it. (In the past, every time we modified a function we had to repackage, compile, deploy, and restart the JS service, which was too inefficient.)
Going back to the solution you proposed: technically it's very good. But when it comes to actual business code, it may not be as simple as the sample code, which leads to these problems:
- A single business-logic plpgsql PG function may involve: args handling, authentication, multi-table queries, intermediate variables, different IF-ELSE branches, LOOPs, data assembly, etc. Turning these into concatenated SQL statements would be an extremely difficult task. Moreover, some plpgsql code technically cannot be fully implemented with dynamically stitched SQL.
- Understanding the code of a PG function is far easier than understanding JS-stitched SQL code, especially when that code has to be maintained in the later stages of the project.
- Based on our past experience, the more business logic code there is in JS, the more unstable the server becomes (which, of course, is related to the developers' skill levels).
- The issue of compilation and deployment: previously we only needed CREATE OR REPLACE FUNCTION; after migrating to the JS layer, even the tiniest modification requires going through the deployment process again, which is too inefficient. Most fatally, it requires restarting the JS server.
This is why my solution (although it is very complex) was to optimize performance by passing context into the PG-layer setof function (the "black box"): I don't want to break the existing overall framework, and I want to keep the JS layer simple. This is also why I said we need a product-oriented library; PostGraphile fits this very well, but I think PostGraphile can do even better.
Additionally, I think "computed column" performance should be a separate issue for discussion (as it may require a different solution); what I primarily want to focus on here is the performance of setof functions.
The business logic of the plpgsql function can almost certainly be replicated in the JS layer unless it's a security definer function (and you shouldn't have people writing those unless you 100% trust their skill levels). My example was simple because you've not presented a before and after for a function following your framework, only the instructions; if you have a more complex example I'd love to discuss it concretely. Ideally if you could provide a full schema that can run against https://github.com/benjie/ouch-my-finger that would help make the discussion more concrete, and could even serve as the basis of a test suite when this feature gets integrated into PostGraphile!
But yes, I would very much like to optimize setof functions themselves (not least for the security definer case), however there is a LOT to be aware of here (as I previously raised) and my previous attempts led to significant logic having to be pushed into the functions themselves - with people having to understand how to construct and consume cursors, how to digest complex order by, etc. I recently overhauled connections so I was planning to revisit this with simplified pagination, but right now my priority is on getting V5 out the door.
Once again, I still insist that such optimizations should not be implemented at the JS layer, as this goes against the original intention behind my choice of PostGraphile.
I understand that optimizing the performance of setof functions at the PG layer is a complex issue involving multiple factors. The solution I proposed, which incorporates the setof function context, can serve as an approach to address its performance issues. I hope you can consider it as an optional choice if you plan to add performance optimization features for setof functions later on.
I am looking forward to the release of V5 and hope to see if there will be a better solution to this problem under the new architecture at that time.
Please do provide a full sample schema (with before and after) for us to work with.
I still insist that such optimizations should not be implemented at the JS layer, as this goes against the original intention behind my choice of PostGraphile.
In my opinion it should be possible both ways so people can make their choice. I strongly disagree that it "should not be implemented at the JS layer", but I also agree that you should be able to do it at the Postgres layer if you want, and if we can achieve it without too many trade-offs.
Currently the best option that is available, in my opinion, is via the JS layer.
For a Postgres solution, I do have a few opinions:
- I don't think GraphQL semantics should leak too much into the database - IMO the database functions should be usable from any consumer of the database.
- I don't think we should be handing things around using the Postgres JSON/JSONB types because they're essentially untyped and far more likely to result in software bugs
- I think functions should only expose what's performant, so we should try and embrace the regular function arguments as much as possible (e.g. don't add PostGraphile's condition/filter arguments, instead expose only what the function exposes), but this does mean that it would use different types which makes evolution more challenging (e.g. moving to JS layer later)
- Similarly, functions probably shouldn't offer aggregates unless computing the full function result is cheap. This also relates to the type used to represent the function result (e.g. a per-table FoosConnection with aggregates vs a per-function MyFunctionNameConnection without)
I do plan to work on making it so that setof functions can be implemented more efficiently, but there's a lot of things that I'd like to keep in mind whilst doing so and the solutions to them are currently unclear.
I agree that two solutions (JS and PG) should be provided for users to choose from. In our specific usage scenario, we may lean more toward the PG version.
This is the simplest implementation example of our solution. The core idea is actually very straightforward:
- Retrieve the context from current_setting to understand which data needs to be returned, use first to determine how much of it should be returned, and use offset / after to locate the page. This fundamentally solves the performance issue of the setof function (compared with the old-style setof function, where RETURN QUERY returns all the data).
- Return the data in a fixed format.
(The somewhat complex part: generating the unique cursor required for page_info and edges, based on _uniq_orders (the primary key) and _orders.)
Here is the code (there is very little data in the default_portrait table; I only use this function for demonstration purposes):
-- Step 1: Define Edge type
CREATE TYPE basic.type_get_page_user_default_portrait_edge AS
(
cursor varchar,
node basic.basic_user_default_portrait
);
-- Step 2: Define Connection type
CREATE TYPE basic.type_get_page_user_default_portrait_connection AS
(
nodes basic.basic_user_default_portrait[],
edges basic.type_get_page_user_default_portrait_edge[],
total_count INT,
page_info basic.type_basic_page_info
);
-- Step 3: Define fn function (simplest version)
DROP FUNCTION IF EXISTS basic.fn_basic_get_page_user_default_portrait_simple;
CREATE OR REPLACE FUNCTION basic.fn_basic_get_page_user_default_portrait_simple(first int DEFAULT 10,
"offset" int DEFAULT NULL,
after text DEFAULT NULL)
RETURNS basic.type_get_page_user_default_portrait_connection AS
$$
DECLARE
v_result basic.type_get_page_user_default_portrait_connection;
v_page_result_with_total basic.type_basic_page_result_with_total;
v_generic_result basic.type_basic_page_result_generic;
BEGIN
v_page_result_with_total = basic.ifn_basic_get_page(
-- _fn_name: get context from current_setting, optimize performance by querying only the required data
-- COALESCE(CURRENT_SETTING('opt.setof.' || 'fn_basic_get_page_user_default_portrait_simple' || '.hasTotalCount', TRUE), '')
-- COALESCE(CURRENT_SETTING('opt.setof.' || 'fn_basic_get_page_user_default_portrait_simple' || '.hasNodes', TRUE), '')
-- COALESCE(CURRENT_SETTING('opt.setof.' || 'fn_basic_get_page_user_default_portrait_simple' || '.hasEdges', TRUE), '')
-- COALESCE(CURRENT_SETTING('opt.setof.' || 'fn_basic_get_page_user_default_portrait_simple' || '.hasPageInfo', TRUE), '')
'fn_basic_get_page_user_default_portrait_simple',
-- _base_sql
'select * from basic.basic_user_default_portrait where enable = 1',
-- _uniq_orders(primary key): work with _orders for generating a unique Cursor text
ARRAY ['{"key": "id", "type": "ASC"}']::jsonb[],
-- pass arg: first
first,
-- pass arg: offset
"offset",
-- pass arg: after
after,
-- _orders
ARRAY ['{"key": "ct", "type": "DESC"}']::jsonb[],
-- _filters
NULL,
-- _options
NULL
);
-- Extract results
-- total_count
v_result.total_count = v_page_result_with_total.total_count;
v_generic_result = v_page_result_with_total.result;
-- page_info
v_result.page_info = v_generic_result.page_info;
-- nodes
SELECT COALESCE(ARRAY_AGG(JSONB_POPULATE_RECORD(NULL::basic.basic_user_default_portrait, node_element)), '{}')
INTO v_result.nodes
FROM JSONB_ARRAY_ELEMENTS(v_generic_result.nodes) AS node_element;
-- edges
SELECT COALESCE(
         ARRAY_AGG(
           ROW (
             edge_element ->> 'f1',
             JSONB_POPULATE_RECORD(NULL::basic.basic_user_default_portrait, edge_element -> 'f2')
           )::basic.type_get_page_user_default_portrait_edge
         ),
         '{}')
INTO v_result.edges
FROM JSONB_ARRAY_ELEMENTS(v_generic_result.edges) AS edge_element;
RETURN v_result;
END;
$$ LANGUAGE plpgsql STABLE
SECURITY DEFINER;
Since performance is a key goal, I strongly recommend that you move from using jsonb_ to json_ methods; in my benchmarking they're about 5x faster on output (I think this is because json_ is essentially just text, so concatenation and conversion to text is trivial; jsonb has more overhead because it needs to be converted to binary and back, probably with additional validation steps).
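For example (a hypothetical aggregation with illustrative column names), the swap on the output path is usually mechanical:

-- jsonb_ version: builds the binary jsonb representation, then converts it
-- back to text when the result is sent to the client.
SELECT jsonb_agg(jsonb_build_object('id', t.id, 'ct', t.ct))
FROM basic.basic_user_default_portrait t;

-- json_ version: essentially text concatenation, cheaper when the value is
-- only being serialized straight back out.
SELECT json_agg(json_build_object('id', t.id, 'ct', t.ct))
FROM basic.basic_user_default_portrait t;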
Retrieve the context from current_setting to understand which data needs to be returned, use first to determine how much of it should be returned, and use offset / after to locate the page. This fundamentally solves the performance issue of the setof function (compared with the old-style setof function, where RETURN QUERY returns all the data).
This won't work if you have a nested structure such as Hacker News or Reddit-style nested comments:
{
  comment(id: 1) {
    text
    comments(first: 10, after: "...") {
      nodes {
        text
        comments(first: 2, after: "...") {
          nodes {
            text
          }
        }
      }
    }
  }
}
I'd strongly encourage you to pass parameters via parameters, rather than via ambient ("global") settings.
This won't work if you have a nested structure such as Hacker News or Reddit-style nested comments
Please take a close look at my sample code. The CURRENT_SETTING context only includes boolean flags such as xxx.hasTotalCount, xxx.hasNodes, xxx.hasEdges, and xxx.hasPageInfo. In contrast, parameters like first, offset, and after are passed to the function directly, so the problem you mention will not occur. (In nested scenarios the worst case is just fetching a bit more data: for example, if totalCount is not requested at the 2nd level but is requested at the 1st level, then every call to the same-named function will retrieve totalCount.)
The CURRENT_SETTING context is a feature we added later, designed to trim the returned data. For instance, take the simplest request: if you only pass first and offset and expect nodes as the output, the context ensures that the fn function doesn't need to fetch data like totalCount, edges, or pageInfo, and certainly doesn't need to generate the cursor text either. Fields for which no return is required simply come back NULL, which reduces SQL query time and data volume, optimizing performance.
I haven't conducted actual tests of jsonb_ versus json_ performance. However, in list (pagination) scenarios we usually fetch 10 or 20 items per page, so the performance difference should not be significant. Of course, if ultimate performance is the goal we should switch to json_; the reason for using jsonb_ is simply that it has more powerful functions and it has become a habit. We will try json_ later.