pxf
pxf copied to clipboard
Using a Named Query with Oracle - how exactly works
Hello,
I have a question about usage of PXF - JDBC, namely the possibility of limiting the data that is read on the source side. Going into details, there are two places in the documentation which describe the rule/command of how the attributes work ("query:<query_name>"):
- VMware-Tanzu-Greenplum-Platform-Extension-Framework/6.3/tanzu-greenplum-platform-extension-framework/GUID-jdbc_pxf.html#about-using-named-queries-6
- VMware-Tanzu-Greenplum-Platform-Extension-Framework/6.3/tanzu-greenplum-platform-extension-framework/GUID-jdbc_pxf_named.html#configure-the-named-query-1
Point 1) I understand as general information about, e.g., required consistency of the names, rules of how the reading of source site works, etc.
Point 2) points to the exact rule of how the mechanism regulating the order of the data read from the source side works. Here, I would like to correctly understand this rule of mechanism, namely: a) this section is labelled as an example of JDBC, particularly for the PostgreSQL technology. QUESTION: Can you assume that the logic of the mechanism is identical for the JDBC Oracle? b) two subsections (a)"Configure the Named Query" and (b)"Read the Query Results" point out:
- the Section (a)"Configure the Named Query" - describes how the query is built (with clause WHERE) in regard to reading on the source side. We must distribute in all nodes of the PXF cluster (command pxf cluster sync). QUESTION: (illustrated by the example): When the date field is the limitation and the time range (given in hours) of downloading the data is required, then does the <query_name>.sql file have to be swapped and does the pxf cluster sync command have to be run every time the limitation changes? Thread describing the limitation of the data downloaded from the source side to the Greenplum side.
- the Section (b)"Read the Query Results" i) Point 3 with function ..sum(total) from.. - describes reducing the amount of data sent back to Greenplum Database. QUESTION: Point 3 with function ..sum(total) from.. - How does it work and does it work on the source side? ii) Point 4 with clause ... WHERE total > 100 ... - it is said: "In this example, PXF will add the WHERE filter to the subquery. This filter is pushed to and executed on the remote database system, reducing the amount of data that PXF sends back to Greenplum Database. The GROUP BY aggregation, however, is not pushed to the remote and is performed by Greenplum". QUESTION: How does this clause "... WHERE total > 100 ..." work because I tried this and it does not work on the source side with JDBC Oracle.
My simple testing example: Below, the example shows that using the conditions in a query on the tables of the EXTERNAL type in the SQL client is not transferred via the PXF technology to the external source side via JDBC (which in this case is Oracle). The only possibility which the documentation is pointing out is applying "pxf://query:[file_name]" (in the catalogue "server") as a predefined conditional query with the external JDBC source (Oracle).
CREATE EXTERNAL TABLE gp.test_table_ext (
id text
, ins_date timestamp
, abc text
, def text
... etc
)
LOCATION ('pxf://gp.test_table?PROFILE=Jdbc&SERVER=sabc')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
--Activity performed on the Greenplum-PXF side. [This step, when I attempted the test second time, I added [...WHERE ins_date > '2022-08-08 10:00:00';] to "pxf://query:[file_name]" and use pxf cluster sync - as according to the documentation - and then, it does work]
SELECT
COUNT(*),
MIN(id),
MAX(id)
FROM gp.test_table_ext
WHERE ins_date > '2022-08-08 10:00:00';
--Activity visible on e.g., the Oracle-PXF side
SELECT
COUNT(*),
MIN(id),
MAX(id)
FROM gp.test_table_ext;
Conclusions and questions accumulated from the above passage: QUESTION: Can you assume that the logic of the mechanism is identical for the JDBC Oracle? QUESTION: (illustrated by the example): When the date field is the limitation and the time range (given in hours) of downloading the data is required, then does the <query_name>.sql file have to be swapped and does the pxf cluster sync command have to be run every time the limitation changes? QUESTION: Point 3 with function ..sum(total) from.. - How does it work and does it work on the source side? QUESTION: How does this clause "... WHERE total > 100 ..." work because I tried this and it does not work on the source side with JDBC Oracle.
Additional questions:
QUESTION: When I have a limitation by the WHERE clause in query_name and I want to read all the data from query_name, do I have to use the WHERE clause for the second time?
QUESTION: When I have the WHERE clause in query_name and I want to read all the data from query_name, do I need to use the WHERE clause for the second time and if so, is there a subtle difference if I use it in the Greenplum external table or in the temporary external table?
QUESTION: How does the PXF mechanism work (here, I think about (a) "In this example, PXF will add the WHERE filter to the subquery. This filter is pushed to and executed on the remote database system, reducing the amount of data that PXF sends back to Greenplum Database." vs (b) "The GROUP BY aggregation, however, is not pushed to the remote and is performed by Greenplum")?
(subsidiary) QUESTION: When does the WHERE clause (query_name or in query in Greenplum) work outside and when inside of Greenplum?
(subsidiary) QUESTION: Does the WHERE clause used in a query in Greenplum limit the data downloaded to Greenplum (looking at the documentation, I understand that the subquery limits the downloaded data, but logically thinking, it should not work that way)?
QUESTION: Can you assume that the logic of the mechanism is identical for the JDBC Oracle?
Yes. While the section is an example for Postgres, the logic is the same for all JDBC drivers.
QUESTION: (illustrated by the example): When the date field is the limitation and the time range (given in hours) of downloading the data is required, then does the <query_name>.sql file have to be swapped and does the pxf cluster sync command have to be run every time the limitation changes? Thread describing the limitation of the data downloaded from the source side to the Greenplum side.
PXF does not support passing parameters into the named query. If you need to change the query (for example to change a predicate in a WHERE clause), then you will need to edit the file containing the named query and re-sync PXF's configuration. If a WHERE
clause is dynamic, it is better to not encode it inside the named query, but expose the column and take advantage of predicate push down.
PXF does support predicate push down when using a named query by using the provided named query as a sub-query. Building on the same example as in Example: Using a Named Query with PostgreSQL
SELECT * FROM pxf_queryres_frompg ORDER BY city, total;
-- name | city | total | month
-- ------+--------+-------+-------
-- Matt | Aurora | 120 | 11
-- Tom | Denver | 34 | 7
-- Kim | Denver | 678 | 12
-- (3 rows)
SELECT * FROM pxf_queryres_frompg WHERE city = 'Aurora' ORDER BY city, total;
-- name | city | total | month
-- ------+--------+-------+-------
-- Matt | Aurora | 120 | 11
-- (1 row)
The query that gets executed on the source database looks like the following:
SELECT name, city, total, month FROM (SELECT c.name, c.city, sum(o.amount) AS total, o.month
FROM customers c JOIN orders o ON c.id = o.customer_id
WHERE c.state = 'CO'
GROUP BY c.name, c.city, o.month
) pxfsubquery WHERE city = 'Aurora'
If a dynamic condition must be encoded inside the named query, you can have multiple named queries with different values for the encoded condition to avoid re-syncing the configuration by using corresponding external tables, one per named query. For example, if pg_order_report_state_CO.sql
contains
SELECT c.name, c.city, sum(o.amount) AS total, o.month
FROM customers c JOIN orders o ON c.id = o.customer_id
WHERE c.state = 'CO'
GROUP BY c.name, c.city, o.month
and if pg_order_report_state_MT.sql
SELECT c.name, c.city, sum(o.amount) AS total, o.month
FROM customers c JOIN orders o ON c.id = o.customer_id
WHERE c.state = 'MT'
GROUP BY c.name, c.city, o.month
then you can define and query two different external tables refer to the different named queries
CREATE EXTERNAL TABLE pxf_queryres_frompg_state_co(name text, city text, total int, month int)
LOCATION ('pxf://query:pg_order_report_state_CO?PROFILE=jdbc&SERVER=pgsrvcfg&PARTITION_BY=month:int&RANGE=1:13&INTERVAL=3')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
SELECT * FROM pxf_queryres_frompg_state_co;
-- name | city | total | month
-- ------+--------+-------+-------
-- Matt | Aurora | 120 | 11
-- Kim | Denver | 678 | 12
-- Tom | Denver | 34 | 7
-- (3 rows)
CREATE EXTERNAL TABLE pxf_queryres_frompg_state_MT(name text, city text, total int, month int)
LOCATION ('pxf://query:pg_order_report_state_MT?PROFILE=jdbc&SERVER=pgsrvcfg&PARTITION_BY=month:int&RANGE=1:13&INTERVAL=3')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
SELECT * FROM pxf_queryres_frompg_state_mt;
-- name | city | total | month
-- ------+----------+-------+-------
-- Bill | Helena | 12 | 12
-- Kate | Helena | 456 | 111
-- Erik | Missoula | 12 | 9
-- (3 rows)
QUESTION: Point 3 with function ..sum(total) from.. - How does it work and does it work on the source side?
PXF does not support aggregate push down (outside of the JDBC profile's named query), so the whole column is retrieved by PXF and passed to GPDB which preforms the aggregate. You can see this if you inspect the query plan for the query
EXPLAIN SELECT city, sum(total) FROM pxf_queryres_frompg GROUP BY city;
-- QUERY PLAN
-- --------------------------------------------------------------------------------------------------------------
-- Gather Motion 3:1 (slice2; segments: 3) (cost=0.00..490.55 rows=1000 width=16)
-- -> HashAggregate (cost=0.00..490.50 rows=334 width=16)
-- Group Key: city
-- -> Redistribute Motion 3:3 (slice1; segments: 3) (cost=0.00..490.45 rows=334 width=16)
-- Hash Key: city
-- -> Result (cost=0.00..490.43 rows=334 width=16)
-- -> HashAggregate (cost=0.00..490.43 rows=334 width=16)
-- Group Key: city
-- -> External Scan on pxf_queryres_frompg (cost=0.00..440.90 rows=333334 width=12)
-- Optimizer: Pivotal Optimizer (GPORCA)
-- (10 rows)
The query that gets executed on the source database looks like the following:
SELECT city, total FROM (SELECT c.name, c.city, sum(o.amount) AS total, o.month
FROM customers c JOIN orders o ON c.id = o.customer_id
WHERE c.state = 'CO'
GROUP BY c.name, c.city, o.month
) pxfsubquery
QUESTION: How does this clause "... WHERE total > 100 ..." work because I tried this and it does not work on the source side with JDBC Oracle.
PXF builds the following SQL query for execution on the source database; here the given named query is included as a sub-query and the predicate WHERE total > 100
is appended to the outer query.
SELECT city, total FROM (SELECT c.name, c.city, sum(o.amount) AS total, o.month
FROM customers c JOIN orders o ON c.id = o.customer_id
WHERE c.state = 'CO'
GROUP BY c.name, c.city, o.month
) pxfsubquery WHERE total > 100
My simple testing example: Below, the example shows that using the conditions in a query on the tables of the EXTERNAL type in the SQL client is not transferred via the PXF technology to the external source side via JDBC (which in this case is Oracle). The only possibility which the documentation is pointing out is applying "pxf://query:[file_name]" (in the catalogue "server") as a predefined conditional query with the external JDBC source (Oracle).
CREATE EXTERNAL TABLE gp.test_table_ext ( id text , ins_date timestamp , abc text , def text ... etc ) LOCATION ('pxf://gp.test_table?PROFILE=Jdbc&SERVER=sabc') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); --Activity performed on the Greenplum-PXF side. [This step, when I attempted the test second time, I added [...WHERE ins_date > '2022-08-08 10:00:00';] to "pxf://query:[file_name]" and use pxf cluster sync - as according to the documentation - and then, it does work]
SELECT COUNT(*), MIN(id), MAX(id) FROM gp.test_table_ext WHERE ins_date > '2022-08-08 10:00:00'; --Activity visible on e.g., the Oracle-PXF side
SELECT COUNT(*), MIN(id), MAX(id) FROM gp.test_table_ext;
I recreated your example with Postgres as my external database:
On the Postgres side:
-- create a source table on Postgres & populate with some synthetic data
CREATE TABLE test_table (id serial, ins_date timestamp default current_timestamp, value text);
INSERT INTO test_table(value) VALUES ('abc'), ('def'), ('ghi');
On the Greenplum side:
CREATE EXTERNAL TABLE pxf_test_table (id int, ins_date timestamp, value text) LOCATION ('pxf://test_table?PROFILE=jdbc') FORMAT 'CUSTOM' (formatter=pxfwritable_import);
SELECT COUNT(*), MIN(id), MAX(id) FROM pxf_test_table WHERE ins_date > '2022-08-08 10:00:00';
-- count | min | max
-- -------+-----+-----
-- 3 | 1 | 3
-- (1 row)
The query sent to and executed on the Postgres server was
SELECT id, ins_date FROM test_table WHERE ins_date > '2022-08-08 10:00:00'
PXF should be pushing down any aggregates given in the query to the source database.
QUESTION: When I have a limitation by the WHERE clause in query_name and I want to read all the data from query_name, do I have to use the WHERE clause for the second time?
No, you do not need to repeat any WHERE
clauses from the name query file when querying the table. If you execute SELECT * FROM <PXF-external-table>
then all of the data that is selected by the query in the named query file will be returned.
QUESTION: When I have the WHERE clause in query_name and I want to read all the data from query_name, do I need to use the WHERE clause for the second time and if so, is there a subtle difference if I use it in the Greenplum external table or in the temporary external table?
There is no difference in behavior of PXF when using an external table or a temporary external table. From VMware Tanzu Greenplum - CREATE EXTERNAL TABLE
Temporary external tables exist in a special schema; you cannot specify a schema name when you create the table. Temporary external tables are automatically dropped at the end of a session.
QUESTION: How does the PXF mechanism work (here, I think about (a) "In this example, PXF will add the WHERE filter to the subquery. This filter is pushed to and executed on the remote database system, reducing the amount of data that PXF sends back to Greenplum Database." vs (b) "The GROUP BY aggregation, however, is not pushed to the remote and is performed by Greenplum")?
Currently PXF only supports column projection and predicate push down; it does not support aggregate push down. If a query against a PXF external table includes an aggregation, that will be performed by Greenplum after PXF retrieves the data.
(subsidiary) QUESTION: When does the WHERE clause (query_name or in query in Greenplum) work outside and when inside of Greenplum?
When a query includes a WHERE
clause, PXF will push the predicate to the external database. This limits the amount of data brought back from the external database. When the data is returned from PXF to Greenplum, Greenplum will also apply the filtering operation but it only sees the data that has already been filtered and satisfies the condition.
(subsidiary) QUESTION: Does the WHERE clause used in a query in Greenplum limit the data downloaded to Greenplum (looking at the documentation, I understand that the subquery limits the downloaded data, but logically thinking, it should not work that way)?
Yes, if a Greenplum query against a PXF external table includes a WHERE
clause then the amount of data retrieved by PXF from the external database should be reduced. The exact amount of the reduction will depend on the selectivity of the filter.