
failed to load external pxf jdbc table

Open wangsenyuan opened this issue 1 year ago • 14 comments

We have a PXF JDBC table to a MySQL source; it constantly reports the following error when queried:

[XX000] ERROR: transfer error (18): Transferred a partial file from '127.0.0.1:5888' (libchurl.c:929) (seg6 slice1 x.x.x.x:6000 pid=13301)

The external table is defined as follows:

create external table test_table (
    id_ text,
    tenant_id_ text
)
location ('pxf://test_table?PROFILE=Jdbc&SERVER=mysql-server')
on all
format 'custom' (formatter = 'pxfwritable_import')
encoding = 6;

and the querying SQL is:

insert into pg_test_table select id_, tenant_id_ from test_table order by id_;

wangsenyuan avatar Mar 17 '23 06:03 wangsenyuan

In order for us to provide any help, we will need more information. Which version of PXF are you using? Which version of GPDB? Which Linux distribution?

bradfordb-vmware avatar Apr 03 '23 16:04 bradfordb-vmware

Hi,

I get the same error when using Trino JDBC

ERROR: transfer error (18): Transferred a partial file from '127.0.0.1:5888' (libchurl.c:929) (seg42 slice1 10.10.114.138:6018 pid=763397) (libchurl.c:929) DETAIL: curl error buffer: transfer closed with outstanding read data remaining CONTEXT: External table v_ext_table

Greenplum Database 6.19.4
PXF version 6.5.0
NAME="Red Hat Enterprise Linux Server" VERSION="7.6 (Maipo)"

Maaciekk avatar Oct 26 '23 09:10 Maaciekk

@Maaciekk Can you try upgrading to PXF 6.7.0? This release includes an upgrade to Spring Boot and Tomcat that might help.

bradfordb-vmware avatar Oct 26 '23 16:10 bradfordb-vmware

@bradfordb-vmware Hi, I updated PXF to version 6.8.0, and the query still sometimes fails with the same error:

transfer error (18): Partial file transferred with '127.0.0.1:5888' (libchurl.c:930) (seg53 slice1 10.10.114.139:6005 pid=351281

I have access to the Trino console and I get an error:

io.trino.spi.TrinoException: Query 20231220_082134_13175_aurb7 was abandoned by the client because it may have terminated or has stopped checking query results. Query results were not available since 2023-12-20T09:27:56.877+01:00: currentTime 2023-12-20T09:32:57.823+01:00

The ETL process then restarts, and the query sometimes completes correctly and sometimes has to be restarted several times. The query is a standard SELECT *, split into daily partitions in the LOCATION definition: PARTITION_BY=cast(dt as date):date&RANGE=2023-10-01:2023-11-01&INTERVAL=1:day. Perhaps this will help with the diagnosis.
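
For reference, that partitioning spec lives in the external table's LOCATION string. A minimal sketch of such a definition (the table, column, and server names below are hypothetical, not taken from this thread):

-- PXF issues one JDBC query per one-day interval in the range,
-- so a single Greenplum query fans out into many concurrent JDBC requests
CREATE EXTERNAL TABLE ext_trino_events (dt timestamp, val text)
LOCATION ('pxf://source_schema.events?PROFILE=Jdbc&SERVER=trino-server&PARTITION_BY=cast(dt as date):date&RANGE=2023-10-01:2023-11-01&INTERVAL=1:day')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');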

@EDIT Could this be correlated with the pxf.task.pool.core-size=96 or pxf.task.pool.queue-capacity=480 parameters? I noticed that these problems occur mainly at night, when a lot of PXF-related processing is running.
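
For context, pxf.task.pool.core-size and pxf.task.pool.queue-capacity configure the thread pool that serves PXF requests and live in pxf-application.properties under $PXF_BASE/conf. A sketch showing only the values quoted above (an illustration, not a recommendation):

# $PXF_BASE/conf/pxf-application.properties
pxf.task.pool.core-size=96
pxf.task.pool.queue-capacity=480
# once the queue fills and the pool reaches pxf.task.pool.max-size,
# further requests are rejected

If many partitioned queries run concurrently, as in the nightly ETL window described above, their fragment requests could plausibly saturate this pool.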

Maaciekk avatar Dec 20 '23 08:12 Maaciekk

@bradfordb-vmware

Another common error is:

PXF server error : Query failed (#20240104_090926_26763_aurb7): Current transaction is aborted, commands ignored until end of transaction block (seg6 slice1 10.10.114.137:6006 pid=610476)

PXF server error : Error executing query: java.net.SocketException: Socket closed (seg57 slice1 10.10.114.139:6009 pid=506090)

I noticed that problems arise when multiple ETL processes read data via PXF over JDBC to Trino with PARTITION_BY data partitioning. When processes run separately, errors occur sporadically. Could this indicate some problems with multithreading?

Maaciekk avatar Jan 04 '24 09:01 Maaciekk

@Maaciekk Thank you for the information. We have not been able to reproduce the issue locally, so we're not sure what the underlying cause is. Can you share your external table DDL? Can you share details about the topology of your GPDB cluster (number of hosts, number of segments per host)? Can you share a little more about the number of ETL processes running when you observe increased error rates (a couple of processes, tens, hundreds)? What query is being run? Would it be possible to provide a sample of the GPDB and PXF logs for a failing ETL job and for a successful one?

bradfordb-vmware avatar Jan 05 '24 21:01 bradfordb-vmware

Hello!

We have a similar case in our test environment:
Greenplum Database 6.26.0
PXF version 6.9.0
CentOS Linux release 7.9.2009 (Core)
1 master, 2 segment hosts with 4 logical segments each

Sometimes the error is the same, and sometimes it is different; in both cases the errors are about closed connections. We noticed that the errors are connected with queries that contain a LIMIT clause. We tested this via PXF over JDBC to Postgres.

Example of the errors:

# Example 1
transfer error (18): Transferred a partial file from '127.0.0.1:5888' (libchurl.c:930) 

# Example 2
PXF server error(0) : Recv failure: Connection reset by peer  (seg6 slice1 10.92.40.241:10002 pid=21476)

There is no simple way to reproduce it, but it is possible. Below is a scenario for reproducing the error:

  1. Install pgbench to run tests. We install and run it on the master server: yum install -y postgresql-contrib-9.2.24

  2. Prepare DDL:

DROP TABLE IF EXISTS source_table;
CREATE TABLE source_table (id integer, name text, body text, type  varchar(1000), format varchar(1000), job_id integer) DISTRIBUTED BY (id);
INSERT INTO source_table SELECT gen, 'name' || gen::text, 'body' || gen::text, 'type' || gen::text, 'format' || gen::text, gen from generate_series(1, 1000000) gen;

DROP TABLE IF EXISTS dist_table;
CREATE TABLE dist_table (LIKE source_table)
WITH (appendonly=true, orientation=column, compresstype=zstd, compresslevel=3) DISTRIBUTED BY (id);

DROP EXTERNAL TABLE IF EXISTS ext_read_table;
CREATE readable EXTERNAL TABLE ext_read_table (LIKE source_table)
LOCATION ('pxf://source_table?PROFILE=JDBC&JDBC_DRIVER=org.postgresql.Driver&DB_URL=jdbc:postgresql://mdw:5432/postgres') 
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  3. Run tests:
echo "INSERT INTO dist_table SELECT * FROM ext_read_table limit 1;" > test_with_limit.sql
pgbench -d postgres -c 10 -j 10 -T 600 -f test_with_limit.sql -n

It might take some time to reproduce the error, but usually five minutes is enough.
Attached are logs from the master server and the segment (logs from GP and PXF). The master log contains an error:

2024-01-11 08:48:35.368387 UTC,"gpadmin","postgres",p14905,th1821874304,"[local]",,2024-01-11 08:48:17 UTC,170554,con241,cmd78,seg-1,,dx47994,x170554,sx1,"ERROR","XX000","transfer error (18): Transferred a partial file from '127.0.0.1:5888' (libchurl.c:930)  (seg5 slice1 10.92.40.241:10001 pid=21415) (libchurl.c:930)","curl error buffer: transfer closed with outstanding read data remaining",,,,"External table ext_read_table","INSERT INTO dist_table SELECT * FROM ext_read_table limit 1;",0,,"libchurl.c",930,"Stack trace:

In pxf-service.log, the lines connected with the error can be found by searching for 1704959928-0000047994:default:5.

We also checked whether the error occurs when queries don't have a LIMIT clause. We ran the test on the same environment and didn't receive any errors in that case. The scenario to check:

  1. Prepare DDL:
DROP TABLE IF EXISTS source_table_small;
CREATE TABLE source_table_small (id integer, name text, body text, type  varchar(1000), format varchar(1000), job_id integer) DISTRIBUTED BY (id);
INSERT INTO source_table_small SELECT gen, 'name' || gen::text, 'body' || gen::text, 'type' || gen::text, 'format' || gen::text, gen from generate_series(1, 1) gen;

DROP TABLE IF EXISTS dist_table_small;
CREATE TABLE dist_table_small (LIKE source_table_small)
WITH (appendonly=true, orientation=column, compresstype=zstd, compresslevel=3) DISTRIBUTED BY (id);

DROP EXTERNAL TABLE IF EXISTS ext_read_table_small;
CREATE readable EXTERNAL TABLE ext_read_table_small (LIKE source_table_small)
LOCATION ('pxf://source_table_small?PROFILE=JDBC&JDBC_DRIVER=org.postgresql.Driver&DB_URL=jdbc:postgresql://mdw:5432/postgres') 
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  2. Run tests:
echo "INSERT INTO dist_table_small SELECT * FROM ext_read_table_small;" > test_no_limit.sql
pgbench -d postgres -c 10 -j 10 -T 600 -f test_no_limit.sql -n

Attachments: segment_log.csv, master_log.csv, pxf-service.log

RomaZe avatar Jan 11 '24 10:01 RomaZe

I noticed that reducing FETCH_SIZE from 1,000,000 to 10,000 eliminated the problem, at least with Trino.
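
For anyone else trying this: FETCH_SIZE is an option in the external table's LOCATION string (it can also be set per server as jdbc.statement.fetchSize in that server's jdbc-site.xml). A minimal sketch reusing the source_table reproduction from above:

-- fetch at most 10,000 rows per JDBC round trip instead of the default
CREATE READABLE EXTERNAL TABLE ext_read_table_fs (LIKE source_table)
LOCATION ('pxf://source_table?PROFILE=Jdbc&JDBC_DRIVER=org.postgresql.Driver&DB_URL=jdbc:postgresql://mdw:5432/postgres&FETCH_SIZE=10000')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

A smaller fetch size means more round trips but smaller buffered chunks per response, which may be why it masks the partial-transfer failures.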

Maaciekk avatar Jan 11 '24 13:01 Maaciekk

@RomaZe Thank you for sharing your reproduction. I just tested this in my dev environment and I was able to reproduce the error. I found that the error reproduces very quickly if I run this right after restarting PXF. I'll share this with the rest of the team and dig into this some more.

bradfordb-vmware avatar Jan 12 '24 02:01 bradfordb-vmware

Hi, @bradfordb-vmware. Any updates about this issue? We have faced the same error on pxf 6.8...

avkashin avatar Mar 26 '24 13:03 avkashin

@avkashin Hello! Look at this PR: https://github.com/greenplum-db/pxf/pull/1105

RomaZe avatar Mar 27 '24 08:03 RomaZe

@RomaZe, hello! Thank you, much appreciated!

avkashin avatar Mar 27 '24 08:03 avkashin

@wangsenyuan @RomaZe @Maaciekk - PXF release 6.10.1 included a fix for this issue. Please see if it works for you now and close the issue if so.

Please note that you will still see the partial transfer error if an actual problem occurs after PXF has already started streaming data. However, there should no longer be problems when running regular queries (with or without LIMIT) that do not otherwise fail.
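
For anyone verifying the fix, a minimal check on PXF 6.10.1 or later, reusing the tables from the reproduction earlier in this thread:

-- should now complete without the partial-transfer error
INSERT INTO dist_table SELECT * FROM ext_read_table LIMIT 1;
-- and the non-LIMIT path for comparison
SELECT count(*) FROM ext_read_table;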

denalex avatar Apr 08 '24 22:04 denalex

@denalex, the issue still exists when loading Parquet files from S3.

avkashin avatar Apr 09 '24 10:04 avkashin