gpdb icon indicating copy to clipboard operation
gpdb copied to clipboard

HEADER option for gpfdist doesn't take effect on GP6.21.0

Open hongan-base opened this issue 3 years ago • 4 comments

Bug Report

Greenplum version or build

Database 6.21.0 build commit:d0087e3b24c54d203ca8bb315559205f13cd6393 Open Source

OS version and uname -a

autoconf options used ( config.status --config )

Installation information ( pg_config )

Expected behavior

Actual behavior

Step to reproduce the behavior

Hi Guys:

We just found an issue about EXTERNAL table with gpfdist using greenplum 6.21.0

testdb=# select version(); version

PostgreSQL 9.4.26 (Greenplum Database 6.21.0 build commit:d0087e3b24c54d203ca8bb315559205f13cd6393 Open Source) on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jun 10 2022 01:57:17 (1 row)

We create the external table with HEADER option so that we can ignore the 1st row of the external file, but the result is not as expected, the table will still read the 1st row of the file.

The logs are as here:

[XXXXXX ~]$ cat test.csv AAA|BBB|CCC aaa|bbb|ccc eee|fff|ggg

testdb=# create readable external table test(STARTTIME varchar,PROVINCE varchar,PROVINCE_ID varchar) location ('gpfdist://192.168.88.23:8001/XXX/test.csv') FORMAT 'TEXT' (HEADER DELIMITER '|' ESCAPE 'OFF' ) ; NOTICE: HEADER means that each one of the data files has a header row CREATE EXTERNAL TABLE testdb=# select * from test; NOTICE: External scan from gpfdist(s) server will utilize 64 out of 144 segment databases NOTICE: HEADER means that each one of the data files has a header row starttime | province | province_id -----------+----------+------------- AAA | BBB | CCC aaa | bbb | ccc eee | fff | ggg (3 rows)

This is the first time we use greenplum6,we are not sure where other version of gp6 has the same issue.

Can you help us out of this ? Thanks a million.

hongan-base avatar Jul 22 '22 02:07 hongan-base

PG's header is only used in csv mode.

GPDB doc:

HEADER Specifies that a file contains a header line with the names of each column in the file. On output, the first line contains the column names from the table, and on input, the first line is ignored. This option is allowed only when using CSV format.

Though GPDB have enabled it in text mode In certain cases, users shouldn't rely on it.

From your sql :


create readable external table test(STARTTIME varchar,PROVINCE varchar,PROVINCE_ID varchar)
location
('gpfdist://192.168.88.23:8001/XXX/test.csv')
FORMAT 'TEXT' (HEADER DELIMITER '|' ESCAPE 'OFF' ) ;

It tries to read a csv file but format is text(FORMAT 'TEXT').

Modify it to csv and have a try.

avamingli avatar Jul 29 '22 10:07 avamingli

good catch @avamingli . it should report error (or warning at least) if it doesn't have any affect. we need to improve it.

lij55 avatar Aug 01 '22 23:08 lij55

Should we report an error or warning when creating external table or loading from gpfdist?

xiaoxiaoHe-E avatar Aug 03 '22 06:08 xiaoxiaoHe-E

Hi, my friend @hongan-base . I have repeated your action according to your comment. But I don't get your result. Indeed I get the expected result in my test environment(gpdb 6.21). Following is the file content and DDL for the external table.

AAA | BBB | CCC
aaa | bbb | ccc
eee | fff | ggg
location
('gpfdist://127.0.0.1:7070/test.csv')
FORMAT 'text' (HEADER DELIMITER '|' ESCAPE 'OFF' );

And my result is:

select * from test ;
NOTICE:  HEADER means that each one of the data files has a header row
 starttime | province | province_id
-----------+----------+-------------
 aaa       |  bbb     |  ccc
 eee       |  fff     |  ggg

Thus, I'm wondering where you got the binary file of gpdb running in your environment. The current official gpdb can't repeat your problem. Maybe it is helpful for you to replace your gpdb with an officially released gpdb. @hongan-base

What's more, @avamingli, I found that gpdb doesn't support ESCAPE being 'off', if you change FORMAT from 'TEXT' to 'CSV'.

Hlinbit avatar Sep 14 '22 07:09 Hlinbit