gpdb
gpdb copied to clipboard
HEADER option for gpfdist doesn't take effect on GP6.21.0
Bug Report
Greenplum version or build
Database 6.21.0 build commit:d0087e3b24c54d203ca8bb315559205f13cd6393 Open Source
OS version and uname -a
autoconf options used ( config.status --config )
Installation information ( pg_config )
Expected behavior
Actual behavior
Step to reproduce the behavior
Hi Guys:
We just found an issue about EXTERNAL table with gpfdist using greenplum 6.21.0
testdb=# select version(); version
PostgreSQL 9.4.26 (Greenplum Database 6.21.0 build commit:d0087e3b24c54d203ca8bb315559205f13cd6393 Open Source) on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jun 10 2022 01:57:17 (1 row)
We create the external table with HEADER option so that we can ignore the 1st row of the external file, but the result is not as expected, the table will still read the 1st row of the file.
The logs are as here:
[XXXXXX ~]$ cat test.csv AAA|BBB|CCC aaa|bbb|ccc eee|fff|ggg
testdb=# create readable external table test(STARTTIME varchar,PROVINCE varchar,PROVINCE_ID varchar) location ('gpfdist://192.168.88.23:8001/XXX/test.csv') FORMAT 'TEXT' (HEADER DELIMITER '|' ESCAPE 'OFF' ) ; NOTICE: HEADER means that each one of the data files has a header row CREATE EXTERNAL TABLE testdb=# select * from test; NOTICE: External scan from gpfdist(s) server will utilize 64 out of 144 segment databases NOTICE: HEADER means that each one of the data files has a header row starttime | province | province_id -----------+----------+------------- AAA | BBB | CCC aaa | bbb | ccc eee | fff | ggg (3 rows)
This is the first time we use greenplum6,we are not sure where other version of gp6 has the same issue.
Can you help us out of this ? Thanks a million.
PG's header is only used in csv mode.
GPDB doc:
HEADER Specifies that a file contains a header line with the names of each column in the file. On output, the first line contains the column names from the table, and on input, the first line is ignored. This option is allowed only when using CSV format.
Though GPDB have enabled it in text mode In certain cases, users shouldn't rely on it.
From your sql :
create readable external table test(STARTTIME varchar,PROVINCE varchar,PROVINCE_ID varchar)
location
('gpfdist://192.168.88.23:8001/XXX/test.csv')
FORMAT 'TEXT' (HEADER DELIMITER '|' ESCAPE 'OFF' ) ;
It tries to read a csv file but format is text(FORMAT 'TEXT').
Modify it to csv and have a try.
good catch @avamingli . it should report error (or warning at least) if it doesn't have any affect. we need to improve it.
Should we report an error or warning when creating external table or loading from gpfdist?
Hi, my friend @hongan-base . I have repeated your action according to your comment. But I don't get your result. Indeed I get the expected result in my test environment(gpdb 6.21). Following is the file content and DDL for the external table.
AAA | BBB | CCC
aaa | bbb | ccc
eee | fff | ggg
location
('gpfdist://127.0.0.1:7070/test.csv')
FORMAT 'text' (HEADER DELIMITER '|' ESCAPE 'OFF' );
And my result is:
select * from test ;
NOTICE: HEADER means that each one of the data files has a header row
starttime | province | province_id
-----------+----------+-------------
aaa | bbb | ccc
eee | fff | ggg
Thus, I'm wondering where you got the binary file of gpdb running in your environment. The current official gpdb can't repeat your problem. Maybe it is helpful for you to replace your gpdb with an officially released gpdb. @hongan-base
What's more, @avamingli, I found that gpdb doesn't support ESCAPE being 'off', if you change FORMAT from 'TEXT' to 'CSV'.