[Bug] Foreign table(COPY FROM) can't skip lines for invalid multi-byte-encoding text

Open gfphoenix78 opened this issue 1 month ago • 0 comments

Apache Cloudberry version

main branch

What happened

create  external web  table t3(a int, b text)
LOCATION ('http://<ip>:<port>/bad_gb.txt')
FORMAT 'TEXT' (DELIMITER ','  NULL '' )  ENCODING 'GB18030'
LOG ERRORS SEGMENT REJECT LIMIT 2;
select * from t3;

output:

gpadmin=# select * from t3;
ERROR:  segment reject limit reached, aborting operation  (seg0 slice1 127.0.1.1:7002 pid=2316762)
DETAIL:  Last error was: invalid byte sequence for encoding "GB18030": 0xa3 0x0a
CONTEXT:  External table t3, line 3 of file http://.../bad_gb.txt

bad_gb.txt: encoding GB18030

gpadmin@hashdata:/tmp/www$ hexdump -C bad_gb.txt
00000000  31 2c ca c0 bd e7 0a 32  2c c4 e3 ba c3 c2 f0 a3  |1,.....2,.......|
00000010  0a 33 2c 6e 69 68 61 6f  0a                       |.3,nihao.|
00000019
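
To make the input easy to recreate, here is a minimal Python sketch that rebuilds the file from the bytes in the hexdump above and checks each newline-delimited line with Python's own `gb18030` codec. The names `BAD_GB`, `write_fixture`, and `check_lines` are made up for this illustration, and Python's decoder is only a stand-in for the server-side encoding check, not the code Cloudberry runs.

```python
# Bytes copied from the hexdump in this report (0x19 = 25 bytes, three lines).
BAD_GB = bytes.fromhex(
    "31 2c ca c0 bd e7 0a"            # line 1: "1," + two 2-byte characters
    "32 2c c4 e3 ba c3 c2 f0 a3 0a"   # line 2: ends with a lone lead byte 0xa3
    "33 2c 6e 69 68 61 6f 0a"         # line 3: "3,nihao"
)


def write_fixture(path="bad_gb.txt"):
    """Write the test input to disk (hypothetical helper, default path assumed)."""
    with open(path, "wb") as f:
        f.write(BAD_GB)


def check_lines(data, encoding="gb18030"):
    """Return (line_number, is_valid, decoded_text_or_error_reason) per line."""
    results = []
    for lineno, raw in enumerate(data.splitlines(), start=1):
        try:
            results.append((lineno, True, raw.decode(encoding)))
        except UnicodeDecodeError as exc:
            results.append((lineno, False, exc.reason))
    return results


if __name__ == "__main__":
    write_fixture()
    for lineno, ok, info in check_lines(BAD_GB):
        print(lineno, "valid" if ok else "INVALID", info)
```

Under this check only line 2 fails to decode: its last byte before the newline, `0xa3`, is a lead byte with no trail byte, which matches the `0xa3 0x0a` sequence in the error detail.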

What you think should happen instead

Only the second line is invalid, so only that line should be rejected and logged. The first and third lines should be returned according to the table definition.
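
The expected behavior can be modeled with a short Python sketch: split the input on newline bytes first, then decode and parse each line independently, and abort only when the number of rejected lines reaches the limit. This is a simplified model, not Cloudberry's implementation. The function name `load_with_reject_limit` is hypothetical, the "abort when rejects reach the limit" rule is my assumption about `SEGMENT REJECT LIMIT`, and `NULL ''` handling is left out.

```python
# Same 25 bytes as bad_gb.txt in this report.
DATA = bytes.fromhex(
    "31 2c ca c0 bd e7 0a"
    "32 2c c4 e3 ba c3 c2 f0 a3 0a"
    "33 2c 6e 69 68 61 6f 0a"
)


def load_with_reject_limit(data, reject_limit, encoding="gb18030"):
    """Model of per-line error isolation for a two-column (int, text) table.

    Returns (rows, rejects). Raises RuntimeError once the number of
    rejected lines reaches reject_limit (assumed semantics).
    """
    rows, rejects = [], []
    # Splitting on 0x0a before decoding is safe for GB18030, because
    # 0x0a is never a trail byte of a multi-byte character.
    for lineno, raw in enumerate(data.splitlines(), start=1):
        try:
            text = raw.decode(encoding)
            a, b = text.split(",", 1)
            rows.append((int(a), b))
        except (UnicodeDecodeError, ValueError) as exc:
            rejects.append((lineno, str(exc)))
            if len(rejects) >= reject_limit:
                raise RuntimeError("segment reject limit reached")
    return rows, rejects


if __name__ == "__main__":
    rows, rejects = load_with_reject_limit(DATA, reject_limit=2)
    print("rows:", rows)
    print("rejects:", rejects)
```

With a limit of 2, this model returns the rows from lines 1 and 3 and records a single reject for line 2, which is the result this report expects from both the external table and `COPY`.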

How to reproduce

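Repro (replace the `<ip>` and `<port>` placeholders in the LOCATION URL with the address of the HTTP server that serves bad_gb.txt):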

create  external web  table t3(a int, b text)
LOCATION ('http://<ip>:<port>/bad_gb.txt')
FORMAT 'TEXT' (DELIMITER ','  NULL '' )  ENCODING 'GB18030'
LOG ERRORS SEGMENT REJECT LIMIT 2;
select * from t3;


-- or
create temp table t0(a int, b text);
-- copy the file bad_gb.txt to /tmp
copy t0 from '/tmp/www/bad_gb.txt' with(encoding 'gb18030') log errors segment reject limit 2;

output:

gpadmin=# select * from t3;
ERROR:  segment reject limit reached, aborting operation  (seg0 slice1 127.0.1.1:7002 pid=2316762)
DETAIL:  Last error was: invalid byte sequence for encoding "GB18030": 0xa3 0x0a
CONTEXT:  External table t3, line 3 of file http://.../bad_gb.txt

-- or

gpadmin=# copy t0 from '/tmp/www/bad_gb.txt' with(encoding 'gb18030') log errors segment reject limit 2;
ERROR:  segment reject limit reached, aborting operation
DETAIL:  Last error was: invalid byte sequence for encoding "GB18030": 0xa3 0x0a, column a
CONTEXT:  COPY t0, line 2, column a: "1,世界"

Operating System

ubuntu 22.04

Anything else

No response

Are you willing to submit PR?

  • [ ] Yes, I am willing to submit a PR!

gfphoenix78, Nov 06 '25 03:11