[Bug] data queried from external table defined with error limit unexpected change
Cloudberry Database version
PostgreSQL 14.4 (Cloudberry Database 1.0.0+ed64982 build commit:ed64982034245e8b46926b75f349a4a5d5b8fd67)
What happened
Testrepo testcase failed:
ErrorLogTests.test_limit ... 644.98 ms ... FAIL
It is related to this PR, https://github.com/cloudberrydb/cloudberrydb/pull/320
DDL
CREATE EXTERNAL TABLE exttab_limit_1( i int, j text )
LOCATION ('gpfdist://localhost:8080/exttab_limit_1.tbl') FORMAT 'TEXT' (DELIMITER '|')
LOG ERRORS SEGMENT REJECT LIMIT 10;
CREATE EXTERNAL TABLE
-- Generate the file with lot of errors
\! python3 /code/cbdb_testrepo_src/mpp/gpdb/tests/queries/basic/exttab/errlog/sql/datagen.py 200 50 > /code/cbdb_testrepo_src/mpp/gpdb/tests/queries/basic/exttab/errlog/data//exttab_limit_2.tbl
-- reaches reject limit, use the same err table
CREATE EXTERNAL TABLE exttab_limit_2( i int, j text )
LOCATION ('gpfdist://localhost:8080/exttab_limit_2.tbl') FORMAT 'TEXT' (DELIMITER '|')
LOG ERRORS SEGMENT REJECT LIMIT 2;
CREATE EXTERNAL TABLE
-- Test: LIMIT queries without segment reject limit reached
-- Note that even though we use exttab_limit_2 here , the LIMIT 3 will not throw a segment reject limit error
-- order 0
Query expected result:
with cte1 as
(
SELECT e1.i, e2.j FROM exttab_limit_1 e1, exttab_limit_1 e2
WHERE e1.i = e2.i LIMIT 5
)
SELECT * FROM cte1, exttab_limit_2 e3 where cte1.i = e3.i LIMIT 3;
psql:/code/cbdb_testrepo_src/mpp/gpdb/tests/queries/basic/exttab/errlog/output/limit_planner.sql:33: NOTICE: found 5 data formatting errors (5 or more input rows), rejected related input data
i | j | i | j
---+----------+---+----------
0 | 0_number | 0 | 0_number
1 | 1_number | 1 | 1_number
5 | 5_number | 5 | 5_number
(3 rows)
Query actual result:
with cte1 as
(
SELECT e1.i, e2.j FROM exttab_limit_1 e1, exttab_limit_1 e2
WHERE e1.i = e2.i LIMIT 5
)
SELECT * FROM cte1, exttab_limit_2 e3 where cte1.i = e3.i LIMIT 3;
psql:/code/cbdb_testrepo_src/mpp/gpdb/tests/queries/basic/exttab/errlog/output/limit_planner.sql:33: NOTICE: found 5 data formatting errors (5 or more input rows), rejected related input data
i | j | i | j
---+----------+---+----------
0 | 0_number | 0 | 0_number
1 | 1_number | 1 | 1_number
4 | 4_number | 4 | 4_numbe
(3 rows)
What you think should happen instead
No response
How to reproduce
goto testrepo folder:
cd /code/cbdb_testrepo_src/
run testrepo testcase to reproduce:
/code/cbdb_testrepo_src/test_framework/tinc.py discover /code/cbdb_testrepo_src/mpp/gpdb/tests/queries/basic/exttab/errlog
Operating System
centos7
Anything else
No response
Are you willing to submit PR?
- [ ] Yes, I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct.
Offhand, I don't think it's a bug.
SELECT * FROM cte1, exttab_limit_2 e3 where cte1.i = e3.i LIMIT 3;
SQL standard doesn't guarantee the results with only LIMIT. The SQL doesn't have ORDER BY, if you have more than 3 rows, the results is uncertain with only LIMIT.
See https://www.postgresql.org/docs/14/queries-limit.html
When using LIMIT, it is important to use an ORDER BY clause that constrains the result rows into a unique order. Otherwise you will get an unpredictable subset of the query's rows.
Could you please check how many valid result rows are, excluding the rejected rows?
Hi @avamingli
1st external table: exttab_limit_1.tbl 10 rows with 2 errors 2nd external table: exttab_limit_2.tbl. 200 rows with 50 errors I will paste original data at bottom.
This case is special, simply adding an "ORDER BY" as following would cause an error.
with cte1 as
(
SELECT e1.i, e2.j FROM exttab_limit_1 e1, exttab_limit_1 e2
WHERE e1.i = e2.i LIMIT 5
)
SELECT * FROM cte1, exttab_limit_2 e3 where cte1.i = e3.i ORDER BY cte1.i DESC LIMIT 3;
psql:/code/cbdb_testrepo_src/mpp/gpdb/tests/queries/basic/exttab/errlog/output/limit_planner.sql:64: ERROR: segment reject limit reached, aborting operation (seg0 slice1 127.0.0.1:7002 pid=7273)
DETAIL: Last error was: invalid input syntax for type integer: "error_1", column i
CONTEXT: External table exttab_limit_2, line 11 of gpfdist://localhost:8081/exttab_limit_2.tbl, column i
I think involving sorting may touch more bad rows because the exttab_limit_2 table is defined with LOG ERRORS SEGMENT REJECT LIMIT 2.
I am not sure whether verifying it is necessary as the data result is not guaranteed.
exttab_limit_1.tbl
0|0_number
1|1_number
2|2_number
error_0
4|4_number
5|5_number
6|6_number
7|7_number
8|8_number
error_1
exttab_limit_2.tbl
0|0_number
1|1_number
error_0
3|3_number
4|4_number
5|5_number
6|6_number
7|7_number
8|8_number
9|9_number
error_1
11|11_number
12|12_number
13|13_number
14|14_number
15|15_number
16|16_number
17|17_number
18|18_number
19|19_number
20|20_number
error_2
22|22_number
error_3
24|24_number
25|25_number
26|26_number
error_4
28|28_number
29|29_number
error_5
31|31_number
error_6
33|33_number
error_7
error_8
error_9
error_10
38|38_number
39|39_number
40|40_number
41|41_number
42|42_number
43|43_number
44|44_number
error_11
46|46_number
error_12
48|48_number
49|49_number
error_13
51|51_number
52|52_number
53|53_number
54|54_number
error_14
error_15
57|57_number
58|58_number
59|59_number
60|60_number
61|61_number
error_16
63|63_number
64|64_number
65|65_number
66|66_number
67|67_number
68|68_number
error_17
70|70_number
71|71_number
72|72_number
error_18
error_19
75|75_number
76|76_number
77|77_number
78|78_number
error_20
80|80_number
81|81_number
82|82_number
83|83_number
84|84_number
85|85_number
error_21
87|87_number
error_22
89|89_number
error_23
91|91_number
92|92_number
error_24
94|94_number
error_25
96|96_number
97|97_number
98|98_number
99|99_number
100|100_number
101|101_number
102|102_number
103|103_number
104|104_number
105|105_number
106|106_number
107|107_number
108|108_number
109|109_number
error_26
111|111_number
error_27
113|113_number
error_28
error_29
116|116_number
error_30
118|118_number
119|119_number
error_31
121|121_number
error_32
123|123_number
124|124_number
125|125_number
126|126_number
127|127_number
128|128_number
error_33
130|130_number
131|131_number
132|132_number
error_34
134|134_number
error_35
136|136_number
137|137_number
138|138_number
139|139_number
140|140_number
141|141_number
142|142_number
143|143_number
error_36
145|145_number
146|146_number
147|147_number
148|148_number
error_37
150|150_number
151|151_number
error_38
error_39
154|154_number
155|155_number
156|156_number
157|157_number
158|158_number
159|159_number
160|160_number
error_40
error_41
163|163_number
error_42
165|165_number
166|166_number
167|167_number
error_43
169|169_number
170|170_number
171|171_number
172|172_number
173|173_number
error_44
175|175_number
176|176_number
177|177_number
178|178_number
179|179_number
180|180_number
181|181_number
182|182_number
183|183_number
184|184_number
185|185_number
error_45
187|187_number
188|188_number
189|189_number
190|190_number
191|191_number
error_46
193|193_number
194|194_number
error_47
196|196_number
error_48
198|198_number
error_49
Anyway, after looking into the case, actual result now makes more sense than before.
i | j | i | j
---+----------+---+----------
0 | 0_number | 0 | 0_number
1 | 1_number | 1 | 1_number
4 | 4_number | 4 | 4_numbe
(3 rows)
The test case is bad. I can add a sort to the query within CTE to make the result more stable. Even though in this way data result is not 100% guaranteed. If it failed again, we can figure out another solution. Please let me know if you have different thoughts. Thanks.
with cte1 as
(
SELECT e1.i, e2.j FROM exttab_limit_1 e1, exttab_limit_1 e2
WHERE e1.i = e2.i ORDER BY e1.i ASC LIMIT 5
)
SELECT * FROM cte1, exttab_limit_2 e3 where cte1.i = e3.i LIMIT 3;
Resolved it by adjusting test case.
ErrorLogTests.test_limit ... 675.91 ms ... ok