bug: failed to load gz tsv files into databend
Summary
CREATE OR REPLACE TABLE hackernews_1m
(
`id` BIGINT,
`deleted` TINYINT,
`type` String,
`author` String,
`timestamp` DateTime,
`comment` String,
`dead` TINYINT,
`parent` BIGINT,
`poll` BIGINT,
`children` Array(BIGINT),
`url` String,
`score` INT,
`title` String,
`parts` Array(INT),
`descendants` BIGINT
);
wget https://doris-build-1308700295.cos.ap-beijing.myqcloud.com/regression/index/hacknernews_1m.csv.gz
❯ curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m FILE_FORMAT = (type = TSV) ' -F 'upload=@"./hacknernews_1m.csv.gz"'
{"error":{"code":"400","message":"execute fail: Invalid value '\u001f�\b\b�\u000b�c\u0000\u0003hacknernews_1m.csv\u0000��K��:�%\b�o�\"\u0006p�L|�>��j�U3+;9K\u0002�\u0003[\u000b�GE�f�\u001d�$�\u0004A<S�OO�\u0013�������������' for column 0 (id Int64 NULL): invalid text for number\nat file 'hacknernews_1m.csv.gz', line 0"}}%
# works
❯ curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m FILE_FORMAT = (type = TSV) ' -F 'upload=@"./hacknernews_1m.csv"'
{"id":"00d5bae1-a3f8-4ce4-8dc2-c72998456002","state":"SUCCESS","stats":{"rows":2000000,"bytes":642061025},"error":null,"files":["hacknernews_1m.csv"]}%
Use (type = TSV compression=gzip), like in COPY:
curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m FILE_FORMAT = (type = TSV compression=gzip) ' -F 'upload=@"./hacknernews_1m.csv.gz"'
{"id":"ed17ccdc-2ea2-4fac-9ff9-97b0f61fd487","state":"SUCCESS","stats":{"rows":1000000,"bytes":130618406},"error":null,"files":["hacknernews_1m.csv.gz"]}%
There are two issues:
- wrong stats: "stats":{"rows":2000000,"bytes":642061025} reports twice the actual row count
- wrong compression detection here:

let compression = input_context
    .get_compression_alg(&filename)
    .map_err(BadRequest)?;

The compression is wrong: the default compression is none, not auto, so the .gz extension is never used to pick a decompressor.
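For context, a minimal Rust sketch of the intended behavior (type and function names are illustrative, not Databend's actual code): auto falls back to the file extension, while none skips detection entirely, which is how a .gz upload ends up being fed straight to the TSV parser.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Compression {
    None,
    Auto,
    Gzip,
}

// Illustrative only: with `auto`, the ".gz" suffix selects gzip decompression;
// with the streaming_load default of `none`, the compressed bytes go straight
// to the TSV parser, producing the "invalid text for number" error above.
fn get_compression_alg(default: Compression, filename: &str) -> Compression {
    match default {
        Compression::Auto if filename.ends_with(".gz") => Compression::Gzip,
        Compression::Auto => Compression::None,
        other => other,
    }
}

fn main() {
    assert_eq!(get_compression_alg(Compression::Auto, "hacknernews_1m.csv.gz"), Compression::Gzip);
    // The reported bug: the default `none` ignores the extension entirely.
    assert_eq!(get_compression_alg(Compression::None, "hacknernews_1m.csv.gz"), Compression::None);
}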
The stats are right on my Mac. The bytes you get are different too; maybe you have a different hacknernews_1m.csv?
(venv) ➜ test git:(stage2) ls -lh hacknernews_1m.csv*
-rw-r--r-- 1 yangxiufeng staff 342M Sep 3 2023 hacknernews_1m.csv
-rw-r--r-- 1 yangxiufeng staff 125M Apr 30 16:42 hacknernews_1m.csv.gz
(venv) ➜ test git:(stage2) curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m FILE_FORMAT = (type = TSV compression=gzip) ' -F 'upload=@"./hacknernews_1m.csv.gz"'
{"id":"c0766b8a-77c5-477c-9a48-01e09a895ad1","state":"SUCCESS","stats":{"rows":1000000,"bytes":130618406},"error":null,"files":["hacknernews_1m.csv.gz"]}%
(venv) ➜ test git:(stage2) curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m FILE_FORMAT = (type = TSV compression=none) ' -F 'upload=@"./hacknernews_1m.csv"'
{"id":"08b3d058-13ab-4303-8925-e40ac971a2dd","state":"SUCCESS","stats":{"rows":1000000,"bytes":358551613},"error":null,"files":["hacknernews_1m.csv"]}% (venv) ➜ test git:(stage2) curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m FILE_FORMAT = (type = TSV compression=none) ' -F 'upload=@"./hacknernews_1m.csv"'
{"id":"6b5a826b-ec49-47e0-a2ba-75bf372dd36f","state":"SUCCESS","stats":{"rows":1000000,"bytes":358551613},"error":null,"files":["hacknernews_1m.csv"]}%
❯ curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m FILE_FORMAT = (type = TSV) ' -F 'upload=@"./hacknernews_1m.csv"'
{"id":"546943ab-f100-4a39-aee8-adca6ca27596","state":"SUCCESS","stats":{"rows":2000000,"bytes":642061026},"error":null,"files":["hacknernews_1m.csv"]}%
❯ wc -l ./hacknernews_1m.csv
1000000 ./hacknernews_1m.csv
❯ ls -lh hacknernews_1m.csv*
-rw-r--r-- 1 sundy sundy 342M Sep 3 2023 hacknernews_1m.csv
Hi @sundy-li,
I am interested in working on this issue. If it is still open, could you please assign it to me?
Hi @sundy-li,
I need your guidance on an issue with the streaming_load feature. In the latest codebase, it's not enabled by default, and the enable_streaming_load flag needs to be set. I can successfully stream both hackernews_1m.csv and hackernews_1m.csv.gz files. Could you advise on what else needs to be checked to close this issue?
Working context:
curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' \
-H 'insert_sql: insert into hackernews_1m FILE_FORMAT = (type = TSV)' \
-H 'enable_streaming_load: 1' \
-F 'upload=@"./hacknernews_1m.csv"'
{"id":"10ba0a4e-3fcf-4f0a-b66a-0abca31c192b","state":"SUCCESS","stats":{"rows":1000000,"bytes":358551613},"error":null,"files":["hacknernews_1m.csv"]}%
curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' \
-H 'insert_sql: insert into hackernews_1m FILE_FORMAT = (type = TSV compression=gzip)' \
-H 'enable_streaming_load: 1' \
-F 'upload=@"./bup_hacknernews_1m.csv.gz"'
{"id":"e2cc39e6-a59b-4c96-b2fb-3b1d57a15c36","state":"SUCCESS","stats":{"rows":1000000,"bytes":130618406},"error":null,"files":["bup_hacknernews_1m.csv.gz"]}%
Error context:
» curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' \
-H 'insert_sql: insert into hackernews_1m FILE_FORMAT = (type = TSV)' \
-F 'upload="@./hackernews_1m.csv"'
{"error":{"code":"405","message":"Streaming load will be removed soon. In this version, if you still need it, set enable_streaming_load=1 in HTTP header."}}%
» curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' \
-H 'insert_sql: insert into hackernews_1m FILE_FORMAT = (type = TSV compression=gzip)' \
-F 'upload=@"./hacknernews_1m.csv.gz"'
{"error":{"code":"405","message":"Streaming load will be removed soon. In this version, if you still need it, set enable_streaming_load=1 in HTTP header."}}%
Since streaming_load may be deprecated soon, this should not be an issue anymore.