[improve](simd-json-reader) fix simd json reader lose data and support stream parser

Open sollhui opened this issue 1 year ago • 63 comments

Proposed changes

When loading JSON without setting read_json_by_line, only one JSON object is loaded. The input contains more than one JSON object, so data is lost during the load.
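
To illustrate the kind of data loss described above, here is a minimal Python sketch. It is not the Doris simdjson reader; the function name `iter_json_documents` and the sample payload are assumptions made up for this example. It contrasts decoding only the first document in a buffer with a stream-style loop that keeps decoding until the buffer is exhausted.

```python
import json


def iter_json_documents(text):
    """Yield every top-level JSON document found in `text`.

    Documents may be separated by any amount of whitespace
    (spaces, newlines) or by nothing at all.
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        # raw_decode returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical payload: three JSON objects, not one per line.
payload = '{"id": 1} {"id": 2}\n{"id": 3}'

# Parsing a single document stops after the first object.
first_only = json.JSONDecoder().raw_decode(payload)[0]

# Stream-style parsing recovers all of them.
all_docs = list(iter_json_documents(payload))

print(first_only)
print(len(all_docs))
```

A reader that stops after the first `raw_decode` call silently drops the remaining objects, which is the symptom reported in this PR.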

sollhui avatar Nov 16 '23 13:11 sollhui

run buildall

sollhui avatar Nov 16 '23 13:11 sollhui

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Nov 16 '23 14:11 github-actions[bot]

TeamCity be ut coverage result:
Function Coverage: 36.75% (8410/22883)
Line Coverage: 29.27% (68395/233666)
Region Coverage: 27.86% (35344/126882)
Branch Coverage: 24.62% (18064/73366)
Coverage Report: http://coverage.selectdb-in.cc/coverage/8b468bcb5170c80a8af43123c05a08d02e27b480_8b468bcb5170c80a8af43123c05a08d02e27b480/report/index.html

doris-robot avatar Nov 16 '23 15:11 doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit 8b468bcb5170c80a8af43123c05a08d02e27b480, data reload: false

run tpch-sf100 query with default conf and session variables
q1	4950	4696	4656	4656
q2	353	161	159	159
q3	2018	1897	1876	1876
q4	1392	1263	1229	1229
q5	3951	3939	4090	3939
q6	260	128	130	128
q7	1394	876	879	876
q8	2781	2787	2763	2763
q9	9698	9648	9630	9630
q10	3473	3506	3540	3506
q11	370	252	248	248
q12	438	295	303	295
q13	4545	3823	3815	3815
q14	336	290	288	288
q15	587	556	521	521
q16	663	581	576	576
q17	1133	962	955	955
q18	7718	7304	7427	7304
q19	1676	1682	1682	1682
q20	529	315	298	298
q21	4456	3949	4018	3949
q22	481	366	367	366
Total cold run time: 53202 ms
Total hot run time: 49059 ms
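
The result rows above carry no column header. The sketch below assumes the columns are query, cold run, hot run 1, hot run 2, and best hot run (all in ms); that reading is an inference from the numbers, not something the bot states. Under that assumption, the cold total is the sum of the first timing column and the hot total is the sum of the per-query minimum of the two hot runs. The helper name `summarize` is made up for this example.

```python
# Rows copied from the first result table above.
# Assumed columns: query, cold run, hot run 1, hot run 2, best hot (ms).
ROWS = """\
q1 4950 4696 4656 4656
q2 353 161 159 159
q3 2018 1897 1876 1876
q4 1392 1263 1229 1229
q5 3951 3939 4090 3939
q6 260 128 130 128
q7 1394 876 879 876
q8 2781 2787 2763 2763
q9 9698 9648 9630 9630
q10 3473 3506 3540 3506
q11 370 252 248 248
q12 438 295 303 295
q13 4545 3823 3815 3815
q14 336 290 288 288
q15 587 556 521 521
q16 663 581 576 576
q17 1133 962 955 955
q18 7718 7304 7427 7304
q19 1676 1682 1682 1682
q20 529 315 298 298
q21 4456 3949 4018 3949
q22 481 366 367 366
"""


def summarize(rows_text):
    """Return (cold_total_ms, hot_total_ms) for a block of result rows."""
    cold_total = 0
    hot_total = 0
    for line in rows_text.splitlines():
        if not line.strip():
            continue
        name, *timings = line.split()
        cold, hot1, hot2, best = (int(t) for t in timings)
        # The last column should be the faster of the two hot runs.
        assert best == min(hot1, hot2), name
        cold_total += cold
        hot_total += best
    return cold_total, hot_total


cold_total, hot_total = summarize(ROWS)
print(f"Total cold run time: {cold_total} ms")
print(f"Total hot run time: {hot_total} ms")
```

The computed totals agree with the two "Total" lines reported for this table, which supports the assumed column meaning.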

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	4625	4608	4566	4566
q2	335	248	276	248
q3	4046	4009	3998	3998
q4	2721	2706	2714	2706
q5	9712	9671	9686	9671
q6	251	123	124	123
q7	2601	2311	2238	2238
q8	4421	4438	4425	4425
q9	13196	13105	13188	13105
q10	4083	4191	4195	4191
q11	811	662	653	653
q12	985	805	807	805
q13	4309	3562	3568	3562
q14	390	361	366	361
q15	566	522	519	519
q16	758	663	658	658
q17	3849	3849	3787	3787
q18	9445	8991	8998	8991
q19	1831	1768	1773	1768
q20	2401	2049	2028	2028
q21	8943	8536	8594	8536
q22	880	825	810	810
Total cold run time: 81159 ms
Total hot run time: 77749 ms

doris-robot avatar Nov 16 '23 15:11 doris-robot

run buildall

sollhui avatar Nov 17 '23 08:11 sollhui

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Nov 17 '23 08:11 github-actions[bot]

TeamCity be ut coverage result:
Function Coverage: 36.74% (8413/22898)
Line Coverage: 29.26% (68434/233898)
Region Coverage: 27.84% (35375/127060)
Branch Coverage: 24.60% (18070/73468)
Coverage Report: http://coverage.selectdb-in.cc/coverage/8e38c25e0d50b56a352a12d00caf3fe542e6492a_8e38c25e0d50b56a352a12d00caf3fe542e6492a/report/index.html

doris-robot avatar Nov 17 '23 09:11 doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit 8e38c25e0d50b56a352a12d00caf3fe542e6492a, data reload: false

run tpch-sf100 query with default conf and session variables
q1	4971	4693	4679	4679
q2	364	152	158	152
q3	2036	1946	1883	1883
q4	1377	1249	1273	1249
q5	3983	3972	4028	3972
q6	243	134	136	134
q7	1379	853	879	853
q8	2778	2829	2797	2797
q9	9837	9793	9658	9658
q10	3447	3534	3535	3534
q11	371	250	251	250
q12	440	295	295	295
q13	4564	3799	3775	3775
q14	323	294	298	294
q15	586	529	524	524
q16	672	590	585	585
q17	1158	945	940	940
q18	7895	7425	7339	7339
q19	1681	1679	1674	1674
q20	522	311	298	298
q21	4435	3990	3965	3965
q22	483	382	375	375
Total cold run time: 53545 ms
Total hot run time: 49225 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	4613	4585	4594	4585
q2	347	220	252	220
q3	4043	3994	4007	3994
q4	2714	2714	2693	2693
q5	9739	9594	9648	9594
q6	242	123	125	123
q7	2588	2216	2290	2216
q8	4465	4480	4489	4480
q9	13222	13132	13072	13072
q10	4090	4203	4226	4203
q11	788	639	681	639
q12	977	806	813	806
q13	4282	3553	3585	3553
q14	384	349	348	348
q15	569	520	533	520
q16	746	660	649	649
q17	3854	3858	3804	3804
q18	9658	9026	9139	9026
q19	1824	1800	1785	1785
q20	2401	2067	2067	2067
q21	8815	8419	8434	8419
q22	928	812	824	812
Total cold run time: 81289 ms
Total hot run time: 77608 ms

doris-robot avatar Nov 17 '23 09:11 doris-robot

(From new machine) TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.52 seconds
stream load tsv: 576 seconds loaded 74807831229 Bytes, about 123 MB/s
stream load json: 18 seconds loaded 2358488459 Bytes, about 124 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.6 seconds inserted 10000000 Rows, about 349K ops/s
storage size: 17099573475 Bytes
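
The "MB/s" figures in the report above can be reproduced from the byte counts and durations. The sketch below assumes 1 MB means 1024 * 1024 bytes and that the result is truncated to an integer; both are inferences from the numbers, since the bot's own script is not shown here. The function name `throughput_mib_per_s` is made up for this example.

```python
MIB = 1024 * 1024  # assumption: the report's "MB" is 2**20 bytes


def throughput_mib_per_s(total_bytes, seconds):
    """Bytes per second expressed in MiB/s, truncated to an integer."""
    return total_bytes // (seconds * MIB)


# (bytes loaded, seconds) copied from the clickbench report above.
loads = {
    "tsv": (74807831229, 576),
    "json": (2358488459, 18),
    "orc": (1101869774, 65),
    "parquet": (861443392, 32),
}

rates = {name: throughput_mib_per_s(b, s) for name, (b, s) in loads.items()}

for name, rate in rates.items():
    print(f"stream load {name}: about {rate} MB/s")
```

With decimal megabytes (10**6 bytes) the tsv load would come out near 129 MB/s instead of 123, so the binary interpretation is the one that matches the reported values.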

doris-robot avatar Nov 17 '23 09:11 doris-robot

run buildall

sollhui avatar Nov 17 '23 14:11 sollhui

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Nov 17 '23 14:11 github-actions[bot]

TeamCity be ut coverage result:
Function Coverage: 36.76% (8429/22927)
Line Coverage: 29.28% (68593/234291)
Region Coverage: 27.86% (35462/127264)
Branch Coverage: 24.59% (18096/73580)
Coverage Report: http://coverage.selectdb-in.cc/coverage/065080943cb0bc531821fc96d5718aa1f8ccfb04_065080943cb0bc531821fc96d5718aa1f8ccfb04/report/index.html

doris-robot avatar Nov 17 '23 15:11 doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.78% (8432/22927)
Line Coverage: 29.28% (68602/234291)
Region Coverage: 27.86% (35459/127264)
Branch Coverage: 24.59% (18094/73580)
Coverage Report: http://coverage.selectdb-in.cc/coverage/6879db1185649e1f28f7933c39301ffd78cdffeb_6879db1185649e1f28f7933c39301ffd78cdffeb/report/index.html

doris-robot avatar Nov 17 '23 15:11 doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit 6879db1185649e1f28f7933c39301ffd78cdffeb, data reload: false

run tpch-sf100 query with default conf and session variables
q1	4966	4698	4707	4698
q2	362	145	158	145
q3	2037	1899	1891	1891
q4	1381	1235	1230	1230
q5	3973	3866	3960	3866
q6	250	133	132	132
q7	1429	890	890	890
q8	2774	2782	2765	2765
q9	9523	19371	9494	9494
q10	3440	3492	3498	3492
q11	379	242	250	242
q12	450	291	297	291
q13	4553	3838	3776	3776
q14	309	294	285	285
q15	574	528	532	528
q16	664	586	582	582
q17	1137	982	921	921
q18	7810	7356	7445	7356
q19	1694	1693	1674	1674
q20	523	319	286	286
q21	4457	4000	4046	4000
q22	480	368	377	368
Total cold run time: 53165 ms
Total hot run time: 48912 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	4569	4556	4620	4556
q2	345	224	263	224
q3	4014	4025	3990	3990
q4	2714	2711	2709	2709
q5	9629	9575	9652	9575
q6	245	121	124	121
q7	3036	2509	2491	2491
q8	4478	4442	4439	4439
q9	12865	12824	12753	12753
q10	4070	4182	4150	4150
q11	777	721	651	651
q12	982	815	824	815
q13	4307	3545	3523	3523
q14	373	370	356	356
q15	575	517	518	517
q16	731	664	659	659
q17	3893	3885	3838	3838
q18	9626	9053	9092	9053
q19	1828	1750	1765	1750
q20	2373	2054	2019	2019
q21	8713	8762	8614	8614
q22	874	759	746	746
Total cold run time: 81017 ms
Total hot run time: 77549 ms

doris-robot avatar Nov 17 '23 16:11 doris-robot

(From new machine) TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.12 seconds
stream load tsv: 579 seconds loaded 74807831229 Bytes, about 123 MB/s
stream load json: 18 seconds loaded 2358488459 Bytes, about 124 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 33 seconds loaded 861443392 Bytes, about 24 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17101511571 Bytes

doris-robot avatar Nov 17 '23 16:11 doris-robot

(From new machine) TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.87 seconds
stream load tsv: 580 seconds loaded 74807831229 Bytes, about 123 MB/s
stream load json: 18 seconds loaded 2358488459 Bytes, about 124 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 34 seconds loaded 861443392 Bytes, about 24 MB/s
insert into select: 28.0 seconds inserted 10000000 Rows, about 357K ops/s
storage size: 17100227694 Bytes

doris-robot avatar Nov 17 '23 16:11 doris-robot

run buildall

sollhui avatar Nov 18 '23 02:11 sollhui

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Nov 18 '23 02:11 github-actions[bot]

TeamCity be ut coverage result:
Function Coverage: 36.75% (8432/22946)
Line Coverage: 29.25% (68587/234499)
Region Coverage: 27.86% (35458/127294)
Branch Coverage: 24.58% (18089/73578)
Coverage Report: http://coverage.selectdb-in.cc/coverage/8e32fad60e763d2c0ae7f22c6fef0e314ea63382_8e32fad60e763d2c0ae7f22c6fef0e314ea63382/report/index.html

doris-robot avatar Nov 18 '23 03:11 doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit 8e32fad60e763d2c0ae7f22c6fef0e314ea63382, data reload: false

run tpch-sf100 query with default conf and session variables
q1	4922	4651	4651	4651
q2	372	142	168	142
q3	2032	1879	1933	1879
q4	1380	1298	1275	1275
q5	3962	3951	4052	3951
q6	253	129	133	129
q7	1439	887	893	887
q8	2793	2789	2770	2770
q9	9725	9432	9607	9432
q10	3475	3495	3525	3495
q11	372	249	238	238
q12	441	292	292	292
q13	4580	3844	3778	3778
q14	326	298	292	292
q15	567	527	528	527
q16	672	597	586	586
q17	1133	969	961	961
q18	7899	7416	7386	7386
q19	1683	1680	1651	1651
q20	543	303	290	290
q21	4450	4018	4036	4018
q22	480	364	367	364
Total cold run time: 53499 ms
Total hot run time: 48994 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	4576	4602	4543	4543
q2	337	231	233	231
q3	4038	4042	4017	4017
q4	2715	2714	2703	2703
q5	9578	9599	9562	9562
q6	247	130	125	125
q7	3027	2500	2502	2500
q8	4410	4421	4445	4421
q9	12935	12927	12899	12899
q10	4063	4159	4136	4136
q11	803	643	728	643
q12	970	831	816	816
q13	4309	3595	3528	3528
q14	384	345	346	345
q15	577	523	522	522
q16	732	671	669	669
q17	3960	3829	3859	3829
q18	9536	8990	9152	8990
q19	1787	1760	1785	1760
q20	2402	2059	2047	2047
q21	8953	8581	8726	8581
q22	875	837	774	774
Total cold run time: 81214 ms
Total hot run time: 77641 ms

doris-robot avatar Nov 18 '23 04:11 doris-robot

(From new machine) TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 44.76 seconds
stream load tsv: 563 seconds loaded 74807831229 Bytes, about 126 MB/s
stream load json: 18 seconds loaded 2358488459 Bytes, about 124 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17098936913 Bytes

doris-robot avatar Nov 18 '23 04:11 doris-robot

PR approved by at least one committer and no changes requested.

github-actions[bot] avatar Nov 27 '23 02:11 github-actions[bot]

PR approved by anyone and no changes requested.

github-actions[bot] avatar Nov 27 '23 02:11 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Dec 04 '23 02:12 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Dec 04 '23 07:12 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Dec 11 '23 08:12 github-actions[bot]

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar Dec 14 '23 12:12 github-actions[bot]