Only apply predicate for the first row in specific time series when filtering record batch

Open Rachelint opened this issue 2 years ago • 0 comments

Describe This Problem

Data organization

When using the primary key tsid + timestmap, the data will be ordered like:

tag1, tag2, tag3, timestamp1, field1 
tag1, tag2, tag3, timestamp2, field2
tag1, tag2, tag3, timestamp3, field3
tag1, tag2, tag3, timestamp4, field4
...

Filter in memory

However the order information is not used when filtering the pulled record batch. We just simply apply the predicate to the record batch, that means something like tag1, tag2, tag3 is possible to compute multiple times.

We just do a simple experiment to estimate the cost about duplicated computation above:

Schema: a: int32, b: string, c: dictionary
A big set has multiple duplicated rows, a small set has just one row
Predicates: int32 eq(for a), string eq(for b,c), string like(for b,c), string regex(for b,c)
Predicate 10000 times.

Experimetn results(see detail in Additional Context):

As we can see, duplicated computation about strings become much slower when big / set >= 500.
For eq/not eq case, dict has better performance compared to string(but worse than string's small set).
For like/not like/regex case, dict has worse performance compared to string.

It is easy to understand:

For eq/not eq case, strings comparison changed to int8s comparison when using dict format, that led to a better performance.

original expr:c = Utf8("100")
rewritten expr:c = CAST(Utf8("100") AS Dictionary(Int8, Utf8))

However, for like/not like/regex case, we can't do so and need to replace the dict idx with real word before comparing.

original expr:c LIKE Utf8("10%")
rewritten expr:CAST(c AS Utf8) LIKE Utf8("10%")

Conclusion

Maybe we should design to remove the dupliacte computation taking advantage of the order(especially for string type column).

Proposal

Additional Context

# big set / small set = 10:
## 1.big set / small set in each case
### a@0 = 100
big cost:540 micro
small cost:444 micro
big / small: 1.2162162
### a@0 != 100
big cost:486 micro
small cost:443 micro
big / small: 1.0970654
### b@1 = 100
big cost:596 micro
small cost:509 micro
big / small: 1.1709234
### b@1 != 100
big cost:531 micro
small cost:463 micro
big / small: 1.1468682
### b@1 LIKE 10%
big cost:404 micro
small cost:340 micro
big / small: 1.1882353
### b@1 NOT LIKE 10%
big cost:343 micro
small cost:295 micro
big / small: 1.1627119
### regexpmatch(b@1, 1*0{2}[a|b|c]*)
big cost:94968 micro
small cost:56659 micro
big / small: 1.6761327
### c@2 = CAST(100 AS Dictionary(Int8, Utf8))
big cost:4226 micro
small cost:4006 micro
big / small: 1.0549176
### c@2 != CAST(100 AS Dictionary(Int8, Utf8))
big cost:3959 micro
small cost:3820 micro
big / small: 1.0363874
### CAST(c@2 AS Utf8) LIKE 10%
big cost:1659 micro
small cost:1449 micro
big / small: 1.1449275
### CAST(c@2 AS Utf8) NOT LIKE 10%
big cost:1534 micro
small cost:1613 micro
big / small: 0.9510229
### regexpmatch(CAST(c@2 AS Utf8), 1*0{2}[a|b|c]*)
big cost:95559 micro
small cost:56795 micro
big / small: 1.6825249
## 2.dict / string in cases
case_eq small dict / string:7.870334
case_eq big dict / string:7.090604
case_not_eq small dict / string:8.25054
case_not_eq big dict / string:7.455744
case_like small dict / string:4.2617645
case_like big dict / string:4.106436
case_not_like small dict / string:5.467797
case_not_like big dict / string:4.4723034
case_regex small dict / string:1.0024003
case_regex big dict / string:1.0062232

# big set / small set = 100:
## 1.big set / small set in each case
### a@0 = 100
big cost:510 micro
small cost:458 micro
big / small: 1.1135371
### a@0 != 100
big cost:469 micro
small cost:454 micro
big / small: 1.0330397
### b@1 = 100
big cost:972 micro
small cost:485 micro
big / small: 2.0041237
### b@1 != 100
big cost:982 micro
small cost:470 micro
big / small: 2.0893617
### b@1 LIKE 10%
big cost:852 micro
small cost:374 micro
big / small: 2.278075
### b@1 NOT LIKE 10%
big cost:841 micro
small cost:366 micro
big / small: 2.2978141
### regexpmatch(b@1, 1*0{2}[a|b|c]*)
big cost:455702 micro
small cost:59480 micro
big / small: 7.6614323
### c@2 = CAST(100 AS Dictionary(Int8, Utf8))
big cost:4154 micro
small cost:3857 micro
big / small: 1.0770029
### c@2 != CAST(100 AS Dictionary(Int8, Utf8))
big cost:4130 micro
small cost:4077 micro
big / small: 1.0129998
### CAST(c@2 AS Utf8) LIKE 10%
big cost:2928 micro
small cost:1637 micro
big / small: 1.7886378
### CAST(c@2 AS Utf8) NOT LIKE 10%
big cost:2924 micro
small cost:1579 micro
big / small: 1.851805
### regexpmatch(CAST(c@2 AS Utf8), 1*0{2}[a|b|c]*)
big cost:457516 micro
small cost:56681 micro
big / small: 8.07177
## 2.dict / string in cases
case_eq small dict / string:7.952577
case_eq big dict / string:4.2736626
case_not_eq small dict / string:8.674468
case_not_eq big dict / string:4.205703
case_like small dict / string:4.3770056
case_like big dict / string:3.4366198
case_not_like small dict / string:4.3142076
case_not_like big dict / string:3.4768133
case_regex small dict / string:0.9529422
case_regex big dict / string:1.0039806

# big set / small set = 500:
## 1.big set / small set in each case
### a@0 = 100
big cost:633 micro
small cost:472 micro
big / small: 1.3411016
### a@0 != 100
big cost:545 micro
small cost:453 micro
big / small: 1.2030905
### b@1 = 100
big cost:3008 micro
small cost:464 micro
big / small: 6.4827585
### b@1 != 100
big cost:3013 micro
small cost:466 micro
big / small: 6.4656653
### b@1 LIKE 10%
big cost:2763 micro
small cost:346 micro
big / small: 7.985549
### b@1 NOT LIKE 10%
big cost:2688 micro
small cost:302 micro
big / small: 8.900662
### regexpmatch(b@1, 1*0{2}[a|b|c]*)
big cost:1963855 micro
small cost:54656 micro
big / small: 35.931187
### c@2 = CAST(100 AS Dictionary(Int8, Utf8))
big cost:5112 micro
small cost:3817 micro
big / small: 1.3392717
### c@2 != CAST(100 AS Dictionary(Int8, Utf8))
big cost:4040 micro
small cost:3785 micro
big / small: 1.0673712
### CAST(c@2 AS Utf8) LIKE 10%
big cost:7107 micro
small cost:1530 micro
big / small: 4.645098
### CAST(c@2 AS Utf8) NOT LIKE 10%
big cost:6704 micro
small cost:1471 micro
big / small: 4.557444
### regexpmatch(CAST(c@2 AS Utf8), 1*0{2}[a|b|c]*)
big cost:1908174 micro
small cost:56650 micro
big / small: 33.683567
## 2.dict / string in cases
case_eq small dict / string:8.226294
case_eq big dict / string:1.6994681
case_not_eq small dict / string:8.122317
case_not_eq big dict / string:1.3408563
case_like small dict / string:4.421965
case_like big dict / string:2.572204
case_not_like small dict / string:4.870861
case_not_like big dict / string:2.4940476
case_regex small dict / string:1.0364827
case_regex big dict / string:0.9716471

# big set / small set = 1000:
## 1.big set / small set in each case
### a@0 = 100
big cost:668 micro
small cost:470 micro
big / small: 1.4212766
### a@0 != 100
big cost:606 micro
small cost:470 micro
big / small: 1.2893617
### b@1 = 100
big cost:5916 micro
small cost:478 micro
big / small: 12.376569
### b@1 != 100
big cost:5578 micro
small cost:494 micro
big / small: 11.291498
### b@1 LIKE 10%
big cost:5260 micro
small cost:372 micro
big / small: 14.139785
### b@1 NOT LIKE 10%
big cost:5252 micro
small cost:291 micro
big / small: 18.04811
### regexpmatch(b@1, 1*0{2}[a|b|c]*)
big cost:3802702 micro
small cost:54614 micro
big / small: 69.6287
### c@2 = CAST(100 AS Dictionary(Int8, Utf8))
big cost:6062 micro
small cost:3875 micro
big / small: 1.5643871
### c@2 != CAST(100 AS Dictionary(Int8, Utf8))
big cost:4476 micro
small cost:3847 micro
big / small: 1.163504
### CAST(c@2 AS Utf8) LIKE 10%
big cost:11937 micro
small cost:1502 micro
big / small: 7.9474034
### CAST(c@2 AS Utf8) NOT LIKE 10%
big cost:12094 micro
small cost:1435 micro
big / small: 8.427875
### regexpmatch(CAST(c@2 AS Utf8), 1*0{2}[a|b|c]*)
big cost:3786914 micro
small cost:57731 micro
big / small: 65.59585
## 2.dict / string in cases
case_eq small dict / string:8.106694
case_eq big dict / string:1.0246788
case_not_eq small dict / string:7.7874494
case_not_eq big dict / string:0.80243814
case_like small dict / string:4.0376344
case_like big dict / string:2.2693915
case_not_like small dict / string:4.9312716
case_not_like big dict / string:2.3027418
case_regex small dict / string:1.0570732
case_regex big dict / string:0.99584824

# big set / small set = 4000:
## 1.big set / small set in each case
### a@0 = 100
big cost:1183 micro
small cost:468 micro
big / small: 2.5277777
### a@0 != 100
big cost:1053 micro
small cost:466 micro
big / small: 2.2596567
### b@1 = 100
big cost:21375 micro
small cost:582 micro
big / small: 36.726803
### b@1 != 100
big cost:21263 micro
small cost:482 micro
big / small: 44.11411
### b@1 LIKE 10%
big cost:19797 micro
small cost:355 micro
big / small: 55.766197
### b@1 NOT LIKE 10%
big cost:19641 micro
small cost:303 micro
big / small: 64.821785
### regexpmatch(b@1, 1*0{2}[a|b|c]*)
big cost:14936214 micro
small cost:56598 micro
big / small: 263.90002
### c@2 = CAST(100 AS Dictionary(Int8, Utf8))
big cost:12052 micro
small cost:3855 micro
big / small: 3.1263294
### c@2 != CAST(100 AS Dictionary(Int8, Utf8))
big cost:6440 micro
small cost:3697 micro
big / small: 1.7419529
### CAST(c@2 AS Utf8) LIKE 10%
big cost:43414 micro
small cost:1550 micro
big / small: 28.009033
### CAST(c@2 AS Utf8) NOT LIKE 10%
big cost:43047 micro
small cost:1564 micro
big / small: 27.523657
### regexpmatch(CAST(c@2 AS Utf8), 1*0{2}[a|b|c]*)
big cost:14997344 micro
small cost:57108 micro
big / small: 262.6137
## 2.dict / string in cases
case_eq small dict / string:6.623711
case_eq big dict / string:0.5638363
case_not_eq small dict / string:7.6701245
case_not_eq big dict / string:0.30287352
case_like small dict / string:4.366197
case_like big dict / string:2.1929586
case_not_like small dict / string:5.161716
case_not_like big dict / string:2.191691
case_regex small dict / string:1.0090109
case_regex big dict / string:1.0040927

# big set / small set = 8000:
## 1.big set / small set in each case
### a@0 = 100
big cost:1499 micro
small cost:477 micro
big / small: 3.1425576
### a@0 != 100
big cost:1544 micro
small cost:484 micro
big / small: 3.1900826
### b@1 = 100
big cost:41883 micro
small cost:499 micro
big / small: 83.93387
### b@1 != 100
big cost:41595 micro
small cost:492 micro
big / small: 84.54269
### b@1 LIKE 10%
big cost:38935 micro
small cost:359 micro
big / small: 108.45404
### b@1 NOT LIKE 10%
big cost:38706 micro
small cost:338 micro
big / small: 114.51479
### regexpmatch(b@1, 1*0{2}[a|b|c]*)
big cost:29740878 micro
small cost:55639 micro
big / small: 534.53296
### c@2 = CAST(100 AS Dictionary(Int8, Utf8))
big cost:20559 micro
small cost:4004 micro
big / small: 5.1346154
### c@2 != CAST(100 AS Dictionary(Int8, Utf8))
big cost:9026 micro
small cost:4060 micro
big / small: 2.2231526
### CAST(c@2 AS Utf8) LIKE 10%
big cost:84157 micro
small cost:1520 micro
big / small: 55.366447
### CAST(c@2 AS Utf8) NOT LIKE 10%
big cost:84709 micro
small cost:1469 micro
big / small: 57.6644
### regexpmatch(CAST(c@2 AS Utf8), 1*0{2}[a|b|c]*)
big cost:29849912 micro
small cost:56893 micro
big / small: 524.66754
## 2.dict / string in cases
case_eq small dict / string:8.024048
case_eq big dict / string:0.4908674
case_not_eq small dict / string:8.252032
case_not_eq big dict / string:0.21699724
case_like small dict / string:4.2339835
case_like big dict / string:2.1614742
case_not_like small dict / string:4.3461537
case_not_like big dict / string:2.1885238
case_regex small dict / string:1.0225382
case_regex big dict / string:1.0036662

No response

Jun 02 '23 03:06 Rachelint