starrocks icon indicating copy to clipboard operation
starrocks copied to clipboard

Shared-data cluster Query Profile Metric Problems with accuracy

Open kaka-zb opened this issue 1 year ago • 4 comments

Backgrounds

We tried to use sr replace elasticsearch to build the logging platform, and then during the testing process, we found that the query latency was very high, so we used query profile to locate the problem.

Problem description

image image

With this query profile result, it looks like there's a data skew problem,but there is something wrong with the metric data.

According to the sr document, InstanceNum means Number of all FragmentInstances for this Fragment, so the number of fragmentInstances for this Fragment is 6, but the iotime metrics above seems to be not correct, max iotime divided by 6 is also much larger than the average iotime.

StarRocks version

  • 3.2.6-2585333 (shared-data StarRocks cluster)

kaka-zb avatar May 15 '24 09:05 kaka-zb

query_profile.txt

kaka-zb avatar May 15 '24 09:05 kaka-zb

instance是6,pipeline还有还有自己的并行度:

Pipeline (id=0):
         - ActiveTime: 178.376ms
           - __MAX_OF_ActiveTime: 742.777ms
           - __MIN_OF_ActiveTime: 1.555ms
         - BlockByInputEmpty: 2.810K (2810)
           - __MAX_OF_BlockByInputEmpty: 502
           - __MIN_OF_BlockByInputEmpty: 4
         - BlockByOutputFull: 0
         - BlockByPrecondition: 0
         - DegreeOfParallelism: 4
         - DriverPrepareTime: 41.537us
           - __MAX_OF_DriverPrepareTime: 87.626us
           - __MIN_OF_DriverPrepareTime: 25.631us
         - DriverTotalTime: 1s687ms
           - __MAX_OF_DriverTotalTime: 5s857ms
           - __MIN_OF_DriverTotalTime: 709.111ms

并行度是4,所以io task一共是 4*6

wupan-olo avatar May 16 '24 08:05 wupan-olo

instance number is not the smallest concurrency granularity in StarRocks,every instance will use multi threads in io thread pool to execute io task

before-Sunrise avatar May 16 '24 08:05 before-Sunrise

got it !

so, according to profile results blew, can we surmise that there is a data skew problem, whether or not it hits the block cache.

a demo that hits the cache

- IOTimeLocalDisk: 355.869ms
 - __MAX_OF_IOTimeLocalDisk: 25s710ms
 - __MIN_OF_IOTimeLocalDisk: 51.257us
- IOTimeRemote: 0ns

a demo that misses the cache

- IOTimeLocalDisk: 6.265ms
 - __MAX_OF_IOTimeLocalDisk: 70.878ms
 - __MIN_OF_IOTimeLocalDisk: 0ns
- IOTimeRemote: 4s430ms
 - __MAX_OF_IOTimeRemote: 55s242ms
 - __MIN_OF_IOTimeRemote: 0ns

FYI, we enabled the Block Cache, by set starlet_use_star_cache=true

kaka-zb avatar May 16 '24 09:05 kaka-zb