[FEATURE] PPL ClickBench queries should produce results consistent with ClickHouse/Spark

Open penghuo opened this issue 2 months ago • 4 comments

Is your feature request related to a problem? Two queries in PPLClickBenchIT produce results that are inconsistent with the Spark ClickBench results.

Q43

Current

  • PPL
source=hits
| where CounterID = 62 and EventDate >= '2013-07-01 00:00:00' and EventDate <= '2013-07-15 00:00:00' and IsRefresh = 0 and DontCountHits = 0
| eval M = date_format(EventTime, '%Y-%m-%d %H:00:00')
| stats bucket_nullable=false count() as PageViews by M
| sort M
| head 10 from 1000

Expected

  • PPL
source=hits | where CounterID = 62 and EventDate >= '2013-07-01 00:00:00' and EventDate <= '2013-07-15 00:00:00' and IsRefresh = 0 and DontCountHits = 0 | stats bucket_nullable=false count by span(EventTime, 1m) as M | sort M | head 10 from 1000
  • Spark SQL
SELECT DATE_TRUNC('minute', EventTime) AS M, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-14' AND EventDate <= '2013-07-15' AND IsRefresh = 0 AND DontCountHits = 0 GROUP BY DATE_TRUNC('minute', EventTime) ORDER BY DATE_TRUNC('minute', EventTime) LIMIT 10 OFFSET 1000;
  • results
+-------------------+---------+
|                  M|PageViews|
+-------------------+---------+
|2013-07-15 12:40:00|      513|
|2013-07-15 12:41:00|      457|
|2013-07-15 12:42:00|      470|
|2013-07-15 12:43:00|      468|
|2013-07-15 12:44:00|      453|
|2013-07-15 12:45:00|      462|
|2013-07-15 12:46:00|      481|
|2013-07-15 12:47:00|      458|
|2013-07-15 12:48:00|      466|
|2013-07-15 12:49:00|      467|
+-------------------+---------+
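Note that the two queries also bucket timestamps at different granularities: the current PPL query's `date_format(EventTime, '%Y-%m-%d %H:00:00')` truncates to the hour, while Spark SQL's `DATE_TRUNC('minute', EventTime)` truncates to the minute. A minimal plain-Python sketch of the mismatch (illustration only, using neither engine):

```python
from datetime import datetime

ts = datetime(2013, 7, 15, 12, 40, 37)

# Hour-level bucket, like the current PPL query's date_format pattern
hour_bucket = ts.strftime('%Y-%m-%d %H:00:00')

# Minute-level bucket, like Spark SQL's DATE_TRUNC('minute', ...)
minute_bucket = ts.replace(second=0, microsecond=0).strftime('%Y-%m-%d %H:%M:%S')

print(hour_bucket)    # 2013-07-15 12:00:00
print(minute_bucket)  # 2013-07-15 12:40:00
```

With hour-level buckets, the rows at offset 1000 fall into different groups than with minute-level buckets, so the two queries cannot be expected to return the same page of results.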

Q28

Current

  • PPL
source=hits
| where URL != ''
| stats bucket_nullable=false avg(length(URL)) as l, count() as c by CounterID
| where c > 100000
| sort - l
| head 25
  • results
l,c,CounterID
453.1628285069236,2938865,233773
261.55616482670234,2510103,245438
230.68442591186866,3574007,122612
198.08936562473812,238660,234004
187.9562693941447,323229,1634
180.36009889818135,120528,786
139.11823037965326,216408,114157
123.96623033619909,146907,515
121.86100788770537,858171,256004
116.91921850037136,374306,95427
106.89180332891429,7115413,199550
103.7586784846365,494614,220992
95.01398682515553,163797,196239
92.1134607276574,396093,96948
91.9427013479645,738150,62
90.26380938339348,311998,188878
90.00957407022646,120325,249603
85.8864014117864,8527069,3922
85.03551145479048,124664,191697
82.76263550290444,131178,97467
82.05937991006292,802561,186300
74.73223497987863,169223,230962
74.65631730856313,253961,77639
74.13011535042938,605286,146891
73.96797959706166,507770,38

Expected

  • Spark SQL
SELECT CounterID, AVG(length(URL)) AS l, COUNT(*) AS c FROM hits WHERE URL <> '' GROUP BY CounterID HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
+---------+------------------+-------+
|CounterID|                 l|      c|
+---------+------------------+-------+
|   233773| 453.1628285069236|2938865|
|   245438|261.55616482670234|2510103|
|   122612|230.68442591186866|3574007|
|   234004|198.08936562473812| 238660|
|     1634| 187.9562693941447| 323229|
|      786|180.36009889818135| 120528|
|   114157|139.11823037965326| 216408|
|      515|123.96623033619909| 146907|
|   256004|121.86100672243644| 858171|
|    95427|116.91921850037136| 374306|
|   199550|106.89180332891429|7115413|
|   220992| 103.7586784846365| 494614|
|   196239| 95.01398682515553| 163797|
|    96948|  92.1134607276574| 396093|
|       62|  91.9427013479645| 738150|
|   188878| 90.26380938339348| 311998|
|   249603| 90.00957407022646| 120325|
|     3922| 85.88640117723921|8527069|
|   191697| 85.03551145479048| 124664|
|    97467| 82.76263550290444| 131178|
+---------+------------------+-------+

penghuo avatar Nov 07 '25 16:11 penghuo

Spark just returns 20 rows instead of the expected 25 rows. Did you run spark.sql("xxxx").show()? You need to change it to .show(25).
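To make the point concrete: DataFrame.show() prints only 20 rows unless an explicit row count is passed. A tiny plain-Python sketch of that truncation behavior (a stand-in for the Spark call, not Spark itself):

```python
def show(rows, n=20):
    # Mimic DataFrame.show(): only the first n rows are displayed (default 20)
    return rows[:n]

result = list(range(25))            # pretend the query returned 25 rows
assert len(show(result)) == 20      # .show() silently drops the last 5
assert len(show(result, 25)) == 25  # .show(25) displays them all
```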

LantaoJin avatar Nov 08 '25 07:11 LantaoJin

stats bucket_nullable=false count by span(EventTime, 1m) as M

Are you sure we can rewrite it to the PPL above, even though the result is the same? My intent was to test the datetime UDF in grouping in PPL.

LantaoJin avatar Nov 08 '25 07:11 LantaoJin

I will run an experiment to align Spark and PPL.

xinyual avatar Nov 10 '25 06:11 xinyual

stats bucket_nullable=false count by span(EventTime, 1m) as M

Are you sure we can rewrite it to the PPL above, even though the result is the same? My intent was to test the datetime UDF in grouping in PPL.

I assume span has better results than date_format?
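The semantics of a `span(EventTime, 1m)` bucket can be sketched in plain Python as flooring the epoch time to a fixed 60-second width, which avoids string formatting entirely (a sketch of the bucketing semantics under that assumption, not the engine's implementation):

```python
from datetime import datetime, timezone

def span_1m(ts: datetime) -> datetime:
    # span(EventTime, 1m): floor the epoch seconds to a 60-second bucket
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % 60, tz=timezone.utc)

ts = datetime(2013, 7, 15, 12, 40, 37, tzinfo=timezone.utc)
print(span_1m(ts))  # 2013-07-15 12:40:00+00:00
```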

penghuo avatar Nov 11 '25 16:11 penghuo