datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

TPC-H Query 6 has a wrong result

Open alamb opened this issue 4 years ago • 7 comments

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-12218

TPC-H Query 6 gives a wrong result according to the test in the benchmarks.

{{TPCH_DATA=[..]/tpch-dbgen cargo test --release}} Query 6 iteration 0 took 6137.1 ms Query 6 avg time: 6137.09 ms thread 'tests::q6' panicked at 'assertion failed: (left == right)  left: ["123141078.23"], right: ["75207768.18550001"]', benchmarks/src/bin/tpch.rs:1684:17

[~alamb]

alamb avatar Apr 26 '21 13:04 alamb

The problem with the query is that 0.5 + 0.2 gets parsed as 0.69999999_

Projection: #SUM(bench.public.lineitem.l_extendedprice * bench.public.lineitem.l_discount) AS revenue
  Aggregate: groupBy=[[]], aggr=[[SUM(#bench.public.lineitem.l_extendedprice * #bench.public.lineitem.l_discount)]]
    Filter: #bench.public.lineitem.l_shipdate >= Date32("8766") AND #bench.public.lineitem.l_shipdate < Date32("9131") AND #bench.public.lineitem.l_discount BETWEEN Float64(0.049999999999999996) AND Float64(0.06999999999999999) AND #bench.public.lineitem.l_quantity < Int64(24)
      TableScan: bench.public.lineitem projection=Some([l_quantity, l_extendedprice, l_discount, l_shipdate]), partial_filters=[#bench.public.lineitem.l_shipdate >= Date32("8766"), #bench.public.lineitem.l_shipdate < Date32("9131"), #bench.public.lineitem.l_discount BETWEEN Float64(0.049999999999999996) AND Float64(0.06999999999999999), #bench.public.lineitem.l_quantity < Int64(24)]

nevi-me avatar Jun 25 '22 19:06 nevi-me

🤔 maybe we need to use Decimal type rather than Float for constants. When we initially ran TPCH we did not have support for this data type, but now thanks to the work from @liukun4515 (and others) we do!

alamb avatar Jun 26 '22 12:06 alamb

🤔 maybe we need to use Decimal type rather than Float for constants. When we initially ran TPCH we did not have support for this data type, but now thanks to the work from @liukun4515 (and others) we do!

That would probably do it. Postgres also converts decimal literals in the sql statement to decimal types. https://stackoverflow.com/questions/68142210/what-is-the-type-of-100-0-in-postgresql

I see where to make the change (parse_sql_number() inside planner.rs), but I'm struggling trying to figure out how to turn the string input to a Decimal128 type. Can you give me some pointers?

kmitchener avatar Sep 07 '22 18:09 kmitchener

Hm, I realize now I should be converting from float to Decimal128 :)

kmitchener avatar Sep 07 '22 19:09 kmitchener

@kmitchener After convert to decimal128, can we get the right result for Q6?

liukun4515 avatar Sep 08 '22 09:09 liukun4515

I see where to make the change (parse_sql_number() inside planner.rs), but I'm struggling trying to figure out how to turn the string input to a Decimal128 type. Can you give me some pointers?

Do you want to convert string value of string data type like "123.13" to the decimal128 data type?

liukun4515 avatar Sep 08 '22 09:09 liukun4515

hi @liukun4515, no, with decimal its still not the right result and I'm not sure why yet. but it's much closer:

Left:  ["123141078.23"]
Right: ["123140554.79"]

I think it's correct behavior to use the Decimal type to represent non-integer numeric literals in the SQL, so I'm going to open a separate issue for that and will ask you to review that, because I'm unsure how to convert to Decimal128 efficiently.

kmitchener avatar Sep 08 '22 11:09 kmitchener