test-suite-sql-eval icon indicating copy to clipboard operation
test-suite-sql-eval copied to clipboard

Inconsistent SQL Query Results on Different Systems

Open minger-hsxz opened this issue 7 months ago • 0 comments

I encountered an issue while using your project. The problem arises because the pred results differ between Windows and Linux systems.

Description

Gold SQL:

SELECT YEAR FROM concert GROUP BY YEAR ORDER BY count(*) DESC LIMIT 1;

Pred SQL:

SELECT MAX(Year) FROM concert;

Schema and Data The table schema and sample data used are as follows:

CREATE TABLE "concert" (
  "concert_ID" int,
  "concert_Name" text,
  "Theme" text,
  "Stadium_ID" text,
  "Year" text,
  PRIMARY KEY ("concert_ID"),
  FOREIGN KEY ("Stadium_ID") REFERENCES "stadium"("Stadium_ID")
);

INSERT INTO "concert" VALUES (1, "Auditions", "Free choice", 1, 2014);
INSERT INTO "concert" VALUES (2, "Super bootcamp", "Free choice 2", 2, 2014);
INSERT INTO "concert" VALUES (3, "Home Visits", "Bleeding Love", 2, 2015);
INSERT INTO "concert" VALUES (4, "Week 1", "Wide Awake", 10, 2014);
INSERT INTO "concert" VALUES (5, "Week 1", "Happy Tonight", 9, 2015);
INSERT INTO "concert" VALUES (6, "Week 2", "Party All Night", 7, 2015);

Observations

When executing the queries on different systems, the results for pred were inconsistent:

On Windows 11:

Gold: [('2014',)] Pred: [('2015',)]

On Linux:

Gold: [('2015',)] Pred: [('2015',)]

Analysis

Both 2014 and 2015 have the same count of concerts (3 each). Therefore, either year could be returned as the correct result. However, the discrepancy between systems suggests an underlying difference in implementation that affects the pred query results.

The main issue is that pred is not consistent across different systems, leading to potential false negatives when comparing it to gold. This example highlights a potential flaw in the evaluation method used by the project.

Conclusion

This issue demonstrates that the same SQL query can yield different results on different systems, affecting the reliability of the evaluation process. Further investigation into the underlying cause of this discrepancy and how to handle such cases in the evaluation would be valuable.

minger-hsxz avatar Jul 06 '24 06:07 minger-hsxz