flightsql-dbapi
flightsql-dbapi copied to clipboard
Superset-generated time range causes SQL error: ParserError("Expected end of statement, found: 00")
Superset has the ability to set a time range in its Chart Graph functionality. When set, the resulting SQL has a WHERE clause of this form:
WHERE time >= 2023-01-08 00:00:00.000000
AND time < 2023-01-15 00:00:00.000000
This results in: executing query: failed to create Flight record reader: arrow/flight: could not create flight reader: arrow/ipc: could not read schema from stream: arrow/ipc: could not read message schema: rpc error: code = InvalidArgument desc = Error while planning query: SQL error: ParserError("Expected end of statement, found: 00").
A simple reproducer that generates an error without superset is:
SELECT * FROM "some_table"
WHERE time >= 2023-01-08 00:00:00.000000
AND time < 2023-01-15 00:00:00.000000
LIMIT 10;
We can make the error go away if we quote the date strings:
SELECT * FROM "some_table"
WHERE time >= '2023-01-08 00:00:00.000000'
AND time < '2023-01-15 00:00:00.000000'
LIMIT 10;
https://arrow.apache.org/datafusion/user-guide/sql/select.html#where-clause is pretty lean on details, but AIUI, it is using sqlparser which is ANSI SQL:2011. I tried to find the grammar for that but apparently it is behind a paywall. Postgres is mostly ANSI SQL:2016 compliant and section 8.5.1.3 of https://www.postgresql.org/docs/15/datatype-datetime.html speaks of TIMESTAMP '2004-10-19 10:23:54' being a valid SQL standard timestamp.
As such, this is likely most correct (ie, flightsql-dbapi likely should generate this form of WHERE):
SELECT * FROM "some_table"
WHERE time >= timestamp '2023-01-08 00:00:00.000000'
AND time < timestamp '2023-01-15 00:00:00.000000'
LIMIT 10;
@jdstrand I took a look and I'm not able to reproduce the issue on my side. I added some logging to print out the query handed to the DB API layer. Here's what my Superset UI shows:
Here are the queries its producing:
superset_app | EXECUTE SELECT time AS __timestamp,
superset_app | AVG(co) AS "AVG(co)"
superset_app | FROM iox."airSensors"
superset_app | WHERE time >= '2023-02-07 01:11:41.000000'
superset_app | AND time < '2023-02-08 01:11:41.000000'
superset_app | GROUP BY time
superset_app | ORDER BY "AVG(co)" DESC
superset_app | LIMIT 10000 None
superset_app | EXECUTE SELECT co AS co
superset_app | FROM iox."airSensors"
superset_app | WHERE time >= '2023-02-08 00:00:00.000000'
superset_app | AND time < '2023-02-08 00:00:00.000000'
superset_app | GROUP BY co
superset_app | LIMIT 1000 None
The trailing None is just the parameter list passed to the execute method in the DB API interface and can be ignored.
You can see it has single-quoted the datetimes in my case. I've tried a few variants including the custom, relative and "advanced" ranges. What am I missing?
Reproducer from within Superset:
- follow https://docs.influxdata.com/influxdb/cloud-iox/visualize-data/superset/. My checkout is superset
9dfaad772d2ac42c35519a55e7b3df27ca03b072 - connect to an IOx database/bucket/namespace using
datafusion+flightsql://<CLOUD>.cloud2.influxdata.com:443?bucket-name=BUCKETNAME&token=TOKEN(I chose an internal Multi-Tenant Cloud cluster, but presumably a public one would work too) - go to the SQL lab and enter this (assumes that "some_table" has some fields, a timestamp column, etc):
SELECT * FROM "some_table" LIMIT 1000 - Click Run and observe returned rows
- Click 'Create Chart'
- Create the chart:
- Select 'Time Series Area' chart
- Time Column:
time(or whatever your timestamp column is) - Query:
- Metrics: select a column and choose COUNT as aggregate
- Dimensions: select another column
- Click 'Update Chart'. It should display something
- Click the 'Time Range' drop down, choose Custom and create a time range (under 'Actual' mine shows
2022-11-30 ≤ col < 2023-02-08). Click Apply - Click 'Update Chart' and observe the error:
Error: Flight returned invalid argument error, with message: Error while planning query: SQL error: ParserError("Expected end of statement, found: 00"). gRPC client debug context: UNKNOWN:Error received from peer ipv4:<redacted>:443 {grpc_message:"Error while planning query: SQL error: ParserError(\"Expected end of statement, found: 00\")", grpc_status:3, created_time:"2023-02-08T15:14:42.395137567+00:00"}. Client context: IOError: Server never sent a data message. Detail: Internal - Click the
...icon in the upper right and click View Query. It shows unquoted times:WHERE time >= 2022-11-30 00:00:00.000000 AND time < 2023-02-08 00:00:00.000000
Modifying step 8 to use 'No filter' makes it work again. Using 'Last', 'Previous' or 'Advanced' all causes the error. I also mimicked your use of custom with 'Start Relative Date/Time: 1 days before' and 'End: Now' to have the equivalent hover over in superset and still see the same behavior.
Updating step 6 to use 'Time Series Bar', 'Time Series Line', or 'Time Series Table' all results in the same error.
Wondering if it has something to do with the data here is an updated query on real data:
SELECT action, actor, time FROM "gh-audit"
WHERE actor = 'dependabot[bot]'
LIMIT 4
Clicking Run and downloading the CSV:
action,actor,time
pull_request.create,dependabot[bot],2023-01-24 13:31:48.795000
pull_request.create,dependabot[bot],2023-01-24 09:57:10.684000
pull_request.create,dependabot[bot],2023-01-24 01:06:51.266000
pull_request.create,dependabot[bot],2023-01-24 01:08:43.016000
If I start with step '5' above, I can still reproduce (Metrics is COUNT(action) and Dimension is actor). Presumably, this could be imported as a table in a bucket/namespace you own.
I'll review this today. Thanks for the thorough instructions.
I was able to reproduce this issue as @jdstrand describes above. The difference between the way he entered the chart builder is that he's coming in via the SQL Lab. The SQL Lab flow allows you to build a query that can then be used as a virtual table (via a sub-select). Here's a screenshot comparing the output SQL for both cases:
I'm unsure what the difference is in how SQLAlchemy is involved—or if it's using a completely different way of generating queries if querying via the virtual table. Looking into that now.
Even weirder, @jdstrand: If you create a Dataset from the SQL Lab query and go through the same flow, you arrive at this:
So if it's coming from a Dataset at all—whether it's the abstract table or a specific select statement—it is able to quote the dates correctly. However, if you come in from the SQL Lab and don't attempt to persist the query into a Dataset record, it doesn't single-quote them.