flightsql-dbapi icon indicating copy to clipboard operation
flightsql-dbapi copied to clipboard

Superset-generated time range causes SQL error: ParserError("Expected end of statement, found: 00")

Open jdstrand opened this issue 2 years ago • 5 comments

Superset has the ability to set a time range in its Chart Graph functionality. When set, the resulting SQL has a WHERE clause of this form:

WHERE time >= 2023-01-08 00:00:00.000000
  AND time < 2023-01-15 00:00:00.000000

This results in: executing query: failed to create Flight record reader: arrow/flight: could not create flight reader: arrow/ipc: could not read schema from stream: arrow/ipc: could not read message schema: rpc error: code = InvalidArgument desc = Error while planning query: SQL error: ParserError("Expected end of statement, found: 00").

A simple reproducer that generates an error without superset is:

SELECT * FROM "some_table"
WHERE time >= 2023-01-08 00:00:00.000000
  AND time < 2023-01-15 00:00:00.000000
LIMIT 10;

We can make the error go away if we quote the date strings:

SELECT * FROM "some_table"
WHERE time >= '2023-01-08 00:00:00.000000'
  AND time < '2023-01-15 00:00:00.000000'
LIMIT 10;

https://arrow.apache.org/datafusion/user-guide/sql/select.html#where-clause is pretty lean on details, but AIUI, it is using sqlparser which is ANSI SQL:2011. I tried to find the grammar for that but apparently it is behind a paywall. Postgres is mostly ANSI SQL:2016 compliant and section 8.5.1.3 of https://www.postgresql.org/docs/15/datatype-datetime.html speaks of TIMESTAMP '2004-10-19 10:23:54' being a valid SQL standard timestamp.

As such, this is likely most correct (ie, flightsql-dbapi likely should generate this form of WHERE):

SELECT * FROM "some_table"
WHERE time >= timestamp '2023-01-08 00:00:00.000000'
  AND time < timestamp '2023-01-15 00:00:00.000000'
LIMIT 10;

jdstrand avatar Feb 07 '23 22:02 jdstrand

@jdstrand I took a look and I'm not able to reproduce the issue on my side. I added some logging to print out the query handed to the DB API layer. Here's what my Superset UI shows:

Screenshot 2023-02-07 at 8 12 43 PM Screenshot 2023-02-07 at 8 15 33 PM

Here are the queries its producing:

superset_app          | EXECUTE SELECT time AS __timestamp,
superset_app          |                AVG(co) AS "AVG(co)"
superset_app          | FROM iox."airSensors"
superset_app          | WHERE time >= '2023-02-07 01:11:41.000000'
superset_app          |   AND time < '2023-02-08 01:11:41.000000'
superset_app          | GROUP BY time
superset_app          | ORDER BY "AVG(co)" DESC
superset_app          | LIMIT 10000 None
superset_app          | EXECUTE SELECT co AS co
superset_app          | FROM iox."airSensors"
superset_app          | WHERE time >= '2023-02-08 00:00:00.000000'
superset_app          |   AND time < '2023-02-08 00:00:00.000000'
superset_app          | GROUP BY co
superset_app          | LIMIT 1000 None

The trailing None is just the parameter list passed to the execute method in the DB API interface and can be ignored.

You can see it has single-quoted the datetimes in my case. I've tried a few variants including the custom, relative and "advanced" ranges. What am I missing?

brettbuddin avatar Feb 08 '23 01:02 brettbuddin

Reproducer from within Superset:

  1. follow https://docs.influxdata.com/influxdb/cloud-iox/visualize-data/superset/. My checkout is superset 9dfaad772d2ac42c35519a55e7b3df27ca03b072
  2. connect to an IOx database/bucket/namespace using datafusion+flightsql://<CLOUD>.cloud2.influxdata.com:443?bucket-name=BUCKETNAME&token=TOKEN (I chose an internal Multi-Tenant Cloud cluster, but presumably a public one would work too)
  3. go to the SQL lab and enter this (assumes that "some_table" has some fields, a timestamp column, etc):
    SELECT * FROM "some_table"
    LIMIT 1000
    
  4. Click Run and observe returned rows
  5. Click 'Create Chart'
  6. Create the chart:
    • Select 'Time Series Area' chart
    • Time Column: time (or whatever your timestamp column is)
    • Query:
      • Metrics: select a column and choose COUNT as aggregate
      • Dimensions: select another column
  7. Click 'Update Chart'. It should display something
  8. Click the 'Time Range' drop down, choose Custom and create a time range (under 'Actual' mine shows 2022-11-30 ≤ col < 2023-02-08). Click Apply
  9. Click 'Update Chart' and observe the error: Error: Flight returned invalid argument error, with message: Error while planning query: SQL error: ParserError("Expected end of statement, found: 00"). gRPC client debug context: UNKNOWN:Error received from peer ipv4:<redacted>:443 {grpc_message:"Error while planning query: SQL error: ParserError(\"Expected end of statement, found: 00\")", grpc_status:3, created_time:"2023-02-08T15:14:42.395137567+00:00"}. Client context: IOError: Server never sent a data message. Detail: Internal
  10. Click the ... icon in the upper right and click View Query. It shows unquoted times:
    WHERE time >= 2022-11-30 00:00:00.000000
      AND time < 2023-02-08 00:00:00.000000
    

Modifying step 8 to use 'No filter' makes it work again. Using 'Last', 'Previous' or 'Advanced' all causes the error. I also mimicked your use of custom with 'Start Relative Date/Time: 1 days before' and 'End: Now' to have the equivalent hover over in superset and still see the same behavior.

Updating step 6 to use 'Time Series Bar', 'Time Series Line', or 'Time Series Table' all results in the same error.

Wondering if it has something to do with the data here is an updated query on real data:

SELECT action, actor, time FROM "gh-audit"
WHERE actor = 'dependabot[bot]'
LIMIT 4

Clicking Run and downloading the CSV:

action,actor,time
pull_request.create,dependabot[bot],2023-01-24 13:31:48.795000
pull_request.create,dependabot[bot],2023-01-24 09:57:10.684000
pull_request.create,dependabot[bot],2023-01-24 01:06:51.266000
pull_request.create,dependabot[bot],2023-01-24 01:08:43.016000

If I start with step '5' above, I can still reproduce (Metrics is COUNT(action) and Dimension is actor). Presumably, this could be imported as a table in a bucket/namespace you own.

jdstrand avatar Feb 08 '23 15:02 jdstrand

I'll review this today. Thanks for the thorough instructions.

brettbuddin avatar Feb 08 '23 15:02 brettbuddin

I was able to reproduce this issue as @jdstrand describes above. The difference between the way he entered the chart builder is that he's coming in via the SQL Lab. The SQL Lab flow allows you to build a query that can then be used as a virtual table (via a sub-select). Here's a screenshot comparing the output SQL for both cases:

Screenshot 2023-02-08 at 3 05 32 PM

I'm unsure what the difference is in how SQLAlchemy is involved—or if it's using a completely different way of generating queries if querying via the virtual table. Looking into that now.

brettbuddin avatar Feb 08 '23 20:02 brettbuddin

Even weirder, @jdstrand: If you create a Dataset from the SQL Lab query and go through the same flow, you arrive at this:

Screenshot 2023-02-08 at 3 17 07 PM

So if it's coming from a Dataset at all—whether it's the abstract table or a specific select statement—it is able to quote the dates correctly. However, if you come in from the SQL Lab and don't attempt to persist the query into a Dataset record, it doesn't single-quote them.

brettbuddin avatar Feb 08 '23 20:02 brettbuddin