Exception: Internal error, Exception: Schema error
Describe the bug
Case 1... Exception: Internal error: Impossibly got empty window expression. This was likely caused by a bug in DataFusion's code.
Case 2... Exception: Schema error(same cause as Case 1 ?)
To Reproduce
import datafusion
ctx = datafusion.SessionContext()
datafusion.__version__
'0.6.0'
!echo "a,b\n1,4\n2,5\n3,6" > example.csv
ctx.register_csv('example', 'example.csv')
ctx.sql('SELECT * from example').show()
+---+---+ | a | b | +---+---+ | 1 | 4 | | 2 | 5 | | 3 | 6 | +---+---+
Case 1... Exception: Internal error
sql = '''
SELECT AVG(b) AS median_value
FROM (
SELECT
b,
COUNT(b) OVER () AS row_count,
ROW_NUMBER() OVER (ORDER BY b) AS row_number
FROM example
)
ORDER BY median_value
'''
df = ctx.sql(sql)
df.show()
Exception Traceback (most recent call last)
Input In [65], in
3 FROM (
(...)
10 ORDER BY median_value
11 '''
12 df = ctx.sql(sql)
---> 13 df.show()
Exception: Internal error: Impossibly got empty window expression. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
Case 2... Exception: Schema error
sql = '''
SELECT AVG(b) AS median_value
FROM (
SELECT
b,
COUNT(b) OVER () AS row_count,
ROW_NUMBER() OVER (ORDER BY b) AS row_number
FROM example
)
WHERE row_number IN ((row_count + 1) / 2, (row_count + 2) / 2)
'''
df = ctx.sql(sql)
df.show()
Exception Traceback (most recent call last)
Input In [57], in
3 FROM (
(...)
10 WHERE row_number IN ((row_count + 1) / 2, (row_count + 2) / 2)
11 '''
12 df = ctx.sql(sql)
---> 13 df.show()
Exception: Schema error: No field named 'row_number'. Valid fields are 'example.b'.
Additional context
SQL is meant to be reproduced, and the processing content has no meaning. I think SQL is correct. Is there a way around it?
You're using the datafusion-python project which is using version 8 of the DataFusion library. In latest DataFusion (master as of now), the first query works fine now thanks to recent fixes, but the second query returns with
thread 'tokio-runtime-worker' panicked at 'not implemented: InList does not yet support nested columns.',
There's a recent PR to update datafusion-python to version 10 of the DataFusion library, which should fix the issue with the first query. The second query appears to trigger a new bug/unsupported feature.
Thank you for your comment. Case 2 wants to get the median value, so consider other methods.