datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Exception: Internal error, Exception: Schema error

Open ike560 opened this issue 3 years ago • 2 comments

Describe the bug

Case 1... Exception: Internal error: Impossibly got empty window expression. This was likely caused by a bug in DataFusion's code.
Case 2... Exception: Schema error(same cause as Case 1 ?)

To Reproduce

import datafusion
ctx = datafusion.SessionContext()
datafusion.__version__

'0.6.0'

!echo "a,b\n1,4\n2,5\n3,6" > example.csv
ctx.register_csv('example', 'example.csv')
ctx.sql('SELECT * from example').show()

+---+---+ | a | b | +---+---+ | 1 | 4 | | 2 | 5 | | 3 | 6 | +---+---+

Case 1... Exception: Internal error

sql = '''
SELECT AVG(b) AS median_value  
  FROM ( 
    SELECT 
      b, 
      COUNT(b) OVER () AS row_count, 
      ROW_NUMBER() OVER (ORDER BY b) AS row_number 
    FROM example 
  )
  ORDER BY median_value
  '''
df = ctx.sql(sql)
df.show()

Exception Traceback (most recent call last) Input In [65], in   1 sql = '''   2 SELECT AVG(b) AS median_value
  3 FROM (  (...)   10 ORDER BY median_value   11 '''   12 df = ctx.sql(sql) ---> 13 df.show()

Exception: Internal error: Impossibly got empty window expression. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Case 2... Exception: Schema error

sql = '''
SELECT AVG(b) AS median_value  
  FROM ( 
    SELECT 
      b, 
      COUNT(b) OVER () AS row_count, 
      ROW_NUMBER() OVER (ORDER BY b) AS row_number 
    FROM example 
  ) 
  WHERE row_number IN ((row_count + 1) / 2, (row_count + 2) / 2)  
  '''
df = ctx.sql(sql)
df.show()

Exception Traceback (most recent call last) Input In [57], in   1 sql = '''   2 SELECT AVG(b) AS median_value
  3 FROM (   (...)   10 WHERE row_number IN ((row_count + 1) / 2, (row_count + 2) / 2)
  11 '''   12 df = ctx.sql(sql) ---> 13 df.show()

Exception: Schema error: No field named 'row_number'. Valid fields are 'example.b'.

Additional context

SQL is meant to be reproduced, and the processing content has no meaning. I think SQL is correct. Is there a way around it?

ike560 avatar Jul 18 '22 09:07 ike560

You're using the datafusion-python project which is using version 8 of the DataFusion library. In latest DataFusion (master as of now), the first query works fine now thanks to recent fixes, but the second query returns with

thread 'tokio-runtime-worker' panicked at 'not implemented: InList does not yet support nested columns.', 

There's a recent PR to update datafusion-python to version 10 of the DataFusion library, which should fix the issue with the first query. The second query appears to trigger a new bug/unsupported feature.

kmitchener avatar Jul 18 '22 19:07 kmitchener

Thank you for your comment. Case 2 wants to get the median value, so consider other methods.

ghost avatar Jul 19 '22 01:07 ghost