
Incorrect schema returned from coprocessor-generated recordbatches

Open sunng87 opened this issue 2 years ago • 2 comments

This bug was detected while refactoring our http output format in #361, where the schema returned by RecordBatches and RecordBatchStream is used in the http response.

In CoprStream, the recordbatch stream type produced by script execution, the schema returned from its stream() function is copied directly from its input. Since the output columns of a coprocessor script have their own names and data types, this schema is incorrect in most cases.

In the current coprocessor scenario, the data types of the output are only available after the output data is generated. There is a gen_schema function that builds the schema from the output data, column by column. And because of Python's dynamically typed nature, it is even possible to end up with different types of data in one column.
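For illustration, a minimal sketch (a hypothetical script, not from the codebase) of why the schema can only be inferred after execution, and why even that inference can break down for a single column:

# Hypothetical coprocessor body: because Python is dynamically typed,
# nothing stops a script from mixing types in one output column, so the
# output schema is only knowable after the data has been produced.
def flaky_column(values):
    out = []
    for v in values:
        # ints for positive inputs, strings otherwise: one column, two types
        out.append(v if v > 0 else str(v))
    return out

print(flaky_column([1, -2, 3]))  # [1, '-2', 3]: no single data type fits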

Proposed solutions

Add a data type declaration to the @copr decorator, at least for its output, so we can generate the output schema ahead of time. This approach also prevents a coprocessor script from producing multiple types in a single column.
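For illustration only, a toy sketch in plain Python (not GreptimeDB's actual API; the keyword names returns and return_types are made up) of what declaring output types up front could enable:

# Toy decorator: output column names and types are declared ahead of time,
# so the schema exists before the script runs and mismatching values can be
# rejected instead of silently producing a mixed-type column.
def copr(returns, return_types):
    def wrap(fn):
        def run(*args):
            print("schema:", list(zip(returns, return_types)))  # known up front
            out = fn(*args)
            for col, ty in zip(out, return_types):
                assert all(isinstance(v, ty) for v in col), "type mismatch"
            return out
        return run
    return wrap

@copr(returns=["load"], return_types=[float])
def load(cpu, mem):
    return [[c + m for c, m in zip(cpu, mem)]]

print(load([0.1, 0.2], [0.3, 0.4]))  # schema is available before execution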

Similar approaches:

  • Snowflake's user-defined functions, when written in dynamic languages like JavaScript, require a type declaration for their output: https://docs.snowflake.com/en/sql-reference/udf-overview.html

sunng87 avatar Oct 29 '22 10:10 sunng87

Yeah, Python itself supports type annotations, and while coding the coprocessor I remember adding support for using those type annotations with RecordBatch. I think a

def a() -> i64:
  return 1

is still supported. I am gonna check the code and update the test cases.

discord9 avatar Oct 31 '22 04:10 discord9

@discord9 It's because of this line: https://github.com/GreptimeTeam/greptimedb/blob/develop/src/script/src/python/engine.rs#L41

In fact, we already generate the new schema for the coprocessor execution result, but we forgot to replace it here.

killme2008 avatar Oct 31 '22 07:10 killme2008