greptimedb
Incorrect schema returned from coprocessor generated recordbatches
This bug was detected while refactoring our HTTP output format in #361, when using the schema returned by RecordBatches and RecordBatchStream in the HTTP response.
In CoprStream, the record batch stream type generated from script execution, the schema returned from its stream() function was copied directly from its input. Since each output column from a coprocessor script has its own name and data type, this schema is incorrect in most cases.
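To make the mismatch concrete, here is a minimal Python model of the bug pattern. It is not the actual Rust code: `Batch`, `run_script_buggy`, and `schema_matches` are hypothetical names used only for illustration, and the schema is reduced to a plain dict.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    """Hypothetical stand-in for a record batch."""
    schema: dict   # column name -> type name
    columns: dict  # column name -> list of values


def run_script_buggy(batch):
    # The "script" turns the int column "cpu" into a float column "usage_pct".
    out = {"usage_pct": [v / 100 for v in batch.columns["cpu"]]}
    # Bug pattern: the input schema is passed through unchanged.
    return Batch(schema=batch.schema, columns=out)


def schema_matches(batch):
    # A schema is only usable if it names exactly the columns that exist.
    return set(batch.schema) == set(batch.columns)


inp = Batch(schema={"cpu": "int"}, columns={"cpu": [50, 75]})
result = run_script_buggy(inp)
```

In this model `schema_matches(inp)` holds but `schema_matches(result)` does not, which is the kind of inconsistency the HTTP output code ran into.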
In the current coprocessor design, the data type of the output is only known after the output data has been generated. There is a gen_schema function that generates the schema from the output data, one column at a time. And because Python is dynamically typed, a single column can even contain values of different types.
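A rough sketch of what inferring the schema from produced data involves, and why mixed types are a problem. This is plain Python for illustration, not the real gen_schema; `infer_column_type` and `infer_schema` are hypothetical helpers, and Python type names stand in for the engine's data types.

```python
def infer_column_type(values):
    """Return the one Python type name shared by all values, or raise."""
    seen = {type(v).__name__ for v in values}
    if len(seen) != 1:
        # Either the column is empty or it mixes types; neither gives a schema.
        raise TypeError(f"column has mixed or no types: {sorted(seen)}")
    return seen.pop()


def infer_schema(columns):
    """Map column name -> inferred type name. Only possible once data exists."""
    return {name: infer_column_type(values) for name, values in columns.items()}


output = {"cpu": [0.5, 0.7], "host": ["a", "b"]}
schema = infer_schema(output)
```

Note that an empty column cannot be typed at all this way, which is another reason to know the output types before the script runs.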
Proposed solutions
Add a data type declaration to the @copr decorator, at least for its output, so that we can generate the output schema ahead of time. This also prevents a coprocessor script from producing multiple types in a single column.
Similar approaches:
- Snowflake's user-defined functions, when written in dynamic languages like JavaScript, require a type declaration for their output: https://docs.snowflake.com/en/sql-reference/udf-overview.html
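A declaration on the decorator could look roughly like the sketch below. This is only an assumption about the shape of the API: `typed_copr` and its `returns` mapping are hypothetical and are not the existing @copr signature. The point is that the output schema is available at decoration time, and each output column can be checked against its declared type.

```python
import functools


def typed_copr(returns):
    """Hypothetical stand-in for @copr; `returns` maps output column -> Python type."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            columns = func(*args, **kwargs)
            if not isinstance(columns, tuple):
                columns = (columns,)  # treat a single column as a 1-tuple
            if len(columns) != len(returns):
                raise ValueError("number of output columns differs from declaration")
            for (name, ty), column in zip(returns.items(), columns):
                for value in column:
                    if type(value) is not ty:
                        raise TypeError(
                            f"column {name!r}: expected {ty.__name__}, "
                            f"got {type(value).__name__}"
                        )
            return columns

        # Known before the script ever runs.
        wrapper.output_schema = {name: ty.__name__ for name, ty in returns.items()}
        return wrapper
    return decorate


@typed_copr(returns={"usage_pct": float})
def to_percent(cpu):
    return [v * 100.0 for v in cpu]
```

Here `to_percent.output_schema` can be read without executing the script, and a script that returns an int where a float was declared fails with a TypeError instead of silently producing a mixed column.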
Yeah, Python itself supports type annotations, and I remember adding support for using those annotations for RecordBatch while writing the coprocessor. I think a

    def a() -> i64:
        return 1

is still supported. I'm going to check the code and update the test cases.
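For reference, a return annotation like this can be read from the script source without executing it. The sketch below uses Python's own `ast` module (`ast.unparse` needs Python 3.9+); it is not how the engine does it, and `declared_return_type` is a hypothetical helper.

```python
import ast

SCRIPT = '''
def a() -> i64:
    return 1
'''


def declared_return_type(source, func_name):
    """Return the return annotation of `func_name` as source text, or None."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            if node.returns is None:
                return None  # function has no return annotation
            return ast.unparse(node.returns)
    return None  # no such function in the script


ret = declared_return_type(SCRIPT, "a")
```

Because the script is only parsed, `i64` does not have to be a name defined in Python.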
@discord9 because of this line https://github.com/GreptimeTeam/greptimedb/blob/develop/src/script/src/python/engine.rs#L41
In fact, we already generate the new schema for the coprocessor execution result, but we forgot to replace the old one here.