Support JSON in new expression framework

Open andylokandy opened this issue 2 years ago • 14 comments

Tasks

  1. Add DataType::Variant.

     The parsing and display rule:

     "JSON" => DataType::Variant
     "VARIANT" => DataType::Variant

     DataType::Variant => "VARIANT"

  2. Rename generic type AnyType to VariantType, and implement ArgType for it:

     ArgType::Scalar = common_expression::values::Scalar
     ArgType::ScalarRef<'a> = common_expression::values::ScalarRef<'a>
     ArgType::Column = Arc<[Scalar]>
     ArgType::Domain = ()  // upcasts to `Domain::Undefined`
     ArgType::ColumnBuilder = Vec<Scalar>
     ArgType::ColumnIterator<'a> = std::vec::Iter<'a, Scalar>

  3. Add conversion between Scalar and serde_json::Value. Then store Variant columns in arrow2's BinaryArray<i64>, which holds the serialized JSON strings.

  4. Add can_auto_cast_to rules:

     DataType::_ -> DataType::Variant

  5. Add CAST rule:

     DataType::_ -> DataType::Variant

     and TRY_CAST rule:

     // Must be successful
     DataType::_ -> DataType::Nullable(DataType::Variant)

     // Extract exact variant from JSON value, convert to NULL if type mismatches.
     DataType::Variant -> DataType::Nullable(_)
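
The naming rule and the TRY_CAST rule in the task list can be illustrated with a small, standalone sketch. Everything below is hypothetical and heavily reduced: `DataType`, `Json`, `Scalar`, `parse_type_name`, and `try_cast_from_variant` are stand-ins invented for illustration, not the real types or functions in the expression framework. Case-insensitive parsing and the "BOOLEAN"/"INT64" names are also assumptions.

```rust
use std::fmt;

/// Hypothetical, reduced stand-in for the framework's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Boolean,
    Int64,
    Variant,
}

/// Parsing rule: both "JSON" and "VARIANT" map to `DataType::Variant`.
/// (Case-insensitive matching is an assumption of this sketch.)
fn parse_type_name(name: &str) -> Option<DataType> {
    match name.to_ascii_uppercase().as_str() {
        "JSON" | "VARIANT" => Some(DataType::Variant),
        "BOOLEAN" => Some(DataType::Boolean),
        "INT64" => Some(DataType::Int64),
        _ => None,
    }
}

/// Display rule: `DataType::Variant` is always printed as "VARIANT".
impl fmt::Display for DataType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            DataType::Boolean => "BOOLEAN",
            DataType::Int64 => "INT64",
            DataType::Variant => "VARIANT",
        })
    }
}

/// Hypothetical, reduced stand-in for a JSON value held by a Variant.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Int(i64),
    Text(String),
}

/// Result of a cast; `Scalar::Null` plays the role of SQL NULL.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Null,
    Boolean(bool),
    Int64(i64),
    Variant(Json),
}

/// TRY_CAST from Variant to Nullable(target): extract the exact variant,
/// and fall back to NULL when the JSON value has a different type.
fn try_cast_from_variant(value: &Json, target: &DataType) -> Scalar {
    match (value, target) {
        (Json::Bool(b), DataType::Boolean) => Scalar::Boolean(*b),
        (Json::Int(i), DataType::Int64) => Scalar::Int64(*i),
        (v, DataType::Variant) => Scalar::Variant(v.clone()),
        _ => Scalar::Null, // type mismatch, including JSON null
    }
}
```

Under these assumptions, `parse_type_name("JSON")` and `parse_type_name("VARIANT")` both yield `DataType::Variant`, while casting the JSON string `"7"` to `Int64` yields NULL rather than parsing the string, which matches the "extract exact variant" wording.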

andylokandy avatar Jul 14 '22 17:07 andylokandy

@andylokandy hi, I am here again. I will take this issue

jiaoew1991 avatar Jul 25 '22 11:07 jiaoew1991

JSON is not like a number type; its structure and expression are quite complicated. Could you provide more information about this issue, such as how to display it, its relationship with the string type, whether Arrow supports it, etc.?

@andylokandy

jiaoew1991 avatar Jul 25 '22 11:07 jiaoew1991

@b41sh is the original author of the JSON type, ping for more help :)

BohuTANG avatar Jul 25 '22 11:07 BohuTANG

@jiaoew1991 Glad to see you again! I've added some instructions to the issue, please let me know if anything is not clear. 😉

andylokandy avatar Jul 26 '22 06:07 andylokandy

Some advice, @andylokandy @jiaoew1991: it's better to use DataType::Variant and DataType::VariantObject, with JSON and Object as aliases, because Variant is more general. We need to define a struct VariantValue as a wrapper around serde_json::Value, for two reasons:

  1. serde_json::Value does not implement the Ord and PartialOrd traits, which are needed for sorting.
  2. The performance of serde_json::Value is not good; we may replace it with another format in the future.
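
A minimal sketch of the wrapper idea follows, using only the standard library. `JsonValue` here is a hypothetical, reduced stand-in for serde_json::Value (objects are omitted), and the ordering itself (null < bool < number < string < array, with numbers compared by `f64::total_cmp`) is an assumption chosen for illustration, not an ordering anyone in this thread has agreed on.

```rust
use std::cmp::Ordering;

/// Hypothetical, reduced stand-in for `serde_json::Value` (no objects).
#[derive(Debug, Clone)]
enum JsonValue {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<JsonValue>),
}

/// Wrapper that adds a total order so values can be sorted.
#[derive(Debug, Clone)]
struct VariantValue(JsonValue);

/// Rank used to order values of different JSON types (assumed order).
fn rank(v: &JsonValue) -> u8 {
    match v {
        JsonValue::Null => 0,
        JsonValue::Bool(_) => 1,
        JsonValue::Number(_) => 2,
        JsonValue::String(_) => 3,
        JsonValue::Array(_) => 4,
    }
}

/// Compare same-typed values by content, differently typed values by rank.
fn compare(a: &JsonValue, b: &JsonValue) -> Ordering {
    match (a, b) {
        (JsonValue::Bool(x), JsonValue::Bool(y)) => x.cmp(y),
        // `total_cmp` gives f64 a total order, so NaN cannot break sorting.
        (JsonValue::Number(x), JsonValue::Number(y)) => x.total_cmp(y),
        (JsonValue::String(x), JsonValue::String(y)) => x.cmp(y),
        (JsonValue::Array(x), JsonValue::Array(y)) => {
            // Lexicographic: first differing element wins, then length.
            for (l, r) in x.iter().zip(y.iter()) {
                let ord = compare(l, r);
                if ord != Ordering::Equal {
                    return ord;
                }
            }
            x.len().cmp(&y.len())
        }
        _ => rank(a).cmp(&rank(b)),
    }
}

impl PartialEq for VariantValue {
    fn eq(&self, other: &Self) -> bool {
        compare(&self.0, &other.0) == Ordering::Equal
    }
}

impl Eq for VariantValue {}

impl PartialOrd for VariantValue {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for VariantValue {
    fn cmp(&self, other: &Self) -> Ordering {
        compare(&self.0, &other.0)
    }
}
```

With `Ord` in place, a `Vec<VariantValue>` can be sorted with the standard `sort()` method. Keeping the comparison inside the wrapper also means the inner representation could later be swapped for a faster format without touching call sites, which is the second reason given above.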

b41sh avatar Jul 26 '22 13:07 b41sh

I think we can just implement Variant, aka. Json, for now. Because I'm going to implement a static typed Map<T> so that JsonObject is semantically equal to Map<Variant>.

andylokandy avatar Jul 26 '22 14:07 andylokandy

@b41sh Can we replace serde_json::Value with common_expression::values::Scalar?

andylokandy avatar Jul 26 '22 14:07 andylokandy

> I think we can just implement Variant, aka. Json, for now. Because I'm going to implement a static typed Map<T> so that JsonObject is semantically equal to Map<Variant>.

I agree with you.

b41sh avatar Jul 26 '22 16:07 b41sh

> @b41sh Can we replace serde_json::Value with common_expression::values::Scalar?

common_expression::values::Scalar can indeed represent any type of data; it is a superset of serde_json::Value. But in that case we would need to add methods to parse raw JSON text into Scalar, and also to encode Scalar into a format suitable for storage in an Arrow column. This would make Scalar too complex, so I think it's more appropriate to define a different data type.

b41sh avatar Jul 26 '22 17:07 b41sh

FYI: Static typed map is added in https://github.com/datafuselabs/databend/pull/6838

andylokandy avatar Jul 26 '22 20:07 andylokandy

@b41sh Scalar can be converted to and from serde_json::Value, so serialization/deserialization will not be a big problem.
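
As a rough illustration of that storage path, the sketch below serializes values to JSON text and appends them to an offsets-plus-values layout with i64 offsets, which is the general shape of a large binary array. All names are hypothetical: `Scalar` is a reduced stand-in, `to_json_text` is a hand-written substitute for a real JSON serializer (it escapes only `"` and `\`), and `VariantColumn` is not arrow2's BinaryArray<i64>.

```rust
/// Hypothetical, reduced stand-in for the framework's `Scalar`.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Null,
    Boolean(bool),
    Int64(i64),
    String(String),
}

/// Serialize a scalar to JSON text.
/// Simplification: only `"` and `\` are escaped in strings.
fn to_json_text(s: &Scalar) -> String {
    match s {
        Scalar::Null => "null".to_string(),
        Scalar::Boolean(b) => b.to_string(),
        Scalar::Int64(i) => i.to_string(),
        Scalar::String(t) => {
            let mut out = String::from("\"");
            for c in t.chars() {
                match c {
                    '"' => out.push_str("\\\""),
                    '\\' => out.push_str("\\\\"),
                    _ => out.push(c),
                }
            }
            out.push('"');
            out
        }
    }
}

/// Offsets-plus-values layout: row `i` occupies
/// `values[offsets[i]..offsets[i + 1]]`.
struct VariantColumn {
    offsets: Vec<i64>,
    values: Vec<u8>,
}

impl VariantColumn {
    fn new() -> Self {
        // The offsets buffer always starts with 0.
        Self { offsets: vec![0], values: Vec::new() }
    }

    /// Append one row as serialized JSON bytes.
    fn push(&mut self, s: &Scalar) {
        self.values.extend_from_slice(to_json_text(s).as_bytes());
        self.offsets.push(self.values.len() as i64);
    }

    /// Number of rows stored in the column.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Read row `i` back as JSON text.
    fn get(&self, i: usize) -> &str {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        std::str::from_utf8(&self.values[start..end]).expect("rows are written as UTF-8")
    }
}
```

Reading a row back yields the JSON text, which would then be parsed into a value again; that parse step is left out of this sketch.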

andylokandy avatar Jul 27 '22 06:07 andylokandy

@b41sh @jiaoew1991 I've updated the tasks, PTAL

andylokandy avatar Jul 27 '22 10:07 andylokandy

> @b41sh @jiaoew1991 I've updated the tasks, PTAL

@andylokandy Got it, it looks more concise and clear

jiaoew1991 avatar Jul 27 '22 10:07 jiaoew1991

Hmmm, it seems that AnyType cannot be reused because the two have different column types. We may have to add a new VariantType.

andylokandy avatar Jul 27 '22 12:07 andylokandy