[VL] Support map_concat Spark function
Adds support for the map_concat Spark function as part of #4039.
Added a supporting test case and updated the documentation.
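For reference, a minimal usage sketch of the function being offloaded (column names and the `spark` session are illustrative, not from the patch):

```scala
import org.apache.spark.sql.functions.{col, lit, map, map_concat}

// Two single-entry map columns; map_concat merges them into one map.
val df = spark.range(1).select(
  map(lit("a"), lit(1)).as("m1"),
  map(lit("b"), lit(2)).as("m2"))

df.select(map_concat(col("m1"), col("m2")).as("merged")).collect()
// => Array(Row(Map("a" -> 1, "b" -> 2)))
```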
Thanks for opening a pull request!
Could you open an issue for this pull request on GitHub Issues?
https://github.com/apache/incubator-gluten/issues
Then could you also rename the commit message and the pull request title in the following format?
[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}
Run Gluten Clickhouse CI
df.collect() works as expected for the map_concat function (it converts to a Velox plan).
df.show() still executes in Spark due to a fallback. The reason for the fallback is that, for df.show(), Spark adds a limit of 20 rows and tries to cast all the columns to the string data type.
(Code snippet from Dataset.scala)
```scala
private[sql] def getRows(
    numRows: Int,
    truncate: Int): Seq[Seq[String]] = {
  val newDf = toDF()
  val castCols = newDf.logicalPlan.output.map { col =>
    // Since binary types in top-level schema fields have a specific format to print,
    // so we do not cast them to strings here.
    if (col.dataType == BinaryType) {
      Column(col)
    } else {
      Column(col).cast(StringType)
    }
  }
  // ...
```
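The difference shows up in a minimal repro (a sketch, assuming a Gluten-enabled session):

```scala
val df = spark.sql("SELECT map_concat(map(1, 'a'), map(2, 'b')) AS m")

df.collect() // offloaded: the projection is converted to a Velox plan
df.show()    // falls back: Spark rewrites the output as CAST(m AS STRING) for printing
```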
The CAST expression is then transformed in Gluten into a CastTransformer, but it fails validation since casting from the map type is not supported in Gluten yet. (Code snippet from SubstraitToVeloxPlanValidator.cc)
```cpp
switch (input->type()->kind()) {
  case TypeKind::ARRAY:
  case TypeKind::MAP:
  case TypeKind::ROW:
  case TypeKind::VARBINARY:
    LOG_VALIDATION_MSG("Invalid input type in casting: ARRAY/MAP/ROW/VARBINARY.");
    return false;
  case TypeKind::TIMESTAMP: {
    LOG_VALIDATION_MSG("Casting from TIMESTAMP is not supported or has incorrect result.");
    return false;
  }
  default: {
  }
}
```
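So the same fallback can be triggered directly by casting a map column to string (a sketch; `m` is the map column from the repro above):

```scala
import org.apache.spark.sql.functions.col

// Validation rejects CAST from MAP, so this projection stays on vanilla Spark.
df.select(col("m").cast("string")).explain()
```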
@rui-mo can you please check this patch again? I have added a few comments and rewritten the Spark test case, since no exception is thrown for duplicate keys; Velox already handles them by keeping the last value.
Hi @Surbhi-Vijay, I notice MapKeyDedupPolicy.EXCEPTION seems to be the default behavior of Spark, is that right? In your opinion, do we need to customize the map_concat function in Velox to allow configurable behavior?
Yes, MapKeyDedupPolicy.EXCEPTION is the default behavior in Spark. An exception will have to be thrown from Velox if the behavior should be the same as Spark's. The Velox implementation is based on Presto, where LAST_WIN is the default behavior.
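For reference, the two behaviors can be compared in vanilla Spark via the spark.sql.mapKeyDedupPolicy config (a sketch):

```scala
// Spark default: duplicate keys raise an error.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "EXCEPTION")
spark.sql("SELECT map_concat(map(1, 'a'), map(1, 'b'))").collect() // throws

// Presto/Velox-style behavior: the last value for a duplicate key wins.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
spark.sql("SELECT map_concat(map(1, 'a'), map(1, 'b'))").collect() // Map(1 -> "b")
```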
Maybe we can introduce a change in Velox to throw an exception based on a config that would only be used in Gluten.
Please let me know your thoughts.
@rui-mo @PHILO-HE can you please check this comment, so we can finalize the next set of actions for this patch?
> Maybe we can introduce a change in Velox to throw an exception based on a config that would only be used in Gluten. Please let me know your thoughts.
@Surbhi-Vijay, yes, it would be better to introduce a config in Velox to support those two behaviors. I guess Spark's default EXCEPTION behavior is commonly used, so please fix it in Velox first. cc @rui-mo