
[VL] Support map_concat spark function

Open · Surbhi-Vijay opened this issue 1 year ago · 11 comments

Adds support for the map_concat Spark function as part of #4039.

Added a supporting test case and updated the documentation.
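For reference, a minimal sketch of the function's semantics (illustrative only, not the actual test added in this PR; assumes a plain SparkSession, with or without the Gluten plugin):

import org.apache.spark.sql.SparkSession

// Build a local session; with Gluten, this would additionally carry the plugin config.
val spark = SparkSession.builder().appName("map_concat-example").master("local[1]").getOrCreate()

// map_concat merges its argument maps into a single map.
val rows = spark.sql("SELECT map_concat(map('a', 1, 'b', 2), map('c', 3)) AS m").collect()
// Expected: a single row containing Map(a -> 1, b -> 2, c -> 3)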

Surbhi-Vijay avatar Mar 24 '24 07:03 Surbhi-Vijay

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

github-actions[bot] avatar Mar 24 '24 07:03 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Mar 24 '24 07:03 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Mar 27 '24 13:03 github-actions[bot]

df.collect() works as expected for the map_concat function (it converts to a Velox plan).

df.show() still executes in Spark due to a fallback. The reason for the fallback is that Spark adds a limit of 20 for df.show and tries to cast all the columns to the string data type. (Code snippet from Dataset.scala)

private[sql] def getRows(
      numRows: Int,
      truncate: Int): Seq[Seq[String]] = {
    val newDf = toDF()
    val castCols = newDf.logicalPlan.output.map { col =>
      // Since binary types in top-level schema fields have a specific format to print,
      // so we do not cast them to strings here.
      if (col.dataType == BinaryType) {
        Column(col)
      } else {
        Column(col).cast(StringType)
      }
    }

The cast is then transformed by Gluten into a CASTTransformer but fails validation, since casting from the map type is not yet supported in Gluten. (Code snippet from SubstraitToVeloxPlanValidator.cc)

switch (input->type()->kind()) {
    case TypeKind::ARRAY:
    case TypeKind::MAP:
    case TypeKind::ROW:
    case TypeKind::VARBINARY:
      LOG_VALIDATION_MSG("Invalid input type in casting: ARRAY/MAP/ROW/VARBINARY.");
      return false;
    case TypeKind::TIMESTAMP: {
      LOG_VALIDATION_MSG("Casting from TIMESTAMP is not supported or has incorrect result.");
      return false;
    }
    default: {
    }
  }
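To reproduce the two paths described above, a minimal sketch (illustrative, assuming a Spark session with the Gluten plugin enabled):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, map, map_concat}

val spark = SparkSession.builder().appName("map_concat-fallback-demo").getOrCreate()

val df = spark.range(1)
  .select(map_concat(map(lit("a"), lit(1)), map(lit("b"), lit(2))).as("m"))

df.collect() // offloaded: the map_concat projection converts to a Velox plan
df.show()    // executes in vanilla Spark: show() casts the map column to string,
             // and that cast currently fails Gluten's validation (see snippet above)

// The cast that show() adds internally can also be inspected directly in the plan:
df.selectExpr("CAST(m AS STRING)").explain()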

Surbhi-Vijay avatar Mar 27 '24 13:03 Surbhi-Vijay

@rui-mo can you please check this patch again? I have added a few comments and rewritten the Spark test case, since no exception is thrown for duplicate keys. Velox already handles them by keeping the last value.

Surbhi-Vijay avatar Mar 27 '24 13:03 Surbhi-Vijay

Hi @Surbhi-Vijay, I notice that MapKeyDedupPolicy.EXCEPTION seems to be the default behavior of Spark, is that right? In your opinion, do we need to customize the map_concat function in Velox to allow configurable behavior?

rui-mo avatar Mar 28 '24 01:03 rui-mo

Hi @Surbhi-Vijay, I notice that MapKeyDedupPolicy.EXCEPTION seems to be the default behavior of Spark, is that right? In your opinion, do we need to customize the map_concat function in Velox to allow configurable behavior?

Yes, MapKeyDedupPolicy.EXCEPTION is the default behavior in Spark. An exception will have to be thrown from Velox if the behavior is to match Spark's. The Velox implementation is based on Presto, where LAST_WIN is the default behavior.

Maybe we can introduce a change in Velox to throw an exception based on a config that will only be used in Gluten.

Please let me know your thoughts.
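For reference, a minimal sketch of the two behaviors on the vanilla Spark side (illustrative only, using the spark.sql.mapKeyDedupPolicy config):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, map, map_concat}

val spark = SparkSession.builder().appName("dedup-policy-demo").getOrCreate()

// Two input maps sharing the key "k".
val dup = map_concat(map(lit("k"), lit(1)), map(lit("k"), lit(2)))
val df  = spark.range(1).select(dup.as("m"))

spark.conf.set("spark.sql.mapKeyDedupPolicy", "EXCEPTION") // Spark's default
// df.collect() // vanilla Spark throws on the duplicate key "k"

spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
df.collect() // keeps the last value: Map(k -> 2), matching Velox's Presto-based behavior

The gap discussed here is that the Velox path keeps the last value regardless of this config.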

Surbhi-Vijay avatar Mar 28 '24 06:03 Surbhi-Vijay

Hi @Surbhi-Vijay, I notice that MapKeyDedupPolicy.EXCEPTION seems to be the default behavior of Spark, is that right? In your opinion, do we need to customize the map_concat function in Velox to allow configurable behavior?

Yes, MapKeyDedupPolicy.EXCEPTION is the default behavior in Spark. An exception will have to be thrown from Velox if the behavior is to match Spark's. The Velox implementation is based on Presto, where LAST_WIN is the default behavior.

Maybe we can introduce a change in Velox to throw an exception based on a config that will only be used in Gluten.

Please let me know your thoughts.

@rui-mo @PHILO-HE can you please check this comment so we can finalize the next set of actions for this patch?

Surbhi-Vijay avatar Apr 04 '24 11:04 Surbhi-Vijay

Maybe we can introduce a change in Velox to throw an exception based on a config that will only be used in Gluten. Please let me know your thoughts.

@Surbhi-Vijay, yes, it is better to introduce a config in Velox to support those two behaviors. I guess Spark's default EXCEPTION behavior is commonly used, so please fix it in Velox first. cc @rui-mo

PHILO-HE avatar Apr 07 '24 07:04 PHILO-HE