
[SPARK-46654][SQL] Make to_csv can correctly display complex types data

panbingkun opened this pull request 1 year ago • 7 comments

What changes were proposed in this pull request?

This PR aims to fix to_csv incorrectly displaying complex type data (ArrayType, MapType, StructType).

Why are the changes needed?

    val rows = new java.util.ArrayList[Row]()
    rows.add(Row(1L, Row(2L, "Alice", Array(100L, 200L, 300L))))

    val schema = StructType(Seq(
      StructField("key", LongType),
      StructField("value",
        StructType(Seq(
          StructField("age", LongType),
          StructField("name", StringType),
          StructField("scores", ArrayType(LongType))))
      )
    ))

    val df = spark.createDataFrame(rows, schema)
    val actual = df.select(to_csv($"value"))
    actual.show(false)

Before:

+--------------------------------------------------------------------------+
|to_csv(value)                                                             |
+--------------------------------------------------------------------------+
|2,Alice,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@99c5e30f|
+--------------------------------------------------------------------------+

After:

+-------------------------+
|to_csv(value)            |
+-------------------------+
|2,Alice,"[100, 200, 300]"|
+-------------------------+
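The quoting in the fixed output follows standard CSV rules. As a minimal sketch using plain Python's stdlib csv module (not Spark), any field that itself contains the delimiter must be quoted:

```python
import csv
import io

# Sketch of why the fixed output quotes the array field: a CSV field that
# contains the delimiter character must itself be wrapped in quotes.
buf = io.StringIO()
csv.writer(buf).writerow([2, "Alice", "[100, 200, 300]"])
line = buf.getvalue().strip()
print(line)  # 2,Alice,"[100, 200, 300]"
```

This matches the quoted third column in the "After" table above.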

Does this PR introduce any user-facing change?

Yes, the output of to_csv for complex types changes as shown above.

How was this patch tested?

  • Add new UT.
  • Pass GA.

Was this patch authored or co-authored using generative AI tooling?

No.

panbingkun avatar Jan 10 '24 13:01 panbingkun

cc @cloud-fan @LuciferYang

panbingkun avatar Jan 11 '24 02:01 panbingkun

@panbingkun Let's re-enable the doctest; this test passed in the pyspark-sql task but failed in the pyspark-connect task before

https://github.com/apache/spark/blob/514ecc6fc183d7222b9dc299af4df328c71966d1/python/pyspark/sql/functions/builtin.py#L15053-L15058

and remove the TODO

https://github.com/apache/spark/blob/514ecc6fc183d7222b9dc299af4df328c71966d1/python/pyspark/sql/functions/builtin.py#L15010-L15013

also cc @HyukjinKwon

LuciferYang avatar Jan 11 '24 03:01 LuciferYang

@panbingkun Let's re-enable the doctest

https://github.com/apache/spark/blob/514ecc6fc183d7222b9dc299af4df328c71966d1/python/pyspark/sql/functions/builtin.py#L15053-L15058

and remove the TODO

https://github.com/apache/spark/blob/514ecc6fc183d7222b9dc299af4df328c71966d1/python/pyspark/sql/functions/builtin.py#L15010-L15013

also cc @HyukjinKwon

Okay, done.

panbingkun avatar Jan 11 '24 03:01 panbingkun

I think the current issue is that the original PySpark can display complex struct types,

bin/pyspark
Python 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022, 15:24:06) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/04 14:14:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/02/04 14:14:41 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
      /_/

Using Python version 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022 15:24:06)
Spark context Web UI available at http://localhost:4041
Spark context available as 'sc' (master = local[*], app id = local-1707027281550).
SparkSession available as 'spark'.
>>> from pyspark.sql import Row, functions as sf
>>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(sf.to_csv(df.value)).show(truncate=False)
+-----------------------+                                                       
|to_csv(value)          |
+-----------------------+
|2,Alice,"[100,200,300]"|
+-----------------------+

but PyConnect and Scala code cannot. Ultimately, we should make them consistent. Should we consider removing the current capability of PySpark (perhaps other scenarios are not supported?)? However, this might be a breaking change.

LuciferYang avatar Feb 04 '24 06:02 LuciferYang

Can the from_csv function parse the output of to_csv() back? Could you add tests or modify existing tests, please?

For this case, the from_csv function currently can't parse the output of to_csv() back. But I think after this, we can improve from_csv so that it can read the output back.
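To make the round-trip gap concrete, here is a sketch using Python's stdlib csv module (not Spark's parser): standard CSV parsing recovers the quoted field only as a flat string, so a from_csv-style reader would still need a second, non-standard parse to rebuild the array.

```python
import csv
import io

# Parse the line produced by the fixed to_csv.
line = '2,Alice,"[100, 200, 300]"'
fields = next(csv.reader(io.StringIO(line)))
print(fields)  # ['2', 'Alice', '[100, 200, 300]']
# The third field comes back as a plain string; nothing in the CSV format
# says it represents an array, which is why from_csv cannot read it back.
```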

panbingkun avatar Feb 04 '24 08:02 panbingkun

  1. Does CSV source support this? I think it's disallowed
  2. If we want to support this it should better be able to read it back. Problem is that CSV spec doesn't have this.

Should probably just disallow this case

I have privately discussed this issue with @LuciferYang , and in response to this situation, we actually have two options:

  • In to_csv, when encountering complex types such as ArrayType, MapType, or StructType, we can raise an error indicating that they are not supported.

    There are two things that may seem strange:

    1. In this case, df.show() can actually display complex types, which looks weird.

    2. Furthermore, look at the implementation of the function to_csv: https://github.com/apache/spark/blob/ed6fe4fccabe8068b3d1e1365e87b51c66908474/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L844 https://github.com/apache/spark/blob/ed6fe4fccabe8068b3d1e1365e87b51c66908474/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala#L245

    It converts a Struct to CSV, yet we don't support nested StructTypes inside it; doesn't that feel a bit strange?

  • Another way is for us to support it. Indeed, the CSV spec doesn't cover this.

panbingkun avatar Feb 04 '24 08:02 panbingkun

I think we should simply disallow this. We could have a legacy conf, but it doesn't make much sense. What does it return for ArrayType and MapType?

HyukjinKwon avatar Feb 05 '24 05:02 HyukjinKwon

I think we should simply disallow this. We could have a legacy conf, but it doesn't make much sense. What does it return for ArrayType and MapType?

Okay, I have overridden checkInputDataTypes for to_csv (StructsToCsv) so that it performs type checking and explicitly indicates that complex type data is not supported.

panbingkun avatar Feb 18 '24 08:02 panbingkun

I think we should simply disallow this. We could have a legacy conf, but it doesn't make much sense. What does it return for ArrayType and MapType?

Should we add the legacy conf? It's hard to confirm whether any users have relied on this erroneous behavior, and directly disallowing it may bring migration costs.

LuciferYang avatar Feb 19 '24 05:02 LuciferYang

Would you please add [PYTHON] to title considering it has user-facing python change? Thanks!

xinrong-meng avatar Feb 22 '24 19:02 xinrong-meng

Would you please add [PYTHON] to title considering it has user-facing python change? Thanks!

Sure, done. 😄

panbingkun avatar Feb 23 '24 01:02 panbingkun

Is this considered a bug fix? Do we need to backport it to branch-3.5/3.4?

LuciferYang avatar Feb 23 '24 03:02 LuciferYang

Is this considered a bug fix? Do we need to backport it to branch-3.5/3.4?

Yea, I think we need to backport it to branch-3.5/3.4.

panbingkun avatar Feb 23 '24 10:02 panbingkun

Friendly ping @HyukjinKwon, when you have time, could you please continue to help review this PR?

panbingkun avatar Mar 08 '24 06:03 panbingkun

+1, LGTM. Merging to master. Thank you, @panbingkun and @HyukjinKwon @LuciferYang @srowen for review.

MaxGekk avatar Mar 13 '24 12:03 MaxGekk

why do we remove a working feature? what's wrong with to_csv generating non-standard but pretty strings for these values?

cloud-fan avatar Mar 20 '24 05:03 cloud-fan

generating non-standard but pretty strings

@cloud-fan It prints "address" of objects, not values. Do you have an example which you worry about?

MaxGekk avatar Mar 20 '24 06:03 MaxGekk

@panbingkun I forgot to ask you. Does to_json have the same issue?

MaxGekk avatar Mar 20 '24 06:03 MaxGekk

to_json should be fine because it has nested type representation

HyukjinKwon avatar Mar 20 '24 07:03 HyukjinKwon

So we make a breaking change simply because we don't like the object address in the string? Can we fix it?

cloud-fan avatar Mar 20 '24 10:03 cloud-fan

So we make a breaking change simply because we don't like the object address in the string? Can we fix it?

  • These types cannot be read back through from_csv after to_csv renders them as non-standard but pretty strings.
  • If a type implements toString, it displays a non-standard but pretty string; otherwise it displays the object address.
  • In addition, the behavior is inconsistent between Scala and Python.

Or do we need to add configuration to bring it back?

panbingkun avatar Mar 20 '24 11:03 panbingkun

to_json should be fine because it has nested type representation

Yeah, there is no standard for nested types in the CSV format, but the JSON format has one.
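A minimal sketch of that contrast, using Python's stdlib json module: JSON defines a representation for nested arrays and objects, so a to_json-style rendering can round-trip losslessly, which plain CSV cannot guarantee.

```python
import json

record = {"age": 2, "name": "Alice", "scores": [100, 200, 300]}
text = json.dumps(record)
# The nested array survives the round trip because JSON specifies
# how arrays and objects are encoded; CSV has no such rule.
assert json.loads(text) == record
print(text)
```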

panbingkun avatar Mar 20 '24 11:03 panbingkun

You never know how people will use Spark, and I'm not sure why it's necessary to ban this use case. Having inconsistency between Scala and Python is bad. Can we fix this problem by defining a pretty string format for struct/array/map in the CSV writer?

cloud-fan avatar Mar 20 '24 14:03 cloud-fan

You never know how people will use Spark and I'm not sure why it's necessary to ban this use case.

I believe it is not too bad to ban such output in any case (this is even insecure). @cloud-fan Do you see any use case for that?

Array:

    val rows = new java.util.ArrayList[Row]()
    rows.add(Row(1L, Row(2L, "Alice", Array(100L, 200L, 300L))))
    val valueSchema = StructType(Seq(
      StructField("age", LongType),
      StructField("name", StringType),
      StructField("scores", ArrayType(LongType))))
    val schema = StructType(Seq(StructField("key", LongType), StructField("value", valueSchema)))
    val df = spark.createDataFrame(rows, schema)
    df.select(to_csv($"value")).show(false)


+--------------------------------------------------------------------------+
|to_csv(value)                                                             |
+--------------------------------------------------------------------------+
|2,Alice,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@e7bb2e54|
+--------------------------------------------------------------------------+

Map:

+------------------------------------------------------------------------+
|to_csv(value)                                                           |
+------------------------------------------------------------------------+
|2,Alice,org.apache.spark.sql.catalyst.expressions.UnsafeMapData@65ab9790|
+------------------------------------------------------------------------+

Struct:

    val rows = new java.util.ArrayList[Row]()
    rows.add(Row(1L, Row(2L, "Alice", Row(100L, 200L, null))))
    val valueSchema = StructType(Seq(
      StructField("age", LongType),
      StructField("name", StringType),
      StructField("scores", StructType(Seq(
        StructField("id1", LongType),
        StructField("id2", LongType),
        StructField("id3", LongType))))))
    val schema = StructType(Seq(StructField("key", LongType), StructField("value", valueSchema)))
    val df = spark.createDataFrame(rows, schema)
    df.select(to_csv($"value")).show(false)

+---------------------+
|to_csv(value)        |
+---------------------+
|2,Alice,"[4,64,c8,0]"|
+---------------------+

MaxGekk avatar Mar 20 '24 15:03 MaxGekk

  • Why is the result displayed by to_csv inconsistent between Scala and Python for this case? Because on the Python side it ultimately uses GenericArrayData, which happens to implement the method toString, so to_csv displays readable text. https://github.com/apache/spark/blob/11247d804cd370aaeb88736a706c587e7f5c83b3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L85

However, on the Scala side it ultimately uses UnsafeArrayData, which unfortunately does not implement toString (it falls back to the default Object.toString), so to_csv ends up displaying the address of the object.

  • During the implementation of this PR, it was able to display non-standard but pretty strings, as in this commit: https://github.com/apache/spark/pull/44665/commits/9695e975f3299556e7c268918ecd51be7a03c157 The disadvantage is that the output cannot currently be read back through from_csv. If the final result of the discussion finds that acceptable, it should be easy to bring this feature back.

  • Another possible compromise is to add a configuration, disabled by default, so that to_csv does not display data of types [Array, Map, Struct ...] as non-standard but pretty strings. If the user enables it, the original behavior is restored.
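The toString difference described above can be sketched by analogy in plain Python (hypothetical classes, not Spark's): a class without its own string conversion falls back to a default address-style representation, just as UnsafeArrayData falls back to Object.toString, while GenericArrayData overrides it.

```python
class UnsafeLike:
    """Analogue of UnsafeArrayData: no custom string conversion,
    so str() falls back to the default address-style representation."""
    def __init__(self, values):
        self.values = values

class GenericLike:
    """Analogue of GenericArrayData: overrides the string conversion."""
    def __init__(self, values):
        self.values = values

    def __str__(self):
        return "[" + ", ".join(str(v) for v in self.values) + "]"

print(str(UnsafeLike([100, 200, 300])))   # e.g. <__main__.UnsafeLike object at 0x10...>
print(str(GenericLike([100, 200, 300])))  # [100, 200, 300]
```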

panbingkun avatar Mar 21 '24 00:03 panbingkun

I'd prefer to fix the inconsistency. Can we use ToPrettyString.eval to generate pretty strings for these types in to_csv?

cloud-fan avatar Mar 21 '24 03:03 cloud-fan

I have also tried using ToPrettyString.eval to generate pretty strings for these types in to_csv in this PR; see this commit: https://github.com/apache/spark/pull/44665/commits/22a7afb6951b9274cc428f71e675df233de74a8a (PS: some details about using ToPrettyString.eval may still need to be confirmed.)

If the final discussion result is to fix their consistency, this approach should also be possible. WDYT @MaxGekk @HyukjinKwon @LuciferYang @xinrong-meng @srowen ?

panbingkun avatar Mar 21 '24 03:03 panbingkun

I don't oppose this resolution, but if to_csv can handle complex data structures, what would our subsequent plans be? Would from_csv also need to support complex data structures to make these paired functions more self-consistent? In the long term, do we need to enhance the read and write capabilities for CSV data sources to support complex data structures?

LuciferYang avatar Mar 21 '24 04:03 LuciferYang

It's OK to not support round-trips; array/struct/map are not part of the CSV standard anyway. I think the principle is to avoid breaking changes if possible. Since to_csv already supports array/struct/map, let's keep it but fix the inconsistency issue.

cloud-fan avatar Mar 21 '24 05:03 cloud-fan

If no objections, I'll revert this tomorrow. We can have a new PR to add pretty string for array/struct/map in to_csv.

cloud-fan avatar Mar 21 '24 14:03 cloud-fan