[SPARK-46654][SQL] Make to_csv correctly display complex types data
What changes were proposed in this pull request?
This PR aims to fix to_csv displaying complex types (ArrayType, MapType, StructType) incorrectly.
Why are the changes needed?
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.to_csv
import org.apache.spark.sql.types._
import spark.implicits._

val rows = new java.util.ArrayList[Row]()
rows.add(Row(1L, Row(2L, "Alice", Array(100L, 200L, 300L))))
val schema = StructType(Seq(
  StructField("key", LongType),
  StructField("value", StructType(Seq(
    StructField("age", LongType),
    StructField("name", StringType),
    StructField("scores", ArrayType(LongType)))))
))
val df = spark.createDataFrame(rows, schema)
df.select(to_csv($"value")).show(false)
Before:
+--------------------------------------------------------------------------+
|to_csv(value) |
+--------------------------------------------------------------------------+
|2,Alice,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@99c5e30f|
+--------------------------------------------------------------------------+
After:
+-------------------------+
|to_csv(value) |
+-------------------------+
|2,Alice,"[100, 200, 300]"|
+-------------------------+
Does this PR introduce any user-facing change?
Yes, the output of to_csv for complex types changes as shown above.
How was this patch tested?
- Add new UT.
- Pass GA.
Was this patch authored or co-authored using generative AI tooling?
No.
cc @cloud-fan @LuciferYang
@panbingkun Let's re-enable the doctest; this test passed in the pyspark-sql task but failed in the pyspark-connect task before
https://github.com/apache/spark/blob/514ecc6fc183d7222b9dc299af4df328c71966d1/python/pyspark/sql/functions/builtin.py#L15053-L15058
and remove the TODO
https://github.com/apache/spark/blob/514ecc6fc183d7222b9dc299af4df328c71966d1/python/pyspark/sql/functions/builtin.py#L15010-L15013
also cc @HyukjinKwon
Okay, done.
I think the current issue is that the original PySpark can already display complex struct types:
bin/pyspark
Python 3.11.1 (v3.11.1:a7a450f84a, Dec 6 2022, 15:24:06) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/04 14:14:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/02/04 14:14:41 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
      /_/
Using Python version 3.11.1 (v3.11.1:a7a450f84a, Dec 6 2022 15:24:06)
Spark context Web UI available at http://localhost:4041
Spark context available as 'sc' (master = local[*], app id = local-1707027281550).
SparkSession available as 'spark'.
>>> from pyspark.sql import Row, functions as sf
>>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(sf.to_csv(df.value)).show(truncate=False)
+-----------------------+
|to_csv(value) |
+-----------------------+
|2,Alice,"[100,200,300]"|
+-----------------------+
but Python Connect and Scala code cannot. Ultimately, we should make them consistent. Should we consider removing this current PySpark capability (perhaps other scenarios are not supported)? However, this might be a breaking change.
Can the from_csv function parse the output of to_csv() back? Could you add tests or modify existing tests, please?
For this case, currently the from_csv function can't parse the output of to_csv() back. But I think after that, we can improve from_csv so that it can read it back.
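A minimal round-trip sketch of the problem, assuming the df built in the PR description above; the valueSchema name is introduced here for illustration (it mirrors the nested struct in that schema) and is not from this comment:

import org.apache.spark.sql.functions.{col, from_csv, to_csv}
import org.apache.spark.sql.types._

// The nested struct type that "value" has in the PR description's example.
val valueSchema = StructType(Seq(
  StructField("age", LongType),
  StructField("name", StringType),
  StructField("scores", ArrayType(LongType))))

// Serialize, then hand from_csv the same nested schema. CSV has no syntax
// for the nested array, so "scores" cannot be reconstructed from the pretty
// string "[100, 200, 300]"; at the time of this discussion this does not
// round-trip (it fails or yields nulls).
val csvDf = df.select(to_csv(col("value")).as("csv"))
val roundTrip = csvDf.select(
  from_csv(col("csv"), valueSchema, Map.empty[String, String]).as("value"))
roundTrip.show(false)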
- Does the CSV source support this? I think it's disallowed.
- If we want to support this, we should also be able to read it back. The problem is that the CSV spec doesn't have this.
We should probably just disallow this case.
I have privately discussed this issue with @LuciferYang, and for this situation we actually have two options:
- In to_csv, when encountering complex types such as ArrayType, MapType, or StructType, we can raise an error indicating that they are not supported. Two things about this may seem strange: 1. In this case, df.show() can actually display complex types, which looks weird. 2. Furthermore, from the registration and implementation of to_csv
https://github.com/apache/spark/blob/ed6fe4fccabe8068b3d1e1365e87b51c66908474/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L844
https://github.com/apache/spark/blob/ed6fe4fccabe8068b3d1e1365e87b51c66908474/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala#L245
the function means converting from Struct to CSV, yet we would not support nested StructTypes inside it; doesn't that feel a bit strange?
- The other way is for us to support it. Indeed, the CSV spec doesn't have this.
I think we should simply disallow this. We could have a legacy conf, but it doesn't make much sense. What does it return for ArrayType and MapType?
Okay, I have overridden checkInputDataTypes for to_csv (StructsToCsv) and performed type checking there, to explicitly indicate that it does not support complex type data.
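A minimal sketch of what such a check could look like, assuming catalyst's TypeCheckResult/DataTypeMismatch API; the helper name and error sub-class here are illustrative, not the actual PR diff:

import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{DataTypeMismatch, TypeCheckSuccess}
import org.apache.spark.sql.types._

// Illustrative helper: reject any top-level field of the input struct whose
// type is complex (ArrayType, MapType, StructType).
def checkStructsToCsvInput(schema: StructType): TypeCheckResult = {
  val complex = schema.fields.collectFirst {
    case StructField(name, _: ArrayType | _: MapType | _: StructType, _, _) => name
  }
  complex match {
    case Some(name) =>
      DataTypeMismatch(
        errorSubClass = "UNSUPPORTED_INPUT_TYPE", // illustrative sub-class
        messageParameters = Map("functionName" -> "to_csv", "fieldName" -> name))
    case None => TypeCheckSuccess
  }
}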
Should we add the legacy conf? It's hard to confirm whether any users have relied on this erroneous behavior, and directly disallowing it may bring migration costs.
Would you please add [PYTHON] to the title, considering it has a user-facing Python change? Thanks!
Sure, done. 😄
Is this considered a bug fix? Do we need to backport it to branch-3.5/3.4
Yea, I think we need to backport it to branch-3.5/3.4.
Friendly ping @HyukjinKwon: when you are not busy, could you please continue to help review this PR?
+1, LGTM. Merging to master. Thank you, @panbingkun and @HyukjinKwon @LuciferYang @srowen for review.
Why do we remove a working feature? What's wrong with to_csv generating non-standard but pretty strings for these values?
@cloud-fan It prints the "address" of objects, not values. Do you have an example you are worried about?
@panbingkun I forgot to ask you. Does to_json have the same issue?
to_json should be fine because it has a nested type representation.
So we make a breaking change simply because we don't like the object address in the string? Can we fix it?
- These types cannot be read back through from_csv after being written as non-standard but pretty strings through to_csv.
- If a type's data implements toString, it can display non-standard but pretty strings; otherwise it displays the object address.
- In addition, the behavior is inconsistent between Scala and Python.
Or do we need to add a configuration to bring it back?
to_json should be fine because it has nested type representation

Yeah, there is no standard for nested types in the csv format, but the json format has a standard.
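For contrast, a quick sketch (not from the PR itself) using the df from the PR description, with the expected output shown as a comment:

import org.apache.spark.sql.functions.{col, to_json}

// JSON has a standard nested representation, so the same struct column
// serializes without ambiguity and can be read back with from_json.
df.select(to_json(col("value"))).show(false)
// expected: {"age":2,"name":"Alice","scores":[100,200,300]}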
You never know how people will use Spark and I'm not sure why it's necessary to ban this use case. Having inconsistency between Scala and Python is bad. Can we fix this problem by defining a pretty string format for struct/array/map in the csv writer?
I believe it is not too bad to ban such output in any case (this is even insecure). @cloud-fan Do you see any use-case for that?
Array:
val valueSchema = StructType(Seq(StructField("age", LongType), StructField("name", StringType), StructField("scores", ArrayType(LongType))))
val schema = StructType(Seq(StructField("key", LongType), StructField("value", valueSchema)))
val df = spark.createDataFrame(rows, schema)
df.select(to_csv($"value")).show(false)
+--------------------------------------------------------------------------+
|to_csv(value) |
+--------------------------------------------------------------------------+
|2,Alice,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@e7bb2e54|
+--------------------------------------------------------------------------+
Map:
+------------------------------------------------------------------------+
|to_csv(value) |
+------------------------------------------------------------------------+
|2,Alice,org.apache.spark.sql.catalyst.expressions.UnsafeMapData@65ab9790|
+------------------------------------------------------------------------+
Struct:
val rows = new java.util.ArrayList[Row]()
rows.add(Row(1L, Row(2L, "Alice", Row(100L, 200L, null))))
val valueSchema = StructType(Seq(StructField("age", LongType), StructField("name", StringType), StructField("scores", StructType(Seq(StructField("id1", LongType), StructField("id2", LongType), StructField("id3", LongType))))))
val schema = StructType(Seq(StructField("key", LongType), StructField("value", valueSchema)))
val df = spark.createDataFrame(rows, schema)
df.select(to_csv($"value")).show(false)
+---------------------+
|to_csv(value) |
+---------------------+
|2,Alice,"[4,64,c8,0]"|
+---------------------+
- Why is the result displayed through to_csv inconsistent between Scala and Python for this case? Because on the Python side this case ultimately uses GenericArrayData, which happens to implement toString, so to_csv displays readable text: https://github.com/apache/spark/blob/11247d804cd370aaeb88736a706c587e7f5c83b3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L85 However, on the Scala side it ultimately uses UnsafeArrayData, which unfortunately does not implement toString (it falls back to the default Object.toString), so to_csv ends up displaying the address of the object (see the sketch after this list).
- During the implementation of this PR, there was a version that displays non-standard but pretty strings, as follows: https://github.com/apache/spark/pull/44665/commits/9695e975f3299556e7c268918ecd51be7a03c157 The disadvantage of this is that the output cannot be read back through from_csv at present. If the final result of the discussion is that this is acceptable, it should be easy to bring this feature back.
- Another possible compromise is to add a configuration (by default, displaying data of types Array, Map, Struct, etc. as non-standard but pretty strings through to_csv would not be supported). If the user enables this configuration, we restore the original behavior.
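A small sketch of the toString difference behind the first point, using catalyst internals; illustrative only, assuming spark-catalyst on the classpath:

import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData
import org.apache.spark.sql.catalyst.util.GenericArrayData

// GenericArrayData overrides toString (the line linked above), so the Python
// path happens to print readable contents.
val generic = new GenericArrayData(Array(100L, 200L, 300L))
println(generic.toString) // [100,200,300]

// UnsafeArrayData inherits Object.toString, so the Scala path prints an
// object address instead.
val unsafe = UnsafeArrayData.fromPrimitiveArray(Array(100L, 200L, 300L))
println(unsafe.toString) // e.g. ...UnsafeArrayData@99c5e30f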
I'd prefer to fix the inconsistency. Can we use ToPrettyString.eval to generate pretty strings for these types in to_csv?
I have also tried using ToPrettyString.eval to generate pretty strings for these types in to_csv in this PR; see:
https://github.com/apache/spark/pull/44665/commits/22a7afb6951b9274cc428f71e675df233de74a8a
(PS: Maybe some details about using ToPrettyString.eval still need to be confirmed.)
If the final result of the discussion is to fix the consistency, this approach should also be possible.
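A rough illustration of the idea, not the PR diff; it assumes catalyst's ToPrettyString expression (the one df.show() uses) and its (child, timeZoneId) constructor:

import org.apache.spark.sql.catalyst.expressions.{Literal, ToPrettyString}
import org.apache.spark.sql.types.{ArrayType, LongType}

// Evaluate ToPrettyString over an array literal: this is the readable form
// to_csv could embed for complex fields instead of Object.toString output.
val arrLit = Literal.create(Seq(100L, 200L, 300L), ArrayType(LongType))
val pretty = ToPrettyString(arrLit, Some("UTC")).eval()
println(pretty) // expected: [100, 200, 300]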
WDYT @MaxGekk @HyukjinKwon @LuciferYang @xinrong-meng @srowen ?
I don't oppose this resolution, but if to_csv can handle complex data structures, what would our subsequent plans be? Would from_csv also need to support complex data structures to make these paired functions more self-consistent? In the long term, do we need to enhance the read and write capabilities of CSV data sources to support complex data structures?
It's ok to not support round-trip; array/struct/map is not CSV standard anyway. I think the principle is to avoid breaking changes if possible. Since to_csv already supports array/struct/map, let's keep it but fix the inconsistency issue.
If no objections, I'll revert this tomorrow. We can have a new PR to add pretty strings for array/struct/map in to_csv.