Read/write timestamps
Reading and writing dataframes with timestamp columns to/from Avro files with Spark gives inconsistent behavior. The point is that the conversion between timestamps and integers follows different logic in Spark vs. spark-avro: spark-avro writes timestamps as long integers in milliseconds since the epoch; however, casting those longs back to "timestamp" in Spark gives wildly wrong dates, because Spark's numeric-to-timestamp cast interprets the value as seconds (dividing by 1000 after you read the data is an obvious workaround).
Example:
scala> val df = Seq("2017-05-05", "2016-04-02").toDF.select($"value".cast("timestamp").alias("ts"))
scala> df.printSchema()
root
|-- ts: timestamp (nullable = true)
scala> df.show(truncate=false)
+---------------------+
|ts |
+---------------------+
|2017-05-05 00:00:00.0|
|2016-04-02 00:00:00.0|
+---------------------+
scala> df.write.avro("test.avro")
scala> spark.read.avro("test.avro").show()
+-------------+
| ts|
+-------------+
|1493935200000|
|1459548000000|
+-------------+
scala> spark.read.avro("test.avro").select($"ts".cast("timestamp")).show(truncate=false)
+----------------------+
| ts|
+----------------------+
|49310-12-03 17:00:00.0|
|48221-03-27 18:00:00.0|
+----------------------+
scala> spark.read.avro("test.avro").select(($"ts"/1000).cast("timestamp")).show(truncate=false)
+------------------------------+
|CAST((ts / 1000) AS TIMESTAMP)|
+------------------------------+
|2017-05-05 00:00:00.0 |
|2016-04-02 00:00:00.0 |
+------------------------------+
I'm wondering if there's any better way to handle this behavior than having to manually divide the timestamp fields by 1000 when you read from avro.
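In the meantime, the division can at least be centralized in a small helper instead of being repeated at every read site. A minimal sketch (the function name and the `tsColumns` parameter are illustrative; you have to know yourself which columns spark-avro wrote as millisecond longs):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Cast epoch-millisecond long columns back to proper timestamps.
// Spark's numeric-to-timestamp cast interprets the value as seconds,
// so divide by 1000 first.
def restoreTimestamps(df: DataFrame, tsColumns: Seq[String]): DataFrame =
  tsColumns.foldLeft(df) { (d, c) =>
    d.withColumn(c, (col(c) / 1000).cast("timestamp"))
  }
```

With the example above you'd call restoreTimestamps(spark.read.avro("test.avro"), Seq("ts")), but that's still a manual workaround, not a fix of the underlying mismatch.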
We're being bitten by this bug, too.
I agree, there should be a better way to handle this. It is very easy to make mistakes with the current handling of timestamps.
Any idea when this bug fix is getting released?
Facing this issue using spark-avro 4.0.0. Please let us know the release date.
Also facing this issue when writing timestamp columns from a Spark dataframe.
It looks like this issue was resolved already in master. Any chance we could get a new release for accessing via maven/sbt?