
Read/write timestamps

Open leotac opened this issue 8 years ago • 6 comments

Reading/writing DataFrames to/from Avro files with timestamp columns in Spark gives inconsistent behavior. The conversion between timestamps and integers follows different logic in Spark than when working with Avro: spark-avro serializes timestamps as long integers in milliseconds, but casting those longs back to "timestamp" in Spark produces wildly wrong dates, because Spark's cast expects epoch seconds, I believe (dividing by 1000 after you read the data is an obvious workaround).
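To make the units mismatch concrete, here is a minimal, Spark-free sketch using `java.time` (the epoch value `1493935200000` is taken from the shell output below; the object name is mine): interpreting an epoch-millisecond value as epoch seconds lands tens of thousands of years in the future, which is exactly what the bad cast does.

```scala
import java.time.Instant

object TimestampUnits {
  def main(args: Array[String]): Unit = {
    val millis = 1493935200000L // what spark-avro writes for 2017-05-05 (epoch milliseconds)

    // Interpreted correctly as milliseconds:
    println(Instant.ofEpochMilli(millis)) // 2017-05-04T22:00:00Z

    // Interpreted as seconds, as CAST(ts AS TIMESTAMP) does with a long:
    println(Instant.ofEpochSecond(millis)) // a date in year 49310
  }
}
```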

Example:

scala> val df = Seq("2017-05-05", "2016-04-02").toDF.select($"value".cast("timestamp").alias("ts"))

scala> df.printSchema()
root
|-- ts: timestamp (nullable = true)

scala> df.show(truncate=false)
+---------------------+
|ts                   |
+---------------------+
|2017-05-05 00:00:00.0|
|2016-04-02 00:00:00.0|
+---------------------+

scala> df.write.avro("test.avro")

scala> spark.read.avro("test.avro").show()
+-------------+
|           ts|
+-------------+
|1493935200000|
|1459548000000|
+-------------+

scala> spark.read.avro("test.avro").select($"ts".cast("timestamp")).show(truncate=false)
+----------------------+
|                    ts|
+----------------------+
|49310-12-03 17:00:00.0|
|48221-03-27 18:00:00.0|
+----------------------+

scala> spark.read.avro("test.avro").select(($"ts"/1000).cast("timestamp")).show(truncate=false)
+------------------------------+
|CAST((ts / 1000) AS TIMESTAMP)|
+------------------------------+
|2017-05-05 00:00:00.0         |
|2016-04-02 00:00:00.0         |
+------------------------------+

I'm wondering if there's any better way to handle this behavior than having to manually divide the timestamp fields by 1000 when you read from avro.
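For what it's worth, the confusion only exists on the Spark SQL side: the JDBC `java.sql.Timestamp` constructor takes epoch milliseconds directly, so the raw long written by spark-avro round-trips in plain JVM code without any division. A minimal sketch (the object name is mine):

```scala
import java.sql.Timestamp

object AvroMillis {
  // java.sql.Timestamp's constructor takes epoch *milliseconds*, so the
  // long written by spark-avro round-trips without dividing by 1000.
  def fromEpochMillis(millis: Long): Timestamp = new Timestamp(millis)

  def main(args: Array[String]): Unit = {
    // 1493935200000L is the value read back from test.avro above.
    println(fromEpochMillis(1493935200000L).toInstant) // 2017-05-04T22:00:00Z
  }
}
```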

leotac avatar May 09 '17 10:05 leotac

We're being bitten by this bug, too.

nmaquet avatar Sep 07 '17 22:09 nmaquet

I agree, there should be a better way to handle this. It is very easy to make mistakes with the current timestamp handling.

ezhaar avatar Sep 18 '17 06:09 ezhaar

Any idea when this bug fix is getting released?

tamizhgeek avatar Feb 13 '18 15:02 tamizhgeek

Facing this issue using spark-avro 4.0.0. Please let us know the release date.

Srinathc avatar Apr 27 '18 09:04 Srinathc

Also facing this issue writing timestamp columns from a spark dataframe.

rondefreitas avatar Sep 19 '18 16:09 rondefreitas

It looks like this issue was resolved already in master. Any chance we could get a new release for accessing via maven/sbt?

rondefreitas avatar Sep 19 '18 17:09 rondefreitas