
Support reading SparseVectors and Vectors

Open abditag2 opened this issue 5 years ago • 3 comments

It would be very useful if petastorm could read these two data types from HDFS.

abditag2 avatar Sep 05 '19 06:09 abditag2

Agreed. We can try to add this, though I'm not sure about the time frame.

selitvin avatar Sep 10 '19 06:09 selitvin

Efficient conversion requires Scala UDFs. Maybe we should add utility methods to Spark so in petastorm we can do the following:

from pyspark.ml.functions import vector_to_dense_array
from pyspark.sql.functions import col

df.select(vector_to_dense_array(col("features")).alias("features"))

This approach doesn't require Scala code in petastorm. Created a Spark JIRA: https://issues.apache.org/jira/browse/SPARK-30154.

cc: @WeichenXu123

mengxr avatar Dec 06 '19 17:12 mengxr

FYI. The UDF was merged into Spark master: https://github.com/apache/spark/pull/26910

mengxr avatar Jan 07 '20 01:01 mengxr
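
For readers landing here later, a minimal sketch of the approach discussed above. It assumes Spark 3.0 or newer, where, as far as I know, the merged function is exposed as `pyspark.ml.functions.vector_to_array` rather than under the `vector_to_dense_array` name proposed earlier in the thread. The helper names `sparse_to_dense` and `vectors_to_arrays` and the default column name `features` are made up for illustration; the first helper only shows what the dense conversion produces, in plain Python.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list of floats.

    Plain-Python illustration only; Spark does this conversion on the JVM side.
    """
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


def vectors_to_arrays(df, column="features"):
    """Replace an MLlib vector column with an array<double> column.

    Hypothetical helper; assumes Spark 3.0+ and a DataFrame `df` whose
    `column` holds pyspark.ml.linalg vectors (sparse or dense).
    """
    # Imported lazily so the pure-Python helper above works without pyspark.
    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    return df.withColumn(column, vector_to_array(col(column)))
```

Once the vector column is a plain array column, the DataFrame can be written to Parquet and read like any other array field, with no Scala code needed in petastorm.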