petastorm
Support reading SparseVectors and Vectors
It would be very useful if petastorm could read these two data types from HDFS.
Agreed. We can try to add this, though I'm not sure about the time frame.
Efficient conversion requires Scala UDFs. Maybe we should add utility methods to Spark so in petastorm we can do the following:
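# vector_to_dense_array is the proposed helper; it is not in Spark yet
from pyspark.ml.functions import vector_to_dense_array
from pyspark.sql.functions import col
df.select(vector_to_dense_array(col("features")).alias("features"))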
This approach doesn't require Scala code in petastorm. Created a Spark JIRA: https://issues.apache.org/jira/browse/SPARK-30154.
cc: @WeichenXu123
FYI. The UDF was merged into Spark master: https://github.com/apache/spark/pull/26910
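For anyone unsure what the conversion discussed above actually does: a sparse vector is stored as a size plus parallel lists of indices and values, and converting it to a dense array just fills in the zeros. Below is a minimal pure-Python sketch of that idea. It is not the Spark implementation, and the name `sparse_to_dense` is hypothetical, used only for illustration.

```python
from typing import List, Sequence


def sparse_to_dense(size: int,
                    indices: Sequence[int],
                    values: Sequence[float]) -> List[float]:
    """Expand a (size, indices, values) sparse vector into a dense list.

    Hypothetical helper for illustration only; not part of Spark or petastorm.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size  # every position not listed in indices stays 0.0
    for i, v in zip(indices, values):
        if not 0 <= i < size:
            raise ValueError(f"index {i} out of range for size {size}")
        dense[i] = float(v)
    return dense


print(sparse_to_dense(4, [1, 3], [2.0, 5.0]))  # [0.0, 2.0, 0.0, 5.0]
```

Doing this row by row in a Python UDF is what makes the conversion slow, which is why a JVM-side function is preferable. As far as I can tell, the function that shipped in Spark 3.0 is exposed as `pyspark.ml.functions.vector_to_array` rather than under the name proposed earlier in this thread, so check the docs for your Spark version.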