Compute unique values (aggregate) over cells and group

Open vpipkt opened this issue 6 years ago • 0 comments

Data scientist feedback suggests adding something like

rf.agg(rf_agg_unique(rf.tile))

It should compute a single Row with an ArrayType column containing every distinct value found across all cells of all tiles in the column.

An alternative might be rf.select(rf_unique(rf.tile)), which would transform each tile into an ArrayType containing only that tile's distinct values.
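To make the difference between the two proposals concrete, here is a minimal pure-Python sketch of the intended semantics. It is not RasterFrames code: the helper names `unique_in_tile` and `agg_unique` are hypothetical, tiles are modelled as plain nested lists, and NoData handling is ignored. Results are sorted, which matches what numpy.unique returns.

```python
from itertools import chain


def unique_in_tile(tile):
    """Hypothetical stand-in for rf_unique: distinct cells of ONE tile."""
    # A tile is modelled as a list of rows; flatten it, then de-duplicate.
    return sorted(set(chain.from_iterable(tile)))


def agg_unique(tiles):
    """Hypothetical stand-in for rf_agg_unique: distinct cells over ALL tiles."""
    seen = set()
    for tile in tiles:
        seen.update(chain.from_iterable(tile))
    return sorted(seen)


# A "tile column" holding two small 2x3 tiles (illustrative values only).
tile_column = [
    [[1, 1, 2], [2, 3, 3]],
    [[3, 4, 4], [4, 4, 7]],
]

per_tile = [unique_in_tile(t) for t in tile_column]  # one array per input row
overall = agg_unique(tile_column)                    # a single array for the column
```

The per-tile form keeps one output row per input row, while the aggregate form collapses the whole column (or each group, if grouped) into a single array.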

One possible workaround for now might be df.select(rf_explode_tiles(tile).alias('cell_vals')).select('cell_vals').distinct()

Other related discussions:

A method for getting the distinct elements of an ArrayType column in Spark: https://stackoverflow.com/questions/37801889/get-the-distinct-elements-of-an-arraytype-column-in-a-spark-dataframe

numpy.unique: https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html

Jul 26 '19 14:07 vpipkt