Missing Columns method

Open OlivierBlanvillain opened this issue 8 years ago • 7 comments

Exhaustive status of the API implemented by frameless.TypedColumn compared to Spark's Column. It's split into two parts: the methods implemented directly on Column, and the methods coming from org.apache.spark.sql.functions._

Column methods

Won't fix:

  • [ ] Column alias(String alias) inherently unsafe
  • [ ] Column apply(Object extraction) inherently unsafe
  • [ ] Column as(String alias) inherently unsafe
  • [ ] Column name(String alias) inherently unsafe

TODO / done:

  • [ ] Column asc_nulls_first()
  • [ ] Column asc_nulls_last()
  • [ ] Column desc_nulls_first()
  • [ ] Column desc_nulls_last()
  • [ ] void explain(boolean extended)
  • [ ] Column eqNullSafe(Object other)
  • [ ] Column getField(String fieldName)
  • [ ] Column getItem(Object key)
  • [ ] Column isNotNull()
  • [ ] Column isNull()
  • [ ] Column like(String literal)
  • [ ] Column over()
  • [ ] Column over(WindowSpec window)
  • [ ] Column rlike(String literal)
  • [x] Column isNaN()
  • [x] Column substr(Column startPos, Column len) (WIP #263)
  • [x] Column substr(int startPos, int len) (WIP #263)
  • [x] Column mod(Object other) (WIP #296)
  • [x] Column between(Object lowerBound, Object upperBound)
  • [x] Column multiply(Object other)
  • [x] Column endsWith(String literal)
  • [x] Column isin(Object... list) (#254)
  • [x] Column startsWith(Column other)
  • [x] Column startsWith(String literal)
  • [x] Column otherwise(Object value)
  • [x] Column when(Column condition, Object value)
  • [x] Column and(Column other)
  • [x] Column contains(Object other)
  • [x] Column or(Column other)
  • [x] Column bitwiseAND(Object other)
  • [x] Column bitwiseOR(Object other)
  • [x] Column bitwiseXOR(Object other)
  • [x] <U> TypedColumn<Object,U> as(Encoder<U> evidence$1) (as cast)
  • [x] Column asc() (as sortAscending)
  • [x] Column cast(DataType to)
  • [x] Column desc() (as sortDescending)
  • [x] Column divide(Object other)
  • [x] boolean equals(Object that) (as ===)
  • [x] Column equalTo(Object other) (as ===)
  • [x] org.apache.spark.sql.catalyst.expressions.Expression expr()
  • [x] Column geq(Object other) (as >=)
  • [x] Column gt(Object other) (as >)
  • [x] Column leq(Object other) (as <=)
  • [x] Column lt(Object other) (as <)
  • [x] Column minus(Object other)
  • [x] Column notEqual(Object other) (as =!=)
  • [x] Column plus(Object other)
  • [x] String toString()

org.apache.spark.sql.functions

TODO / done:

  • [ ] Column col(String colName) to be implemented using shapeless.Witness
  • [ ] Column add_months(Column startDate, int numMonths)
  • [ ] Column array(String colName, String... colNames)
  • [ ] Column asc_nulls_first(String columnName)
  • [ ] Column asc_nulls_last(String columnName)
  • [ ] Column asc(String columnName)
  • [ ] <T> Dataset<T> broadcast(Dataset<T> df)
  • [ ] Column ceil(String columnName)
  • [ ] Column coalesce(Column... e)
  • [ ] Column cume_dist()
  • [ ] Column current_date()
  • [ ] Column current_timestamp()
  • [ ] Column date_add(Column start, int days)
  • [ ] Column date_format(Column dateExpr, String format)
  • [ ] Column date_sub(Column start, int days)
  • [ ] Column datediff(Column end, Column start)
  • [ ] Column dayofmonth(Column e)
  • [ ] Column dayofyear(Column e)
  • [ ] Column decode(Column value, String charset)
  • [ ] Column dense_rank()
  • [ ] Column desc_nulls_first(String columnName)
  • [ ] Column desc_nulls_last(String columnName)
  • [ ] Column desc(String columnName)
  • [ ] Column encode(Column value, String charset)
  • [ ] Column expm1(String columnName)
  • [ ] Column expr(String expr)
  • [ ] Column factorial(Column e)
  • [ ] Column first(String columnName, boolean ignoreNulls)
  • [ ] Column floor(String columnName)
  • [ ] Column format_number(Column x, int d)
  • [ ] Column format_string(String format, Column... arguments)
  • [ ] Column from_json(Column e, StructType schema, scala.collection.immutable.Map<String,String> options)
  • [ ] Column from_unixtime(Column ut, String f)
  • [ ] Column from_utc_timestamp(Column ts, String tz)
  • [ ] Column get_json_object(Column e, String path)
  • [ ] Column greatest(String columnName, String... columnNames)
  • [ ] Column grouping_id(String colName, scala.collection.Seq<String> colNames)
  • [ ] Column grouping(String columnName)
  • [ ] Column hash(Column... cols)
  • [ ] Column hash(scala.collection.Seq<Column> cols)
  • [ ] Column hex(Column column)
  • [ ] Column hour(Column e)
  • [ ] Column initcap(Column e)
  • [ ] Column input_file_name()
  • [ ] Column isnan(Column e)
  • [ ] Column isnull(Column e)
  • [ ] Column json_tuple(Column json, String... fields)
  • [ ] Column lag(String columnName, int offset, Object defaultValue)
  • [ ] Column last_day(Column e)
  • [ ] Column last(String columnName, boolean ignoreNulls)
  • [ ] Column lead(String columnName, int offset, Object defaultValue)
  • [ ] Column least(String columnName, String... columnNames)
  • [ ] Column lit(Object literal)
  • [ ] Column locate(String substr, Column str, int pos)
  • [ ] Column map(Column... cols)
  • [ ] Column map(scala.collection.Seq<Column> cols)
  • [ ] Column md5(Column e)
  • [ ] Column mean(String columnName)
  • [ ] Column minute(Column e)
  • [ ] Column monotonicallyIncreasingId()
  • [ ] Column month(Column e)
  • [ ] Column months_between(Column date1, Column date2)
  • [ ] Column nanvl(Column col1, Column col2)
  • [ ] Column next_day(Column date, String dayOfWeek)
  • [ ] Column ntile(int n)
  • [ ] Column percent_rank()
  • [ ] Column posexplode(Column e)
  • [ ] Column quarter(Column e)
  • [ ] Column radians(String columnName)
  • [ ] Column rand()
  • [ ] Column rand(long seed)
  • [ ] Column randn()
  • [ ] Column randn(long seed)
  • [ ] Column rank()
  • [ ] Column regexp_extract(Column e, String exp, int groupIdx)
  • [ ] Column repeat(Column str, int n)
  • [ ] Column rint(String columnName)
  • [ ] Column round(Column e, int scale)
  • [ ] Column row_number()
  • [ ] Column second(Column e)
  • [ ] Column signum(String columnName)
  • [ ] Column sort_array(Column e, boolean asc)
  • [ ] Column soundex(Column e)
  • [ ] Column spark_partition_id()
  • [ ] Column split(Column str, String pattern)
  • [ ] Column struct(Column... cols)
  • [ ] Column struct(scala.collection.Seq<Column> cols)
  • [ ] Column struct(String colName, scala.collection.Seq<String> colNames)
  • [ ] Column struct(String colName, String... colNames)
  • [ ] Column substring_index(Column str, String delim, int count)
  • [ ] Column sumDistinct(Column e)
  • [ ] Column sumDistinct(String columnName)
  • [ ] Column to_date(Column e)
  • [ ] Column to_json(Column e, Map<String,String> options)
  • [ ] Column to_utc_timestamp(Column ts, String tz)
  • [ ] Column translate(Column src, String matchingString, String replaceString)
  • [ ] Column trunc(Column date, String format)
  • [ ] Column unbase64(Column e)
  • [ ] Column unhex(Column column)
  • [ ] Column unix_timestamp()
  • [ ] Column unix_timestamp(Column s)
  • [ ] Column unix_timestamp(Column s, String p)
  • [ ] Column var_pop(String columnName)
  • [ ] Column var_samp(String columnName)
  • [ ] Column weekofyear(Column e)
  • [ ] Column when(Column condition, Object value)
  • [ ] Column window(Column timeColumn, String windowDuration)
  • [ ] Column window(Column timeColumn, String windowDuration, String slideDuration)
  • [ ] Column window(Column timeColumn, String windowDuration, String slideDuration, String startTime)
  • [ ] Column year(Column e)
  • [x] Column conv(Column num, int fromBase, int toBase)
  • [x] Column degrees(String columnName)
  • [x] Column negate(Column e)
  • [x] Column not(Column e)
  • [x] Column hypot(String leftName, String rightName)
  • [x] Column log(double base, String columnName)
  • [x] Column log(String columnName)
  • [x] Column log10(Column e)
  • [x] Column log1p(Column e)
  • [x] Column log2(Column expr)
  • [x] Column pmod(Column dividend, Column divisor)
  • [x] Column pow(String leftName, String rightName)
  • [x] Column bround(Column e, int scale)
  • [x] Column cbrt(String columnName)
  • [x] Column crc32(Column e)
  • [x] Column exp(String columnName)
  • [x] Column sha1(Column e)
  • [x] Column sha2(Column e, int numBits)
  • [x] Column shiftLeft(Column e, int numBits)
  • [x] Column shiftRight(Column e, int numBits)
  • [x] Column shiftRightUnsigned(Column e, int numBits)
  • [x] Column sqrt(String colName)
  • [x] Column cos(String columnName)
  • [x] Column cosh(String columnName)
  • [x] Column sin(String columnName)
  • [x] Column sinh(String columnName)
  • [x] Column tan(String columnName)
  • [x] Column tanh(String columnName)
  • [x] Column approxCountDistinct(String columnName, double rsd)
  • [x] Column avg(String columnName)
  • [x] Column callUDF(String udfName, Column... cols)
  • [x] Column collect_list(String columnName) (as collectList)
  • [x] Column collect_set(String columnName) (as collectSet)
  • [x] Column corr(String columnName1, String columnName2)
  • [x] Column count(Column e)
  • [x] Column countDistinct(String columnName, String... columnNames)
  • [x] Column explode(Column e)
  • [x] Column first(String columnName)
  • [x] Column last(String columnName)
  • [x] Column max(String columnName)
  • [x] Column min(String columnName)
  • [x] Column size(Column e)
  • [x] Column stddev(String columnName)
  • [x] Column sum(Column e)
  • [x] <RT> UserDefinedFunction udf(scala.Function0<RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$1)
  • [x] <RT,A1> UserDefinedFunction udf(scala.Function1<A1,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$2, scala.reflect.api.TypeTags.TypeTag<A1> evidence$3)
  • [x] UserDefinedFunction udf(Object f, DataType dataType)
  • [x] Column variance(String columnName)
  • [x] Column stddev_pop(String columnName)
  • [x] Column stddev_samp(String columnName)
  • [x] Column covar_pop(String columnName1, String columnName2)
  • [x] Column covar_samp(String columnName1, String columnName2)
  • [x] Column kurtosis(String columnName)
  • [x] Column skewness(String columnName)
  • [x] Column abs(Column e)
  • [x] Column acos(String columnName)
  • [x] Column array_contains(Column column, Object value)
  • [x] Column ascii(Column e)
  • [x] Column asin(String columnName)
  • [x] Column atan(String columnName)
  • [x] Column atan2(String leftName, String rightName)
  • [x] Column base64(Column e)
  • [x] Column bin(String columnName)
  • [x] Column bitwiseNOT(Column e)
  • [x] Column concat_ws(String sep, Column... exprs)
  • [x] Column concat(Column... exprs)
  • [x] Column instr(Column str, String substring)
  • [x] Column length(Column e)
  • [x] Column levenshtein(Column l, Column r)
  • [x] Column lower(Column e)
  • [x] Column lpad(Column str, int len, String pad)
  • [x] Column ltrim(Column e)
  • [x] Column regexp_replace(Column e, String pattern, String replacement)
  • [x] Column reverse(Column str)
  • [x] Column rpad(Column str, int len, String pad)
  • [x] Column rtrim(Column e)
  • [x] Column substring(Column str, int pos, int len)
  • [x] Column trim(Column e)
  • [x] Column upper(Column e)

Aug 08 '17 09:08 OlivierBlanvillain

Hi @OlivierBlanvillain! Thanks for adding this! I think some are not relevant, like anything that has to do with "null". I actually replaced all of those with "isNone" and "isNotNone". I don't remember in which PR I did that, and I am not sure it is merged. I have to take another look.

Aug 23 '17 17:08 imarios

I'm in the groove of implementing those functions anyway. I will most likely start in mid-to-late September. Thanks for listing them all, @OlivierBlanvillain!

Aug 25 '17 09:08 GrafBlutwurst

@imarios It would be interesting to try implementing these without affecting performance. Getting there would be amazing!

@GrafBlutwurst Awesome, hopefully most of them are really straightforward to implement, and with your bivariatePropTemplate/univariatePropTemplate helpers, testing them is trivial!

Aug 25 '17 11:08 OlivierBlanvillain

Just saw this ticket. @OlivierBlanvillain, could you elaborate on why functions.col is unsafe, assuming we provide a version that uses shapeless Witness and verifies that the symbol exists in T?

Sep 21 '17 19:09 iravid

Edited, nice catch! Indeed, there is no reason not to build a Witness-powered version of that! (There are probably functions marked TODO that don't make sense; I didn't spend much time on each entry.)

Sep 21 '17 19:09 OlivierBlanvillain
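
The idea in the exchange above, a `col` that is rejected at compile time when the column does not exist in `T`, can be sketched roughly as follows. This is not frameless code and it does not use shapeless: to stay dependency-free it swaps `Witness` for Scala 2.13+ literal singleton types and uses hand-written evidence instead of derived instances. `Person`, `HasColumn`, `CheckedColumn`, `ColOf` and this `col` are all hypothetical names chosen for illustration.

```scala
object ColumnSketch {
  // Hypothetical record type, standing in for the T of a typed dataset.
  final case class Person(name: String, age: Int)

  // Evidence that record type T has a column named K holding values of type Out.
  // Assumption: a real implementation would derive this (e.g. with shapeless);
  // here the instances are written by hand.
  trait HasColumn[T, K <: String] { type Out }

  object HasColumn {
    type Aux[T, K <: String, V] = HasColumn[T, K] { type Out = V }

    def instance[T, K <: String, V]: Aux[T, K, V] =
      new HasColumn[T, K] { type Out = V }

    // One instance per field of Person; "name" and "age" are literal types.
    implicit val personName: Aux[Person, "name", String] =
      instance[Person, "name", String]
    implicit val personAge: Aux[Person, "age", Int] =
      instance[Person, "age", Int]
  }

  // A column reference that remembers its record type and value type.
  final case class CheckedColumn[T, V](name: String)

  // Partially applied helper so callers write col[Person]("name"):
  // T is given explicitly, K is inferred as the literal type of the argument.
  final class ColOf[T] {
    def apply[K <: String with Singleton](k: K)(
        implicit ev: HasColumn[T, K]
    ): CheckedColumn[T, ev.Out] = CheckedColumn[T, ev.Out](k)
  }

  def col[T]: ColOf[T] = new ColOf[T]
}
```

With these definitions, `ColumnSketch.col[ColumnSketch.Person]("name")` has type `CheckedColumn[Person, String]`, while a misspelled name such as `col[Person]("nmae")` does not compile because no `HasColumn[Person, "nmae"]` instance can be found.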

I can try to handle some of these as a first contribution to the repo :)

Sep 28 '17 06:09 rbraley

Hi guys, I just added my first PR with a typed substr column method. Let me know what you think about it.

Mar 06 '18 20:03 pgabara