
Functions that are in Spark SQL and not in Scala API to be implemented

MrPowers opened this issue

We can use this issue to create a list of all the functions that are in Spark SQL, but not in the Scala API for whatever reason.

Here's the list that @nvander1 sent me so we can get started. He already implemented approx_percentile, so we're on our way!

  • [x] approx_percentile
  • [X] cardinality https://github.com/MrPowers/bebe/pull/24
  • [ ] character_length https://github.com/MrPowers/bebe/pull/25 - in the Scala / Python API it's called length
  • [ ] char_length - not sure if we need this if we have character_length - in the Scala / Python API it's called length
  • [ ] chr https://github.com/MrPowers/bebe/pull/26
  • [X] cot https://github.com/MrPowers/bebe/pull/28
  • [X] count_if https://github.com/MrPowers/bebe/pull/29
  • [ ] count_min_sketch
  • [ ] cube - in the Scala API this is an alternative to the groupBy method of Dataset
  • [ ] current_database - can be obtained from sparkSession.catalog.currentDatabase
  • [ ] date - called to_date
  • [ ] date_part - dayofweek, dayofyear, second, timestamp_seconds, etc
  • [ ] day - dayofmonth
  • [ ] decimal - cast(DecimalType)
  • [ ] div - /
  • [ ] double - cast(DoubleType)
  • [X] e
  • [ ] elt - not needed, can use regular Array indexing to fetch items
  • [ ] every - not needed, can use forall
  • [ ] extract - it's a router to dayofweek, dayofyear, second... and the element to extract can't be an expression.
  • [ ] find_in_set - not needed, can use Scala functions
  • [ ] first_value - alias of first
  • [ ] float - cast(FloatType)
  • [ ] if - not needed, use when
  • [X] ifnull
  • [ ] in - can use array_contains
  • [X] inline
  • [ ] inline_outer
  • [ ] input_file_block_length
  • [ ] input_file_block_start
  • [ ] int - cast(IntegerType)
  • [ ] isnotnull - isNotNull method of Column in the Scala API
  • [ ] java_method
  • [ ] last_value - alias of last
  • [ ] lcase - lower
  • [X] left
  • [X] like - method of column
  • [ ] ln - natural logarithm, in the Scala API it's log
  • [X] make_date -
  • [ ] make_interval - this one was added to Spark 😎
  • [X] make_timestamp
  • [ ] max_by
  • [ ] min_by
  • [ ] mod - alias of %
  • [ ] named_struct
  • [ ] negative - unary minus, i.e. -col (see the sketch after this list)
  • [ ] now - current_timestamp
  • [ ] nullif
  • [ ] nvl - coalesce
  • [X] nvl2
  • [X] octet_length
  • [ ] or - Column operator (|| in Scala, | in Python) and the or method
  • [X] parse_url
  • [X] percentile
  • [ ] pi
  • [ ] position - locate, but maybe create an alternative locate that accepts all parameters as columns
  • [ ] positive
  • [ ] power - alias of pow
  • [ ] printf - format_string
  • [ ] random - alias of rand
  • [ ] reflect - not implemented in the Scala API but is an alias of the SQL function java_method
  • [ ] replace -
  • [X] right -
  • [ ] rlike - method of Column
  • [ ] rollup - in the Scala API this is an alternative to the groupBy method of Dataset
  • [X] sentences -
  • [ ] sha - alias of sha1
  • [ ] shiftleft - shiftLeft
  • [ ] shiftright - shiftRight
  • [ ] shiftrightunsigned - shiftRightUnsigned
  • [ ] sign - doesn't seem like a useful function
  • [ ] smallint - can use cast()
  • [ ] some - array_exists
  • [X] space - returns a string of n spaces, can be done with scala / python
  • [X] stack - PR: https://github.com/MrPowers/bebe/pull/21
  • [ ] std - is this the same as the stddev_pop function?
  • [ ] string - cast(StringType)
  • [ ] str_to_map - in Scala / Python it's much better to build a literal from a Map object
  • [X] substr - substring
  • [ ] timestamp - cast(TimestampType)
  • [ ] tinyint - can use cast
  • [ ] to_unix_timestamp - to_timestamp
  • [ ] typeof - Scala / Python can check the type from the schema
  • [ ] ucase - upper
  • [X] uuid
  • [ ] version - returns the spark version, can be obtained from the spark session
  • [X] weekday
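
Many of the "not needed" / alias entries above map straight onto existing functions or Column methods. Here is a minimal sketch of a few of those mappings, assuming plain Spark 3.x; the column names are placeholders and not part of bebe:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// Placeholder columns, purely for illustration.
val c = col("some_col")
val s = col("some_string_col")

val asDayOfMonth = dayofmonth(col("some_date_col"))        // SQL day(d)
val lowered      = lower(s)                                 // SQL lcase(s)
val filled       = coalesce(s, lit("n/a"))                  // SQL nvl(s, 'n/a')
val flagged      = when(c > 0, "pos").otherwise("non-pos")  // SQL if(c > 0, 'pos', 'non-pos')
val asInt        = c.cast(IntegerType)                      // SQL int(c)
val negated      = -c                                       // SQL negative(c)

// Anything with no Scala counterpart yet can still be reached through expr():
val host = expr("parse_url(url_col, 'HOST')")
```

The expr escape hatch at the end is the usual workaround until a typed wrapper like the ones in this list exists.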

MrPowers commented on Mar 05 '21

A few things I'm not sure about: some of these functions are in the Spark API, but maybe not in the org.apache.spark.sql.functions object.

For example

  • character_length: in functions it's called length
  • count_if: can be implemented with the Spark API, but it could be a good addition to simplify things (a minimal sketch follows this list)
  • cube: if it's the aggregation function, it's already here and I don't see how an external function could improve it.
  • date: the definition of this function is a cast to date. There is also the function to_date, which has the same logic as the SQL version and additionally lets you provide the date format to parse instead of assuming yyyy-MM-dd.
  • date_part: there are a few functions that make this easier, like dayofweek, dayofyear, second, timestamp_seconds, etc
  • day: is called dayofmonth
  • printf: it's called format_string but it forces the first parameter to be a string, not a column
  • substring: it exists in the functions package, but the start and end parameters are only exposed as Int and don't accept columns. Is the idea here to create an alternative version that also accepts start and end as columns?
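
For count_if specifically, here is a minimal sketch of that "implement it with the Spark API" idea, assuming we only need an aggregate Column; the helper name countIf and the column names are made up for illustration:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{count, when}

// count_if composed from existing functions: count ignores the nulls that
// `when` produces for rows where the predicate is false or null.
def countIf(predicate: Column): Column =
  count(when(predicate, true))

// hypothetical usage:
// df.agg(countIf(col("age") > 18))
```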

alfonsorr commented on Mar 08 '21

@alfonsorr - good questions.

Feel free to update the list and just add something like "won't add" to the functions that shouldn't get implemented. This list was generated by a script @nvander1 wrote to compare the SQL functions and the Scala functions, so there are probably some that snuck in there that we don't need.

Yea, definitely want to define substring(str: Column, pos: Column, len: Column): Column. I don't like the functions that take regular Scala types as arguments.
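
One possible shape for that, sketched by wrapping the Catalyst Substring expression directly; whether bebe exposes it exactly like this, and under what name, is up to the eventual PR:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Substring

// Wrap the Catalyst Substring expression so pos and len can be Columns too.
// The name bebeSubstring is just a placeholder, not an existing bebe function.
def bebeSubstring(str: Column, pos: Column, len: Column): Column =
  new Column(Substring(str.expr, pos.expr, len.expr))

// hypothetical usage:
// df.withColumn("sub", bebeSubstring(col("s"), col("start"), col("len")))
```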

Feel free to go ahead and add the "won't add" annotation to any functions you think we don't need.

MrPowers commented on Mar 09 '21

I've checked all the methods and indicated the ones that are already implemented or not useful in the Scala / Python API.

alfonsorr commented on Mar 09 '21